Investigate the diversity of extremely 
complex metagenomic samples 
Qingpeng Zhang 
Department of Computer Science and Engineering 
Michigan State University 
Supervisor: Dr. Titus Brown
Outline 
● Significance and background 
– Metagenomics 
– Microbial diversity measurement 
● Preliminary results 
– A novel method to investigate microbial diversity 
based on an efficient k-mer counting approach 
● Proposed research 
– Prove effectiveness using test data sets 
– Tackle extremely large metagenomic data sets 
generated from extremely complex microbial 
samples
The Great Prairie Grand Challenge 
● How many different species are in a soil sample? What is their abundance distribution? How different are the soil samples from 100-year cultivated Iowa agricultural soil and native Iowa prairie?
● “Grand Challenge”: extremely large data sets from an extremely complex microbial community
– An estimated 50 Tbp of sequencing is needed for an individual gram of soil (Jason Gans, 2005)
– A gram of soil contains approximately a billion microbial cells, holding an estimated 4 petabase pairs of DNA (Jack A. Gilbert, 2013)
– Over a terabase of sequence from Iowa cultivated and uncultivated soil
Metagenomics and Next Generation Sequencing
[Figure: Diversity measurement based on different unit concepts. Units and what is counted: species / individuals; OTUs (97% similarity of 16S sequences) / 16S rRNA sequences; unique k-mers / total k-mers in WGS data (whole genome sequencing reads). Source: Nature Reviews Genetics 6, 805-814, etc.]
Statistics for Diversity Estimation 
● Rarefaction curves
– Largely incapable of dealing with the scale of diversity of the microbial world
● Extrapolation from curves
● Parametric estimators (need the relative species abundance)
● Non-parametric estimators (Chao1, etc.)
– Lower-bound estimators
– Sensitive to the underlying distribution
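As an illustration of the non-parametric estimators mentioned above, here is a minimal sketch of the bias-corrected Chao1 estimator. The function name `chao1` and the example counts are made up for illustration; the formula itself is the standard one, S_obs + F1(F1-1) / (2(F2+1)), where F1 and F2 are the numbers of units seen exactly once and exactly twice.

```python
def chao1(abundances):
    """Bias-corrected Chao1 lower-bound estimate of richness.

    abundances: per-unit observation counts (e.g. reads per OTU).
    """
    counts = [c for c in abundances if c > 0]
    s_obs = len(counts)                    # units actually observed
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical sample: 7 observed units, 3 singletons, 2 doubletons.
print(chao1([10, 5, 2, 2, 1, 1, 1]))  # 7 + 3*2 / (2*3) = 8.0
```

Because the estimate depends only on the rare units, it is a lower bound and shifts noticeably when the underlying abundance distribution changes.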
The Goal of this Project 
● Use whole genome shotgun metagenomic data sets rather than 16S rRNA
– Measure the microbial diversity of samples (alpha-diversity)
– Compare microbial samples (beta-diversity)
● A novel method that is:
– Binning-free
– Assembly-free
– Annotation-free
– Reference-free
● Efficient (memory and time)
– Extremely large shotgun metagenomic data sets (terabytes, etc.)
– Extremely diverse microbial communities (soil, etc.)
Preliminary Results 
● A novel method to investigate microbial 
diversity based on an efficient k-mer counting 
approach 
– Diversity measurement of one sample 
– Comparison of multiple samples
An Approach to Count k-mers Efficiently
● An approach to count k-mers efficiently
– Highly scalable: constant memory consumption, independent of k and data set size
– Probabilistic properties well suited to next generation sequencing data sets
– The tradeoff is a certain false positive rate in counting, caused by collisions
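The properties listed above (fixed memory, possible overcounting from collisions) can be sketched with a small counting structure in the spirit of a Count-Min Sketch. This is an illustrative toy, not khmer's actual implementation: the class name `KmerCountSketch`, the table sizes, and the use of `hashlib.blake2b` are all assumptions, and reverse complements are ignored.

```python
import hashlib


class KmerCountSketch:
    """Fixed-memory approximate k-mer counter (toy illustration)."""

    def __init__(self, k, table_sizes):
        self.k = k
        # Memory is set here and never grows with the number of k-mers.
        self.tables = [[0] * size for size in table_sizes]

    def _slots(self, kmer):
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # One hash value, reduced modulo each (distinct) table size.
        return [h % len(table) for table in self.tables]

    def add_read(self, read):
        for i in range(len(read) - self.k + 1):
            kmer = read[i:i + self.k]
            for table, slot in zip(self.tables, self._slots(kmer)):
                table[slot] += 1

    def count(self, kmer):
        # Collisions can only inflate a slot, so the minimum over
        # tables is never below the true count.
        return min(t[s] for t, s in zip(self.tables, self._slots(kmer)))


sketch = KmerCountSketch(k=4, table_sizes=[1009, 1013, 1019])
sketch.add_read("ACGTACGT")
print(sketch.count("ACGT"))  # at least 2; higher only if collisions occur
```

Taking the minimum across tables is what keeps the error one-sided: a reported count can be too high but never too low.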
What is khmer's advantage?
● Good performance in time and memory usage
● Online counting, updating and retrieving (important for this project!)
● Python API: flexible and extensible
(Zhang, Pell, Canino-Koning, Howe, & Brown, 2013, submitted)
The median k-mer frequency represents the sequencing coverage of a read
Using the median k-mer frequency rather than the average k-mer frequency reduces the influence of sequencing errors
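A small worked example of why the median is the more robust choice. The reads are made up, exact counts from `collections.Counter` stand in for the probabilistic counter, and the helper names are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts_of_read(read, counts, k):
    """Counts of each k-mer of the read, looked up in the counter."""
    return [counts[km] for km in kmers(read, k)]


k = 4
good = "ACGTTGCAAGGCTATC"
bad = "ACGTTGCATGGCTATC"  # same read with one substitution at index 8

# Five error-free copies plus the erroneous read: true coverage is 6.
counts = Counter()
for read in [good] * 5 + [bad]:
    counts.update(kmers(read, k))

values = kmer_counts_of_read(bad, counts, k)
print(median(values))          # 6
print(round(mean(values), 2))  # 4.46
```

The single error creates four k-mers seen only once. They pull the mean down to about 4.46, while the median still reports the true coverage of 6.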
Mapping and k-mer coverage measures correlate for simulated genome data and a real E. coli data set (5 million reads).
(Brown, Howe, Zhang, Pyrkosz, & Brom, 2012)
IGS (informative genomic segment)
Suppose there are Y reads with a sequencing depth of X. In other words, for each of those Y reads, there are X-1 other reads that cover the same DNA segment of the genome from which that read originates. We can therefore estimate that there are Y/X distinct DNA segments with a read coverage of X. We term these distinct DNA segments in a species genome IGSs (informative genomic segments).
An IGS can represent the novel information of a genome.
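The Y/X estimate can be written as a short sketch. It assumes each read already has an integer coverage estimate (for instance its median k-mer count); the function name `igs_spectrum` and the input values are hypothetical.

```python
from collections import Counter


def igs_spectrum(read_coverages):
    """Map each coverage X to the estimated number of IGSs, Y / X,
    where Y is the number of reads whose coverage is X."""
    reads_per_coverage = Counter(c for c in read_coverages if c > 0)
    return {x: y / x for x, y in sorted(reads_per_coverage.items())}


# Hypothetical input: 8 reads at coverage 4, 6 at coverage 2, 3 at coverage 1.
coverages = [4] * 8 + [2] * 6 + [1] * 3
spectrum = igs_spectrum(coverages)
print(spectrum)                # {1: 3.0, 2: 3.0, 4: 2.0}
print(sum(spectrum.values()))  # 8.0 estimated IGSs in total
```

The resulting table of IGS counts per coverage level plays the same role as an OTU abundance table, so the usual diversity statistics can be applied to it.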
N = G / (L - k + 1)
Example: 1,000,000 / (80 - 22 + 1) ≈ 16,949
Borrow statistical methods from OTU-based diversity analysis (rarefaction curves, estimators, etc.)
Compare the contents of multiple metagenomic samples
● How different are two samples?
– If the sequencing coverage in sample B of a read from sample A is > 0, the segment in sample A from which that read originates also exists in sample B
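A minimal sketch of this comparison rule: count k-mers in sample B, then report the fraction of sample A reads whose median k-mer count in B is above zero. Exact counting with `Counter`, the helper names, and the toy reads are assumptions for illustration; every read is assumed to be at least k bases long.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def fraction_present(reads_a, reads_b, k):
    """Fraction of reads in A whose coverage in B (median k-mer count) is > 0."""
    counts_b = count_kmers(reads_b, k)
    present = sum(
        1 for read in reads_a
        if median(counts_b[km] for km in kmers(read, k)) > 0
    )
    return present / len(reads_a)


sample_a = ["ACGTTGCAAGGC", "TTTTCCCCGGGG"]
sample_b = ["ACGTTGCAAGGC", "GATTACAGATTA"]
print(fraction_present(sample_a, sample_b, k=4))  # 0.5
```

The measure is not symmetric: the fraction of A found in B can differ from the fraction of B found in A, so both directions are worth reporting.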
Synthetic data set A (same abundance):
– Sample A: 100 species with 80 common to B
– Sample B: 100 species with 80 common to A
– Sample C: 100 species with 20 common to A/B, and 60 common to D
– Sample D: 100 species with 20 common to A/B, and 60 common to C
Synthetic data set B:
– Sample 1A:
● species IDs: 1,2,3,4,5,6,7,8,9,10 relative abundance: 20:18:16:4:3:2:2:2:2:2
– Sample 1B:
● species IDs: 1,2,3,14,15,16,17,18,19,20 relative abundance: 20:18:16:4:3:2:2:2:2:2
– Sample 1C:
● species IDs: 21,22,3,4,5,6,7,8,9,10 relative abundance: 2:2:2:2:2:3:4:16:18:20
– A and B: high overlap on the individual level, low overlap on the species level
– A and C: high overlap on the species level, low overlap on the individual level
– B and C: low overlap on the species level and low overlap on the individual level
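The three statements about overlap can be checked numerically. The slides do not say which measures were intended, so this sketch assumes the Jaccard index for the species level and Bray-Curtis similarity for the individual level; the variable and function names are made up.

```python
# Species ID -> relative abundance, copied from the sample definitions above.
sample_1a = dict(zip([1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                     [20, 18, 16, 4, 3, 2, 2, 2, 2, 2]))
sample_1b = dict(zip([1, 2, 3, 14, 15, 16, 17, 18, 19, 20],
                     [20, 18, 16, 4, 3, 2, 2, 2, 2, 2]))
sample_1c = dict(zip([21, 22, 3, 4, 5, 6, 7, 8, 9, 10],
                     [2, 2, 2, 2, 2, 3, 4, 16, 18, 20]))


def species_overlap(a, b):
    """Jaccard index on species presence (assumed species-level measure)."""
    return len(a.keys() & b.keys()) / len(a.keys() | b.keys())


def individual_overlap(a, b):
    """Bray-Curtis similarity on abundances (assumed individual-level measure)."""
    shared = sum(min(a[s], b[s]) for s in a.keys() & b.keys())
    return 2 * shared / (sum(a.values()) + sum(b.values()))


pairs = {"A-B": (sample_1a, sample_1b),
         "A-C": (sample_1a, sample_1c),
         "B-C": (sample_1b, sample_1c)}
for name, (x, y) in pairs.items():
    print(name, round(species_overlap(x, y), 3), round(individual_overlap(x, y), 3))
# A-B 0.176 0.761
# A-C 0.667 0.225
# B-C 0.053 0.028
```

Under these assumed measures the numbers match the stated design: A and B share few species but most individuals, A and C share most species but few individuals, and B and C share little of either.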
What's Next 
● Refine the methods
– Sequencing errors are still a problem.
– More statistics on IGSs (informative genomic segments)
● Prove effectiveness using test data sets
– Simulated data sets based on real microbial genomes
– MetaHIT: 124 metagenomic samples from 99 healthy people and 25 patients with inflammatory bowel disease (IBD). Each sample has on average 65 ± 21 million reads.
● Integrate the functions into the khmer package
The Great Prairie Grand Challenge 
● How many different species are in a soil sample? What is their abundance distribution? How different are the soil samples from 100-year cultivated Iowa agricultural soil and native Iowa prairie?
● “Grand Challenge”: extremely large data sets from an extremely complex microbial community
– Over a terabase of sequence from Iowa cultivated and uncultivated soil
– We should be prepared to face technical challenges when dealing with such large-scale data sets (storage, computing resources, HPCC, etc.)
– A preliminary result: the majority of the prairie reads (50%) are present in the corn sample with a coverage > 0
Acknowledgement 
● Dr. Titus Brown 
● Lab members of GED 
● Dr. Jason Pell 
● Dr. Adina Howe 
● Eric McDonald 
● Everybody in this room

 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 

Comprehensive Exam Slides 11/13/2013

  • 1. Investigate the diversity of extremely complex metagenomic samples Qingpeng Zhang Department of Computer Science and Engineering Michigan State University Supervisor: Dr. Titus Brown
  • 2. Outline ● Significance and background – Metagenomics – Microbial diversity measurement ● Preliminary results – A novel method to investigate microbial diversity based on an efficient k-mer counting approach ● Proposed research – Prove effectiveness using test data sets – Tackle extremely large metagenomic data sets generated from extremely complex microbial samples
  • 3. The Great Prairie Grand Challenge ● How many different species are in a soil sample? What is their abundance distribution? How different are the soil samples from 100-year cultivated Iowa agricultural soil and native Iowa prairie? ● “Grand Challenge” - extremely large data sets from an extremely complex microbial community – An estimated 50 Tbp of sequence is needed for an individual gram of soil (Jason Gans, 2005) – In a gram of soil, there are approximately a billion microbial cells, containing an estimated 4 petabase pairs of DNA (Jack A. Gilbert, 2013) – Over a terabase of sequence from Iowa cultivated and uncultivated soil
  • 4. Metagenomics and Next Generation Sequencing
  • 5. Diversity measurement based on different unit concepts ● Species and individuals ● OTUs (97% similarity of 16S sequences) and 16S rRNA sequences ● Unique k-mers and total k-mers in WGS data (whole genome sequencing reads) (Nature Reviews Genetics 6, 805-814, etc.)
  • 6. Statistics for Diversity Estimation ● rarefaction curve – Quite incapable of dealing with the scale of diversity of the microbial world ● extrapolation from curves ● parametric estimators (need relative species abundance) ● non-parametric estimators (Chao1, etc.) – Lower bound estimator – Sensitive to underlying distribution
  • 7. The Goal of this Project ● Using whole genome shotgun metagenomic data sets rather than 16S rRNA – Measuring the microbial diversity of samples (alpha-diversity) – Comparing microbial samples (beta-diversity) ● A novel method that is: – Binning-free – Assembly-free – Annotation-free – Reference-free ● Efficient (memory and time) – extremely large shotgun metagenomic data sets (terabytes, etc.) – extremely diverse microbial communities (soil, etc.)
  • 8. Diversity measurement based on different unit concepts ● Species and individuals ● OTUs (97% similarity of 16S sequences) and 16S rRNA sequences ● Unique k-mers and total k-mers in WGS data (whole genome sequencing reads) (Nature Reviews Genetics 6, 805-814, etc.)
  • 9. Preliminary Results ● A novel method to investigate microbial diversity based on an efficient k-mer counting approach – Diversity measurement of one sample – Comparison of multiple samples
  • 10. An Approach to Count k-mers Efficiently ● Highly scalable: constant memory consumption, independent of k and dataset size ● Probabilistic properties well suited to next-generation sequencing datasets ● Trade-off: a certain false positive rate in counting, caused by collisions
  • 11. What is khmer's advantage? ● Good performance in time/memory usage ● Online counting, updating and retrieving (important for this project!) ● Python API – flexible and expandable (Zhang, Pell, Canino-Koning, Howe, & Brown, 2013, submitted)
  • 12. The median k-mer frequency of a read represents the sequencing coverage of that read. Using the median k-mer frequency rather than the average k-mer frequency decreases the influence of sequencing errors.
  • 13. Mapping and k-mer coverage measures correlate for simulated genome data and a real E. coli data set (5m reads). (Brown, Howe, Zhang, Pyrkosz, & Brom, 2012)
  • 14. IGS: Suppose there are Y reads with a sequencing depth of X. In other words, for each of those Y reads, there are X-1 other reads that cover the same DNA segment in the genome from which that read originates. So we can estimate that there are Y/X distinct DNA segments with a read coverage of X. We term these distinct DNA segments in a species genome IGSs (informative genomic segments). An IGS can represent the novel information of a genome.
  • 15. N = G/(L-k+1), e.g. 1000000/(80-22+1). Borrowing statistical methods from OTU-based diversity analysis (rarefaction curve, estimators, etc.)
  • 16. Compare the contents of multiple metagenomic samples ● How different are two samples? – If the sequencing coverage in sample B of a read from sample A is > 0, the segment in sample A from which that read originates also exists in sample B
  • 17. Synthetic dataset A (same abundance): – SampleA: 100 species with 80 common to B – SampleB: 100 species with 80 common to A – SampleC: 100 species with 20 common to A/B, and 60 common to D – SampleD: 100 species with 20 common to A/B, and 60 common to C
  • 18. Synthetic dataset B: – Sample1A: ● species IDs: 1,2,3,4,5,6,7,8,9,10 relative abundance: 20:18:16:4:3:2:2:2:2:2 – Sample1B: ● species IDs: 1,2,3,14,15,16,17,18,19,20 relative abundance: 20:18:16:4:3:2:2:2:2:2 – Sample1C: ● species IDs: 21,22,3,4,5,6,7,8,9,10 relative abundance: 2:2:2:2:2:3:4:16:18:20 – A and B: high overlap on individual level, low overlap on species level – A and C: high overlap on species level, low overlap on individual level – B and C: low overlap on species level and low overlap on individual level
  • 19. What's Next ● Refine the methods – Errors are still a problem. – More statistics of IGSs (informative genomic segments) ● Prove effectiveness using test data sets – Simulated data sets based on real microbial genomes – MetaHIT: 124 metagenomic samples from 99 healthy people and 25 patients with inflammatory bowel disease (IBD) syndrome. Each sample has on average 65 ± 21 million reads. ● Integrate functions into the khmer package
  • 20. The Great Prairie Grand Challenge ● How many different species are in a soil sample? What is their abundance distribution? How different are the soil samples from 100-year cultivated Iowa agricultural soil and native Iowa prairie? ● “Grand Challenge” - extremely large data sets from an extremely complex microbial community – Over a terabase of sequence from Iowa cultivated and uncultivated soil – Should be prepared to face technical challenges when dealing with such large-scale data sets (storage, computing, resources, HPCC, etc.) – A preliminary result: the majority of the prairie reads (50%) are present in the corn with a coverage of > 0
  • 21. Acknowledgement ● Dr. Titus Brown ● Lab members of GED ● Dr. Jason Pell ● Dr. Adina Howe ● Eric McDonald ● Everybody in this room
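
Slides 10 and 12 describe counting k-mers in constant memory with some false positives from collisions, and using the median k-mer frequency of a read as its coverage. The following is a minimal sketch of that idea, not khmer's actual implementation or API. The class name `KmerCounter`, its methods, the table sizes, and the salted `blake2b` hashing are all assumptions chosen for illustration, and reverse complements are ignored.

```python
import hashlib
from statistics import median


class KmerCounter:
    """Hypothetical fixed-memory approximate k-mer counter (Count-Min Sketch style)."""

    def __init__(self, k, table_size=100_003, n_tables=4):
        self.k = k
        self.table_size = table_size
        # Memory is fixed up front: n_tables * table_size counters,
        # regardless of k or of how many reads are consumed later.
        self.tables = [[0] * table_size for _ in range(n_tables)]

    def _slots(self, kmer):
        # One slot per table, derived from a differently salted hash of the k-mer.
        for i in range(len(self.tables)):
            digest = hashlib.blake2b(kmer.encode(), salt=bytes([i])).digest()
            yield i, int.from_bytes(digest[:8], "big") % self.table_size

    def kmers(self, read):
        return [read[i:i + self.k] for i in range(len(read) - self.k + 1)]

    def consume(self, read):
        # Online update: counts can be queried at any point between reads.
        for kmer in self.kmers(read):
            for i, slot in self._slots(kmer):
                self.tables[i][slot] += 1

    def count(self, kmer):
        # Collisions can only inflate a counter, so the minimum over all
        # tables is the tightest estimate; it never undercounts.
        return min(self.tables[i][slot] for i, slot in self._slots(kmer))

    def median_coverage(self, read):
        # Median k-mer count of the read, used as its estimated coverage.
        return median(self.count(kmer) for kmer in self.kmers(read))


read = "ACGTACGGTC"
counter = KmerCounter(k=4)
for _ in range(3):          # pretend the same segment was sequenced three times
    counter.consume(read)
coverage = counter.median_coverage(read)
print(coverage)
```

The median is taken rather than the mean because a single sequencing error changes up to k of a read's k-mers into rare ones, which drags the mean down but usually leaves the median unchanged.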
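
The IGS estimate on slide 14 (Y reads at coverage X give about Y/X distinct segments) and the expression on slide 15 can be worked through in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, the coverage values are made up, and G, L and k are read here as genome size, read length and k-mer size, which the slide does not spell out.

```python
from collections import Counter


def estimate_igs(read_coverages):
    """Estimate the number of IGSs from per-read coverage values.

    Y reads that all have coverage X contribute Y / X distinct segments.
    """
    reads_by_coverage = Counter(read_coverages)
    return sum(
        n_reads / coverage
        for coverage, n_reads in reads_by_coverage.items()
        if coverage > 0
    )


def expected_igs(genome_size, read_length, k):
    """N = G / (L - k + 1), with the symbols interpreted as assumed above."""
    return genome_size / (read_length - k + 1)


# Hypothetical sample: 1000 reads at coverage 10 and 50 reads at coverage 2.
coverages = [10] * 1000 + [2] * 50
n_igs = estimate_igs(coverages)            # 1000/10 + 50/2
n_expected = expected_igs(1_000_000, 80, 22)
print(n_igs, round(n_expected))
```

Once IGS counts and their abundances are available, they can play the role that OTUs play in the rarefaction curves and estimators mentioned on slide 15.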
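
The comparison rule on slide 16 (a read from sample A is considered present in sample B if its coverage in B is greater than 0) could be sketched as follows. For readability this uses exact counting with `collections.Counter` rather than a probabilistic counter, and the function names and toy reads are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Exact k-mer counts for a list of reads (stand-in for an approximate counter)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def fraction_present(reads_a, reads_b, k):
    """Fraction of reads from sample A whose median k-mer coverage in B is > 0."""
    counts_b = count_kmers(reads_b, k)
    present = sum(
        1
        for read in reads_a
        if median(counts_b[kmer] for kmer in kmers(read, k)) > 0
    )
    return present / len(reads_a)


# Toy data: the first read of A is fully contained in B, the second is absent.
reads_a = ["ACGTACGT", "TTTTGGGG"]
reads_b = ["ACGTACGTAA"]
shared = fraction_present(reads_a, reads_b, k=4)
print(shared)
```

Applied in both directions (A against B and B against A), this gives the kind of figure quoted on slide 20 for the share of prairie reads that are present in the corn sample.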
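
The distinction slide 18 draws between overlap on the species level and on the individual level can be checked directly on synthetic dataset B. The slides do not say which overlap measures are used, so the two below (a Jaccard index on species IDs and a shared-abundance fraction) are assumptions picked only to illustrate the contrast.

```python
def species_overlap(a, b):
    """Jaccard index on species IDs: shared species / all species seen."""
    return len(a.keys() & b.keys()) / len(a.keys() | b.keys())


def individual_overlap(a, b):
    """Shared abundance: sum of per-species minima over the smaller sample total."""
    shared = a.keys() & b.keys()
    smaller_total = min(sum(a.values()), sum(b.values()))
    return sum(min(a[s], b[s]) for s in shared) / smaller_total


# Species ID -> relative abundance, as listed on slide 18.
abundance = [20, 18, 16, 4, 3, 2, 2, 2, 2, 2]
sample_1a = dict(zip([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], abundance))
sample_1b = dict(zip([1, 2, 3, 14, 15, 16, 17, 18, 19, 20], abundance))
sample_1c = dict(zip([21, 22, 3, 4, 5, 6, 7, 8, 9, 10],
                     [2, 2, 2, 2, 2, 3, 4, 16, 18, 20]))

pairs = {
    "A-B": (sample_1a, sample_1b),
    "A-C": (sample_1a, sample_1c),
    "B-C": (sample_1b, sample_1c),
}
for name, (x, y) in pairs.items():
    print(name, round(species_overlap(x, y), 2), round(individual_overlap(x, y), 2))
```

With these measures, A and B share few species but most individuals, A and C share most species but few individuals, and B and C are low on both, which matches the pattern stated on the slide.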