Introduction Methods Results Conclusions
Sequence Based Identity by Descent Detection
Joint work with Jasmine Nirody & Yun S. Song
@ University of California, Berkeley
Paula Tataru
Mols Meeting
August 15, 2013
Sequence Based IBD detection 1
[Figure: two aligned sequences, GACTGAGC and GACTGAAC, differing at one site]
Identity By Descent (IBD) tracts
DNA segments that are inherited from a common ancestor
recombination disrupts them
expected length depends on the TMRCA
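As a rough illustration of the last point, assume a constant recombination rate r per base per generation and a pair of sequences whose TMRCA at a focal site is t generations. The two lineages together span 2t meioses, so if every recombination on either lineage is taken to end the tract, the distance from the focal site to the nearest breakpoint on each side is approximately exponential with rate 2tr. The function names and the example TMRCA values are assumptions for this sketch; r = 10⁻⁸ is the value listed in the data simulation settings.

```python
def expected_tract_length_bp(tmrca_generations: float, recomb_rate: float = 1e-8) -> float:
    """Expected length (bp) of the IBD tract covering a focal site.

    Simplifying assumptions: constant recombination rate per base per
    generation, and every recombination on either lineage ends the tract.
    The distance to the nearest breakpoint on each side of the site is then
    exponential with rate 2 * t * r, so the two sides together average
    2 / (2 * t * r) = 1 / (t * r) bases.
    """
    if tmrca_generations <= 0 or recomb_rate <= 0:
        raise ValueError("TMRCA and recombination rate must be positive")
    return 1.0 / (tmrca_generations * recomb_rate)


def bp_to_centimorgans(length_bp: float, recomb_rate: float = 1e-8) -> float:
    """Convert a physical length to cM under a uniform recombination rate."""
    return length_bp * recomb_rate * 100.0


for t in (10, 100, 1000, 10000):
    length = expected_tract_length_bp(t)
    print(f"TMRCA {t:>6} generations: ~{bp_to_centimorgans(length):.3f} cM")
```

Under these assumptions a TMRCA of about 100 generations corresponds to tracts of roughly 1 cM, and older ancestry gives proportionally shorter tracts.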
IBD is fundamental in genetics
selection
phasing
imputation
association studies
Current methods use population-wide SNP genotype data
work best for recent IBD (tracts longer than 1 cM)
use different IBD definitions
pairwise SNPs disrupt predicted IBD tracts
are either probabilistic or deterministic
GERMLINE
Gusev et al., 2009
Identical By State (IBS)
Deterministic
Linear in number of samples
Phased SNP data
Sliding window to find IBS
Allows for genotyping error
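The window-based idea can be sketched as follows. This is a simplified illustration, not GERMLINE's actual implementation: it hashes fixed, non-overlapping windows of phased haplotypes so that exact matches are found by dictionary lookup instead of comparing every pair at every marker, then joins matches in consecutive windows. It does not allow for genotyping error, and the function name, window size, and toy data are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def ibs_segments(haplotypes, window=4, min_windows=2):
    """Return {(i, j): [(start, end), ...]} of exact IBS runs, in marker indices.

    haplotypes: list of equal-length strings of phased alleles, e.g. "0110...".
    A run is reported when at least `min_windows` consecutive windows match.
    """
    n_markers = len(haplotypes[0])
    n_windows = n_markers // window
    run_start = {}                      # pair -> first window of the current run
    segments = defaultdict(list)

    for w in range(n_windows + 1):      # one extra pass to close open runs
        matching = set()
        if w < n_windows:
            buckets = defaultdict(list)
            for idx, hap in enumerate(haplotypes):
                buckets[hap[w * window:(w + 1) * window]].append(idx)
            for members in buckets.values():
                matching.update(combinations(members, 2))
        # Close runs of pairs that stopped matching in this window.
        for pair in [p for p in run_start if p not in matching]:
            start = run_start.pop(pair)
            if w - start >= min_windows:
                segments[pair].append((start * window, w * window))
        # Open runs for pairs that just started matching.
        for pair in matching:
            run_start.setdefault(pair, w)
    return dict(segments)


# Toy phased haplotypes (hypothetical data).
haps = [
    "010011010110",
    "010011011001",
    "110011010110",
]
print(ibs_segments(haps))
```

Bucketing haplotypes by window content is what keeps the per-window work close to linear in the number of samples when few haplotypes share a window; enumerating the pairs inside a large bucket is still quadratic in the bucket size.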
FastIBD
Browning & Browning, 2011
IBD inside IBS
Deterministic
Quadratic in number of samples
Unphased SNP data; phasing done with Beagle
Accounts for phase uncertainty and background levels of LD
Models shared haplotype frequencies
RefinedIBD
Browning & Browning, 2013
IBD inside IBS
Probabilistic
Quadratic in number of samples
Very similar to FastIBD
Identifies candidate IBD segments using GERMLINE
Filters candidates based on a probabilistic model
SMCSD
Paul et al., 2011, Sheehan et al., 2013
same TMRCA
Probabilistic: HMM
Quadratic in number of samples
Phased sequence data
Based on coalescence theory
Predicts recombination breakpoints that change TMRCA
SMCSD in a nutshell
Designed to estimate demographic history
partition time in discrete intervals
assume constant population size per time interval
use EM to train model
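To make the piecewise-constant assumption concrete, the sketch below computes the probability that two lineages have coalesced by a given time when the population size is constant within each discrete interval. It assumes a diploid population, so that the pairwise coalescence rate in an interval of size N is 1/(2N) per generation. The interval boundaries and sizes are made-up values, not the demography used in this work.

```python
import math


def cumulative_coalescence_prob(t, boundaries, sizes):
    """P(TMRCA <= t) for two lineages under piecewise-constant population size.

    boundaries: increasing interval start times in generations, beginning at 0.
    sizes: diploid population size in each interval; the last interval is
           open-ended. The pairwise coalescence rate is 1 / (2 * N).
    """
    if len(boundaries) != len(sizes) or boundaries[0] != 0:
        raise ValueError("need one size per interval, with the first starting at 0")
    hazard = 0.0
    for k, (start, size) in enumerate(zip(boundaries, sizes)):
        if t <= start:
            break
        end = boundaries[k + 1] if k + 1 < len(boundaries) else math.inf
        hazard += (min(t, end) - start) / (2.0 * size)
    return 1.0 - math.exp(-hazard)


# Hypothetical three-epoch history: recent expansion, bottleneck, ancestral size.
boundaries = [0, 200, 2000]
sizes = [100_000, 2_000, 10_000]
for t in (100, 1000, 5000):
    print(t, round(cumulative_coalescence_prob(t, boundaries, sizes), 4))
```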
Use decoding to infer IBD
assume demography given
run posterior decoding
changes of TMRCA reveal recombination breakpoints
use posterior probabilities to trim tracts’ endpoints
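The decoding step can be illustrated with a generic forward-backward pass. This is a toy HMM, not the SMCSD model: the hidden states stand for discretized TMRCA intervals, observations are 0 or 1 for a match or mismatch between the two sequences at a site, and the transition and emission probabilities are invented for the example. Sites where the most probable state changes are read as putative recombination breakpoints.

```python
def posterior_decode(obs, init, trans, emit):
    """Scaled forward-backward; returns per-site posterior state probabilities."""
    n_states, n_sites = len(init), len(obs)
    fwd, scale = [], []
    for pos, o in enumerate(obs):
        if pos == 0:
            row = [init[s] * emit[s][o] for s in range(n_states)]
        else:
            prev = fwd[-1]
            row = [emit[s][o] * sum(prev[q] * trans[q][s] for q in range(n_states))
                   for s in range(n_states)]
        total = sum(row)
        scale.append(total)
        fwd.append([x / total for x in row])

    bwd = [[1.0] * n_states for _ in range(n_sites)]
    for pos in range(n_sites - 2, -1, -1):
        o_next = obs[pos + 1]
        bwd[pos] = [sum(trans[s][q] * emit[q][o_next] * bwd[pos + 1][q]
                        for q in range(n_states)) / scale[pos + 1]
                    for s in range(n_states)]

    posterior = []
    for f_row, b_row in zip(fwd, bwd):
        row = [f * b for f, b in zip(f_row, b_row)]
        norm = sum(row)
        posterior.append([x / norm for x in row])
    return posterior


def breakpoints(posterior):
    """Sites where the most probable state differs from the previous site."""
    best = [max(range(len(row)), key=row.__getitem__) for row in posterior]
    return [i for i in range(1, len(best)) if best[i] != best[i - 1]]


# Toy parameters (assumptions): state 0 = recent TMRCA, state 1 = old TMRCA.
init = [0.5, 0.5]
trans = [[0.95, 0.05], [0.05, 0.95]]
emit = [[0.99, 0.01], [0.70, 0.30]]   # emit[state][obs], obs 1 = mismatch
obs = [0] * 20 + [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]

post = posterior_decode(obs, init, trans, emit)
print(breakpoints(post))
```

The per-site posteriors returned here are also the kind of quantity that could be thresholded to trim uncertain tract endpoints, although the trimming rule itself is not specified in the slides.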
Data simulation
Simulate trees in ms
µ = 1.25 × 10⁻⁸
r = 10⁻⁸
sequences of length 10 Mb
10 sequences (45 pairs)
10 replicates
Collect recombination breakpoints from ms output
Reconstruct pairwise IBD tracts
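A minimal sketch of the last step, assuming the local trees have already been reduced, for one pair of sequences, to a list of (segment length, pairwise TMRCA) entries with one entry per non-recombining segment. Adjacent segments are merged when the pairwise TMRCA is unchanged, in line with the "same TMRCA" definition listed for SMCSD. The input format and function name are assumptions; parsing the ms output is not shown.

```python
from math import comb


def pairwise_ibd_tracts(segments, tol=1e-12):
    """Merge consecutive segments with equal TMRCA into IBD tracts.

    segments: iterable of (length_bp, tmrca) for one pair of sequences,
              ordered along the chromosome.
    Returns a list of (start_bp, end_bp, tmrca) with half-open coordinates.
    """
    tracts = []
    position = 0
    for length, tmrca in segments:
        if tracts and abs(tracts[-1][2] - tmrca) <= tol:
            start, _, t = tracts[-1]
            tracts[-1] = (start, position + length, t)   # extend current tract
        else:
            tracts.append((position, position + length, tmrca))
        position += length
    return tracts


# Toy input (hypothetical): the first breakpoint does not change this pair's TMRCA.
segments = [(1200, 0.8), (300, 0.8), (2500, 1.7), (900, 0.4)]
print(pairwise_ibd_tracts(segments))
print(comb(10, 2), "pairs among 10 sequences")
```

Breakpoints that leave a pair's TMRCA unchanged do not split that pair's tract, which is why the merge is done separately for each of the 45 pairs.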
Human Population
Tennessen et al., 2012, Simons et al., 2013
[Figure: PopSize (log scale, 10³ to 10⁶) and CumProb (0 to 1) against generations back in time (0 to 6000)]
European Population
[Figure: a summary panel of Recall, Precision, and F-score (0 to 1); panels of TruePositive, FalseNegative, FalsePositive, Power, Under-prediction, and Over-prediction (0 to 1) against tract length (0.1 to 1.0 cM); PopSize (log scale, 10³ to 10⁶) against generations back in time (0 to 6000). Methods compared: GERMLINE, FastIBD, RefinedIBD, SMCSD-W, SMCSD-T]
African Population
[Figure: a summary panel of Recall, Precision, and F-score (0 to 1); panels of TruePositive, FalseNegative, FalsePositive, Power, Under-prediction, and Over-prediction (0 to 1) against tract length (0.1 to roughly 1.1 cM); PopSize (log scale, 10³ to 10⁶) against generations back in time (0 to 6000). Methods compared: GERMLINE, FastIBD, RefinedIBD, SMCSD-W, SMCSD-T]
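The accuracy measures named in the panels can be computed per site. The sketch below is one plausible way to do so, assuming true and predicted tracts are given as half-open (start, end) intervals in base pairs for a single pair of sequences, and that a site counts as a true positive when it lies in both a true and a predicted tract. The exact definitions used for the figures, including the binning by tract length, are not given in the slides, so the function names and the toy intervals are assumptions.

```python
def covered_sites(tracts):
    """Set of base positions covered by half-open (start, end) intervals."""
    sites = set()
    for start, end in tracts:
        sites.update(range(start, end))
    return sites


def site_level_scores(true_tracts, predicted_tracts):
    """Per-site precision, recall and F-score for one pair of sequences."""
    truth = covered_sites(true_tracts)
    predicted = covered_sites(predicted_tracts)
    true_pos = len(truth & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f_score": f_score}


# Hypothetical tracts for one pair of sequences.
true_tracts = [(0, 100), (300, 400)]
predicted_tracts = [(50, 150), (300, 350)]
print(site_level_scores(true_tracts, predicted_tracts))
```

Storing every covered site in a set keeps the example short; for sequences of the length simulated here, interval arithmetic would be the more practical choice.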
Conclusion
Simulated data from outbred populations
Existing programs are strong performers for long tracts
SMCSD performs better on shorter tracts
SMCSD uses a more robust IBD definition
Thank you!