Imputation for genotyping by sequencing
Emma Huang, Chitra Raghavan, Ramil Mauleon, Karl Broman, Hei Leung
CSIRO MATHEMATICS, INFORMATICS AND STATISTICS AND FOOD FUTURES FLAGSHIP
Genotyping-by-sequencing (GBS) technology has made dense genotyping cost-effective for many species. However, the high levels of missing data can result in a large loss of information. The popularity of GBS makes the development of efficient imputation approaches a priority. Here we consider imputation under the further difficulty caused by multi-parental experimental crosses. We present an approach to imputing founder genotypes which allows recovery of a large proportion of markers. Once these have been imputed, we compare three approaches to imputing progeny genotypes and apply our strategy to an eight-parent rice population to demonstrate the potential gain from imputation.


  1. Imputation for genotyping by sequencing. Emma Huang, Chitra Raghavan, Ramil Mauleon, Karl Broman, Hei Leung. CSIRO MATHEMATICS, INFORMATICS AND STATISTICS AND FOOD FUTURES FLAGSHIP
  2. CSIRO MATHEMATICS, INFORMATICS AND STATISTICS AND FOOD FUTURES FLAGSHIP
  3. Comparing Designs (FOAM 2014)
     Resolution/Diversity; Allele frequency/Power
     BC, F2, RIL, MAGIC, Natural populations
     Experimental Crosses; Biparental Crosses; NAM
  4. MAGIC Wheat
     Inbreeding; No mixing; 2 generations intercrossing; 3 generations intercrossing; Doubled haploids
  5. MAGIC Arabidopsis
     [Crossing diagram] Kover et al. PLoS Genet 2009
  6. Arabidopsis MAGIC
     • 19 founders, outcrossed for four generations
     • Lines from 342 F4 families selfed for 6 generations
     • Founder lines resequenced (60x coverage), ~3M SNPs
     • ~500 progeny sequenced (0.5x coverage), ~500K SNPs
  7. MAGIC Rice
     Indica X Japonica (Bandillo et al. Rice 2013 6:11)
     • ~2000 lines selfed for 6-8 generations
     • Preliminary genotyping/phenotyping of 200 lines at S4
     • Further genotyping by sequencing (GBS) planned for S8 and founder lines
  8. Organisms
     • Arabidopsis: 125 Mb, diploid
     • Wheat: 17 Gb, hexaploid
     • Rice: 430 Mb, diploid
  9. Major differences in resources
     www.wheatgenome.org
     • Arabidopsis: reference genomes, annotation, …
     • Rice: reference genome (japonica)
     • Wheat:
  10. Genotypes
      • Arabidopsis: 60x founders, 0.5x progeny
      • Wheat: 9K/90K SNP chips
      • Rice: low-coverage GBS, founders and progeny
  11. Low-coverage GBS
      • Stretches of missing values where reads don't align
      • Arabidopsis: 0.5x coverage, 500K/3M SNPs -> 17% of total
      • Rice: filtering process reduces 159,522 SNPs -> 12,767 (8%)
      How do we make use of the genome structure to fill in the gaps in our knowledge?
  12. Genotype Imputation
      • Missing data (random)
      • Comparison across studies (systematic)
      1  0  -  1  -  1
      1  0  -  1  -  1
      0  0  -  0  -  1
      1  1  -  1  -  0
      -  -  1  -  0  -
      -  -  1  -  1  -
  13. Typical approach
      High-density reference panel
      • Phasing
      Low-density targets
      • HMM
      • Pedigree
      Probabilities
      • Phases
      • Imputation
  14. History
      Software      Release Date  Author    Institute
      (fast)PHASE   2001/2006     Stephens  Chicago
      MACH          2007          Abecasis  Michigan
      BEAGLE        2007          Browning  Washington
      AlphaImpute   2011          Hickey    Roslin
      IMPUTE(2)     2009/2012     Marchini  Oxford
      SHAPEIT(2)    2011/2013     Delaneau  CNAM
  15. Top-down
      Reference Panel
      Subj_ct 1; S_bj_ct 3; _ubj_c_ 2
  16. Spacing (/cM)  N    %MISS  %B    %M    %K
      1              200  30     93.7  96.3  79.8
      1              200  40     93.0  95.5  78.8
      1              200  50     92.0  94.8  77.5
      1              400  30     94.3  96.3  80.3
      1              400  40     93.8  95.5  79.4
      1              400  50     92.6  94.8  78.2
      2              200  30     96.7  98.3  83.5
      2              200  40     96.3  98.0  82.3
      2              200  50     95.4  97.6  80.8
      2              400  30     97.0  98.3  84.1
      2              400  40     96.5  98.0  83.1
      2              400  50     96.0  97.6  81.8
      But what happens if our reference panel is incomplete?
  17. Simplest solution: get more data
      • Higher coverage
      • Different platform
      • More replicates
      • …
  18. Progeny
      Fo_nder A; F_und_r B; Fo__der C; _oun__r D
  19. Very simple approach
      Founder
      A  1   1   -   1   1   1   0
      B  1   -   1   0   1   -   1
      C  1   -   -   0   1   -   0
      D  -   1   1   1   -   0   1
      Founder
      0  27  48  26  36  43  43  51
      1  73  52  74  64  57  57  49
  20. Very simple approach
      Founder
      A  1   1   ?   1   1?  1   0
      B  1   0   1   0   1?  ?   1
      C  1   0   ?   0   1?  ?   0
      D  0   1   1   1   0   0   1
      Founder
      0  27  48  26  36  43  43  51
      1  73  52  74  64  57  57  49
  21. More complicated version
      • Missing data in progeny
      • Recombination between markers
      • Genotyping error in progeny
  22. Simulations
      • MAGIC 8-parent populations
      • Masked out founder values and progeny values
      • Varying marker density, sample size, missing %
      • Imputed founders and used those to impute all data
  23. Simulations
      Spacing (/cM)  N    %MISS  %F0   %FC   %FK
      1              200  30     46.9  100   86.6
      1              200  40     24.5  100   85.4
      1              200  50     9.8   99.6  83.9
      1              400  30     47.3  100   88.4
      1              400  40     24.9  100   87.4
      1              400  50     10.1  100   86.2
      2              200  30     47.1  100   90.7
      2              200  40     24.8  100   89.5
      2              200  50     10.0  100   87.8
      2              400  30     47.1  100   92.1
      2              400  40     24.9  100   91.3
      2              400  50     10.0  100   90.1
  24. Rice data
      178 F4 lines, 37240 markers after filtering
      ~21% missing parents; 38% missing progeny
      Masked data on Chr 1 from 1130 markers with full parent data
      Simulated 22% missingness
      128 -> 1092 with 96% correctly imputed
      For all markers, 25.2% imputed up to 92.7%
  25. Relevance to other populations?
      • Wheat: requirement of map position
      • Arabidopsis: resequenced founders; detection of other variants?
      • Density of markers
      • Level of missingness
      • Genotyping errors
      • Heterozygosity
  26. Thanks!
      CCI
      Emma Huang
      t +61 7 3833 5542
      e Emma.Huang@csiro.au
      COMPUTATIONAL INFORMATICS AND FOOD FUTURES FLAGSHIP
      xkcd.com
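
The allele-count reasoning on the two "Very simple approach" slides can be sketched in Python. This is a minimal sketch and not the authors' implementation. It reads the two unlabelled rows (0 and 1) as progeny counts of each allele at each marker, and it assumes that the four founders contribute equally to the progeny, so the allele-0 frequency multiplied by four estimates how many founders carry allele 0. The function name `impute_founders` and the use of `None` for a missing call are choices made for this illustration.

```python
def impute_founders(founders, count0, count1):
    """Fill missing founder alleles (None) where the progeny allele counts
    leave only one consistent answer; leave None where it stays ambiguous.

    Assumption (not stated on the slides): founders contribute equally.
    """
    names = list(founders)
    imputed = {name: list(calls) for name, calls in founders.items()}
    for j in range(len(count0)):
        # Estimated number of founders carrying allele 0 at marker j.
        freq0 = count0[j] / (count0[j] + count1[j])
        expected_zeros = round(freq0 * len(names))
        known_zeros = sum(1 for n in names if founders[n][j] == 0)
        missing = [n for n in names if founders[n][j] is None]
        still_needed = expected_zeros - known_zeros
        if still_needed <= 0:
            fill = 1              # every missing founder must carry allele 1
        elif still_needed >= len(missing):
            fill = 0              # every missing founder must carry allele 0
        else:
            continue              # several assignments fit: leave as missing
        for n in missing:
            imputed[n][j] = fill
    return imputed


M = None  # missing call, shown as "-" on the slide
founders = {
    "A": [1, 1, M, 1, 1, 1, 0],
    "B": [1, M, 1, 0, 1, M, 1],
    "C": [1, M, M, 0, 1, M, 0],
    "D": [M, 1, 1, 1, M, 0, 1],
}
count0 = [27, 48, 26, 36, 43, 43, 51]
count1 = [73, 52, 74, 64, 57, 57, 49]

result = impute_founders(founders, count0, count1)
for name, calls in result.items():
    print(name, " ".join("?" if c is None else str(c) for c in calls))
```

On the slide's example this fills the same cells as the second slide and leaves the third and sixth markers undetermined, because two founders are missing there and only one of them needs to carry allele 0.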
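
The three complications listed on the "More complicated version" slide (missing progeny data, recombination between markers, genotyping error) are the ingredients of a hidden Markov model over founder ancestry. The sketch below is one hedged way to combine them, not the method used in the talk. It assumes a haploid or fully inbred progeny line, complete founder genotypes, a uniform prior over founders, and a single recombination probability `rec` between adjacent markers; `rec`, `err`, and both function names are hypothetical.

```python
def founder_posteriors(obs, founder_alleles, rec=0.01, err=0.01):
    """Posterior probability of each founder at each marker (forward-backward).

    obs: progeny calls, 0/1 or None for missing.
    founder_alleles: one complete 0/1 list per founder.
    """
    n_f = len(founder_alleles)
    n_m = len(obs)

    def emit(j, f):
        if obs[j] is None:                 # a missing call carries no information
            return 1.0
        return 1.0 - err if founder_alleles[f][j] == obs[j] else err

    def trans(f, g):
        # Stay on the same founder, or switch uniformly to another one.
        return 1.0 - rec if f == g else rec / (n_f - 1)

    def normalise(row):
        total = sum(row)
        return [x / total for x in row]

    fwd = [normalise([emit(0, f) / n_f for f in range(n_f)])]
    for j in range(1, n_m):
        fwd.append(normalise([
            emit(j, g) * sum(fwd[j - 1][f] * trans(f, g) for f in range(n_f))
            for g in range(n_f)
        ]))

    bwd = [[1.0] * n_f for _ in range(n_m)]
    for j in range(n_m - 2, -1, -1):
        bwd[j] = normalise([
            sum(trans(f, g) * emit(j + 1, g) * bwd[j + 1][g] for g in range(n_f))
            for f in range(n_f)
        ])

    return [normalise([fwd[j][f] * bwd[j][f] for f in range(n_f)])
            for j in range(n_m)]


def impute_progeny(obs, founder_alleles, rec=0.01, err=0.01):
    """Most probable allele at every marker, including the observed ones."""
    post = founder_posteriors(obs, founder_alleles, rec, err)
    imputed = []
    for j, row in enumerate(post):
        p_one = sum(p * founder_alleles[f][j] for f, p in enumerate(row))
        imputed.append(1 if p_one >= 0.5 else 0)
    return imputed


founder_alleles = [[1, 0, 1, 1],   # founder A
                   [0, 1, 0, 1]]   # founder B
obs = [1, None, 1, 1]              # second marker is missing
print(impute_progeny(obs, founder_alleles))
```

Because the flanking calls match founder A, the missing second marker is filled with A's allele. Since the output is the most probable allele at every position, an observed call that disagrees with its neighbours can also be overwritten, which is how the sketch treats genotyping error.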
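
The masking design described on the "Simulations" and "Rice data" slides (hide known calls, impute, compare with the hidden truth) can be sketched as two small helpers. This is an illustrative sketch only: the function names, the `None` encoding of missing calls, and the toy vectors are assumptions, and the imputed vector below is made up by hand rather than produced by any real method.

```python
import random


def mask_calls(calls, miss_rate, rng):
    """Hide each known call independently with probability miss_rate."""
    return [None if c is not None and rng.random() < miss_rate else c
            for c in calls]


def score_imputation(truth, masked, imputed):
    """Return (n_masked, n_imputed, n_correct) over the masked positions."""
    n_masked = n_imputed = n_correct = 0
    for t, m, i in zip(truth, masked, imputed):
        if t is not None and m is None:    # a call we deliberately hid
            n_masked += 1
            if i is not None:              # the method made a call here
                n_imputed += 1
                if i == t:
                    n_correct += 1
    return n_masked, n_imputed, n_correct


# Random masking at the 22% rate mentioned on the rice slide.
rng = random.Random(2014)
truth_long = [1, 0] * 25
masked_long = mask_calls(truth_long, 0.22, rng)

# Hand-made toy vectors to show the scoring.
truth = [1, 1, 0, 0, 1]
masked = [1, None, 0, None, None]
imputed = [1, 1, 0, 1, None]
print(score_imputation(truth, masked, imputed))
```

Reporting the three counts separately keeps "how much was imputed" apart from "how much of that was right", which are the two kinds of percentage quoted on the rice slide.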
