Integrative Analysis in 1000 Genomes - BioInfoSummer 2012 (Fuli Yu)

1000 Genomes - A deep catalog of Human Genetic Variation


  1. Integrative analysis in 1000 Genomes data. Fuli Yu, BioInfoSummer, Adelaide, Australia, 2012
  2. Outline
     • Background overview of 1000G
     • 1000G Phase I results
     • BCM NGS variation analysis software
     • Further development and timeline
  3. The history before the 1000 Genomes Project (www.hapmap.org)
     – Phase I and II: common SNPs in CEU, CHB, JPT, YRI
     – HapMap3: 11 populations
     – Patterns of linkage disequilibrium and haplotypes defined genome-wide
     Impacts
     • Complex disease gene mapping: GWAS
     • Characteristics of human genome variants: allele frequency spectrum, LD patterns, recombination rate variation…
     • Population genetics: selection, migration, drift, admixture
     1,449 published GWA at p ≤ 5×10^-8 for 237 traits
  4. Disease mutations are likely rare and heterogeneous (McClellan J and King M-C, 2010); ‘Clan Genomics’ (Lupski JR et al. 2011)
  5. The quest for rare genetic variation: HapMap, 1000G (Gibbs R 2005)
  6. Project goal (www.1000genomes.org)
     “…sequence a large number of people, to provide a comprehensive resource on human genetic variation…”
     “…find most genetic variants that have frequencies of at least 1% in the populations studied…”
  7. 1000 Genomes Project Design and Progress
     • Pilot data collected in 2008; paper published October 2010 in Nature
       – Companions in Science and Genome Research
       – Other companions later
     • Full project data collection and analysis underway
       – Phase 1 results published Nov 1st 2012
       – Phase 2 / Phase 3 being completed
     • Sequencing completion: early 2013
       – Analysis completion in 2013-2014
  8. Nature, Oct 2010
     – 179 WGS, 700 exon seq
     – 15M new SNPs
     – CNV group
     – Exon group
  9. 1000 Genomes Project Design and Progress (same slide as 7)
  10. 1000G Phase I populations
  11. Mark DePristo
  12. An integrative map of 40 million variants
      [Figure: low-pass genomes and deep exomes combined into one call set]
      • SNPs: 38M
      • INDELs: 1.4M
      • SVs: 14k
      • Integrated genotypes: ~40M
      (Hyun Min Kang)
  13. 1000 Genomes Project Design and Progress (same slide as 7)
  14. Discovery power
      • 1% SNPs: 99.3% genome / 99.8% exome
      • 0.1% SNPs: 70% genome / 90% exome
      – Exome: high r2 > 0.9
      – With LD information, WGS genotype accuracy
        – improves by 30-40% for MAF >= 1%
        – is unchanged for MAF < 0.1%
  15. Phase 1 variants are of high quality
  16. Overall genotype accuracy at ~99% (Hyun Min Kang)
  17. Sensitivity >96% in a given genome (Hyun Min Kang)
  18. Rare variation is population specific
      • 17% of low-frequency variants (0.5-5%) are found in a single ancestry group
      • 53% of variants below 0.5% are found in a single population
      • African populations have many more low-frequency variants due to bottleneck on other lineages
      • All populations are enriched in rare variants
        – Explosive recent population growth
      (Slide courtesy of Paul Flicek; Adam Auton, Gil McVean)
  19. Rare variants identify recent historical links between populations
      • ASW shows stronger sharing with YRI than LWK
      • 48% of IBS variants shared with American populations
      (Adam Auton, Gil McVean)
  20. The proportion of rare variants by conservation (Tuuli Lappalainen)
  21. The proportion of rare variants by conservation (Tuuli Lappalainen)
  22. The proportion of rare variants by conservation (Tuuli Lappalainen)
  23. Implication for GWAS imputation (Bryan Howie, Hyun Min Kang)
  24. BCM NGS PIPELINES: ATLAS2 & SNPTOOLS
  25. Overview of NGS variation analysis pipelines: SNPTools, Atlas2 (Nielsen R 2011)
  26. Atlas uses logistic regression: systematic errors
      [Figure labels: reference sequence; reads harboring reference alleles; reads harboring substitutions; j = 1 (0/1), 2 (0/0), …, m (0/0); i = 1, 2, …, n]

      $$\log\frac{\Pr(\mathrm{SNP})_i}{1-\Pr(\mathrm{SNP})_i} = \alpha + b_1\,\mathrm{RawQuality} + b_2\,\mathrm{Swap} + b_3\,\mathrm{NQS} + b_4\,\mathrm{Dist}$$

      Item                                   Value derived from our training experiment   Z score   Significance (p-value)
      Intercept α                            -3.3                                         -39       <2e-16
      Coefficient b1 for raw quality score   0.11                                         19        <2e-16
      Coefficient b2 for swap                -3.5                                         28        <2e-16
      Coefficient b3 for NQS                 0.26                                         3         0.001
      Coefficient b4 for relative position   -0.37                                        -4        0.0005

      (Shen et al. 2010, Genome Research)
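The logistic model on this slide can be evaluated directly from the tabulated coefficients. The following is a minimal sketch, not the Atlas-SNP implementation: the function name `pr_snp` and the way the four predictors are encoded (for example `swap` and `nqs` as 0/1 flags, `dist` as a relative position) are assumptions made here for illustration, since the slide does not specify them.

```python
import math

# Coefficients copied from the table on the slide (Shen et al. 2010).
ALPHA = -3.3           # intercept
B_RAW_QUALITY = 0.11   # b1, raw quality score
B_SWAP = -3.5          # b2, swap
B_NQS = 0.26           # b3, NQS
B_DIST = -0.37         # b4, relative position


def pr_snp(raw_quality, swap, nqs, dist):
    """Return Pr(SNP)_i from log(p / (1 - p)) = alpha + b1*RawQuality + b2*Swap + b3*NQS + b4*Dist.

    ASSUMPTION: predictor encodings are hypothetical; the slide gives only the coefficients.
    """
    logit = (ALPHA
             + B_RAW_QUALITY * raw_quality
             + B_SWAP * swap
             + B_NQS * nqs
             + B_DIST * dist)
    # Inverse of the logit transform.
    return 1.0 / (1.0 + math.exp(-logit))
```

With these coefficients a higher raw quality score or a passing NQS raises Pr(SNP), while a swap or a larger relative-position value lowers it, which matches the signs in the table.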
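  27. Posterior Pr(SNP) using Bayesian inference
      [Figure labels: reference sequence; reads harboring reference alleles; reads harboring substitutions; j = 1 (0/1), 2 (0/0), …, m (0/0); i = 1, 2, …, n]

      $$\Pr(\mathrm{error})_i = 1 - \Pr(\mathrm{SNP})_i$$
      $$\Pr(\mathrm{error})_j = \prod_i \Pr(\mathrm{error})_i$$
      $$\Pr(\mathrm{SNP})_j = 1 - \Pr(\mathrm{error})_j = S_j$$
      $$\Pr(\mathrm{SNP} \mid S_j, c) = \frac{\Pr(S_j \mid \mathrm{SNP}, c)\,\mathrm{prior}(\mathrm{SNP} \mid c)}{\Pr(S_j \mid \mathrm{SNP}, c)\,\mathrm{prior}(\mathrm{SNP} \mid c) + \Pr(S_j \mid \mathrm{error}, c)\,\mathrm{prior}(\mathrm{error} \mid c)}$$

      (Shen et al. 2010, Genome Research)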
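The two steps on this slide (combining per-read probabilities into S_j, then applying Bayes' rule) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the likelihoods Pr(S_j | SNP, c) and Pr(S_j | error, c) are passed in as plain numbers because the slide does not say how they are estimated, and prior(error | c) = 1 - prior(SNP | c) is assumed.

```python
import math


def site_score(pr_snp_per_read):
    """S_j = 1 - prod_i Pr(error)_i, with Pr(error)_i = 1 - Pr(SNP)_i."""
    pr_error_j = math.prod(1.0 - p for p in pr_snp_per_read)
    return 1.0 - pr_error_j


def posterior_snp(lik_snp, lik_error, prior_snp):
    """Pr(SNP | S_j, c) by Bayes' rule over two hypotheses, SNP and error.

    lik_snp   -- Pr(S_j | SNP, c), supplied by the caller (hypothetical input)
    lik_error -- Pr(S_j | error, c), supplied by the caller (hypothetical input)
    prior_snp -- prior(SNP | c); ASSUMPTION: prior(error | c) = 1 - prior(SNP | c)
    """
    prior_error = 1.0 - prior_snp
    numerator = lik_snp * prior_snp
    return numerator / (numerator + lik_error * prior_error)
```

For example, two reads with Pr(SNP)_i = 0.5 each give S_j = 0.75, and a low prior pulls the posterior down even when the likelihood favors a SNP.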
  28. Exome data summary
      • 1128 (822 Illumina / 306 SOLiD) samples in 20110521.alignment.index
        – 822 Illumina BAMs
          • MOSAIK
        – 306 SOLiD BAMs
          • BFAST
      • SNPs are called using Atlas-SNP2 at BCM
  29. Exome SNP calls on consensus target regions

      Platform           #Sample   #SNP      %dbSNP b132   Known Ti/Tv (merged / per-sample)   Novel Ti/Tv (merged / per-sample)
      Illumina + SOLiD   1128      457,095   29.23%        3.47 / 3.41                         3.05 / 2.97
      SOLiD              306       244,736   42.05%        3.54 / 3.51                         3.19 / 3.03
      Illumina           822       348,599   35.94%        3.46 / 3.37                         2.99 / 2.95

      Set                   #SNP      dbSNP   Ti/Tv
      VQSR v2b unique       23,096    15.3%   2.67
      Intersection          238,356   48.5%   3.35
      Baylor Exome unique   218,739   8.2%    2.97
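The Ti/Tv columns above are transition-to-transversion ratios. As a minimal sketch of how such a ratio is computed from a list of SNP calls (the function names and the `(ref, alt)` tuple input are illustrative choices made here, not part of Atlas-SNP2 or SNPTools):

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A<->G and C<->T substitutions are transitions; every other SNP is a transversion."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snps):
    """Compute Ti/Tv for an iterable of (ref, alt) single-base pairs with ref != alt."""
    ti = 0
    tv = 0
    for ref, alt in snps:
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    # With no transversions the ratio is undefined; report infinity.
    return ti / tv if tv else float("inf")
```

Calling `ti_tv_ratio` separately on the known (in dbSNP) and novel subsets of a call set gives the two kinds of ratio reported in the table.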
  30. SNPTools pipeline overview
      Stages, in order:
      • Raw Sequence Reads (FASTQ)
      • Short Reads Mapping, Base Quality Recalibration
      • Binary sequence Alignment/Map Files (BAM)
      • Effective Base Depth
        – Novel Effective Base Depth (EBD) summarization for each BAM
        – High performance IO, small disk footprint (1~2 GB per BAM)
      • SNP Site Discovery
        – Novel variance ratio based site discovery statistics
        – High sensitivity and specificity
      • Sequence Genotype Likelihood
        – Novel BAM-specific binomial mixture modeling (BBMM)
        – Captures BAM heterogeneity
      • Existing Genotype Integration
        – ‘Dynamic linking’ of multiple existing genotype datasets in a Bayesian style
        – Improves both existing genotypes and sequence calls significantly
      • Genotype Imputation
        – Novel imputation engine
        – High genotyping and phasing accuracy
      • Haplotype with Confidence Score (VCF)
      • Downstream Analysis
  31. EBD file format
  32. New algorithm for Genotype Likelihood
      • Challenges in Raw Genotype Likelihood
        1. Mapping/sequencing errors in site discovery
        2. BAM heterogeneity, potential contamination
      • Solutions
        1. Novel concept of Effective Base Depth (EBD) to summarize sequence details
        2. BAM-specific binomial mixture model handles BAM heterogeneity
  33. Rationale
      [Figure: matrix of 1094 BAMs by 39M VQSR SNPs. VQSR site-specific modeling: small learning size, BAM heterogeneity, low accuracy for alt/alt. BAM-specific modeling: huge learning size, high accuracy for alt/alt.]
      • BAM-specific modeling
        – Using whole-genome VQSR sites
        – Perform 3-component BBMM on each BAM using Phase I VQSR (38M) SNP sites
        – High precision modeling with 38M data points!
        – Make SNP-array-free QC on individual BAMs as one QC metric

      $$P(r_i) = \sum_{g \in \{rr,\, ra,\, aa\}} w_g\, B(r_i + a_i,\, e_g)$$
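A three-component binomial mixture of this kind is commonly fitted with expectation-maximization. The sketch below is a generic EM fit under stated assumptions, not the SNPTools code: it assumes r_i and a_i are reference and alternate read counts at site i, that e_g is the expected alternate-read fraction for genotype g, and that one model is fitted per BAM. The function names, starting values, and iteration count are all hypothetical.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def fit_bbmm(counts, n_iter=50, init_e=(0.01, 0.5, 0.99)):
    """Fit weights w_g and alt-read fractions e_g for g in (rr, ra, aa) by EM.

    counts -- list of (ref_reads, alt_reads) per site, all from ONE BAM.
    ASSUMPTION: e_g models the alternate-read fraction; the slide does not say.
    """
    weights = [1.0 / 3.0] * 3
    e = list(init_e)
    for _ in range(n_iter):
        resp_sum = [0.0] * 3    # expected number of sites per genotype
        alt_sum = [0.0] * 3     # expected alt reads per genotype
        depth_sum = [0.0] * 3   # expected total reads per genotype
        # E-step: posterior responsibility of each genotype for each site.
        for r, a in counts:
            n = r + a
            joint = [w * binom_pmf(a, n, p) for w, p in zip(weights, e)]
            total = sum(joint)
            if total == 0.0:
                continue
            for g in range(3):
                z = joint[g] / total
                resp_sum[g] += z
                alt_sum[g] += z * a
                depth_sum[g] += z * n
        used = sum(resp_sum)
        if used == 0.0:
            break
        # M-step: re-estimate mixture weights and per-genotype alt fractions.
        weights = [s / used for s in resp_sum]
        for g in range(3):
            if depth_sum[g] > 0.0:
                # Clamp away from 0 and 1 so no component assigns zero probability.
                e[g] = min(max(alt_sum[g] / depth_sum[g], 1e-6), 1.0 - 1e-6)
    return weights, e


def genotype_posteriors(r, a, weights, e):
    """Normalized P(g | r, a) for g in (rr, ra, aa) under the fitted mixture."""
    joint = [w * binom_pmf(a, r + a, p) for w, p in zip(weights, e)]
    total = sum(joint)
    return [x / total for x in joint]
```

Because the parameters are learned per BAM across many sites, differences in error rate or allele balance between BAMs end up in each BAM's own `e` values instead of being averaged over samples.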
  34. BBMM overcomes platform heterogeneity
  35. SOLiD GL: BBMM better than Samtools (HM3, OMNI) (Hyun Min Kang, Univ Mich)
  36. Improvement of using BBMM GL also seen in Beagle (Hyun Min Kang, Univ Mich)
  37. SNPTools Imputation: ‘Constraint Li-Stephens’
  38. Phase I Genotypes: Chr1, Chr20 (released 2011-05-08)

                   OMNI                        HM3                         Axiom
      call set     AA    RA    RR    non-ref   AA    RA    RR    non-ref   AA    RA    RR    non-ref
      chr1         1.03  1.02  0.19  1.43      1.64  0.86  0.21  1.43      0.85  1.38  0.19  1.51
      chr20        1.02  1.18  0.23  1.60      1.22  0.88  0.25  1.30      1.33  1.48  0.22  1.85
      chr20 V4     1.33  1.21  0.37  2.02      1.20  0.83  0.25  1.26      1.36  1.45  0.21  1.83
      chr20*       0.99  1.17  0.22  1.57      1.18  0.88  0.25  1.28      1.23  1.47  0.21  1.79
      chr20 V4*    1.01  1.11  0.22  1.52      1.18  0.83  0.24  1.25      1.24  1.44  0.21  1.77

      • chr1 and chr20 are based on new VQSR sites
      • chr20 V4 is based on old VQSR sites
      • chr20* and chr20 V4* are the overlapped sites between new VQSR and old VQSR
      Chr20 genotype call set
      • Better OMNI concordance than V4 due to site/allele selection improvement
      • Similar accuracy on overlapped sites
      Chr1 genotype call set
      • Slightly better than chr20 call set
  39. Phasing accuracy evaluation
  40. Integrating known array genotypes
      [Figure labels: sample; sites; known genotypes; raw genotype probabilities]
      • Direct re-weighting of overall accuracy. Improvement is in proportion to the number of known genotypes integrated.
      • Imputation improvement of on-array accuracy. Known genotypes are treated as 99.98% confidence priors, so the raw genotype is still improvable.
      • Imputation improvement of off-array accuracy. Makes full use of the LD between on-array and off-array sites.
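Treating a known array genotype as a 99.98% confidence prior, as this slide describes, can be illustrated with a small Bayesian update. This is a sketch under assumptions, not the SNPTools integration step: the function name is hypothetical, and the remaining 0.02% of prior mass is split equally between the other two genotypes, which the slide does not specify.

```python
GENOTYPES = ("RR", "RA", "AA")


def integrate_array_genotype(raw_likelihoods, array_genotype, confidence=0.9998):
    """Combine raw sequence genotype likelihoods with a known array genotype.

    raw_likelihoods -- likelihoods for (RR, RA, AA) from sequence data
    array_genotype  -- genotype reported by the array, one of GENOTYPES
    confidence      -- prior mass placed on the array genotype (99.98% on the slide)
    """
    # ASSUMPTION: leftover prior mass is shared equally by the other genotypes.
    other = (1.0 - confidence) / (len(GENOTYPES) - 1)
    prior = [confidence if g == array_genotype else other for g in GENOTYPES]
    joint = [lik * p for lik, p in zip(raw_likelihoods, prior)]
    total = sum(joint)
    return [x / total for x in joint]
```

Because the prior is not set to exactly 1, sufficiently strong sequence evidence can still outweigh the array call, which is one way to read the slide's remark that the genotype remains improvable.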
  41. Integrating LowPass + ExomeOffTarget
  42. Exome off-target reads are evenly distributed
  43. Exome off-target reads improve sensitivity
      • ~5% improved sensitivity in off-target regions
  44. 1000G NEW DEVELOPMENT & TIMELINE TO COMPLETION
  45. 1000 Genomes Project Design and Progress (same slide as 7)
  46. 1000G Phase 2/3 populations
      [Map labels: CHD, CDX, ACB, PJL, GWD, GHI, MSL, BEB, KHV, ESN, ITU, PEL, STU]
  47. Overview of AFR Phase 2 Call Set Sizes (chr20)
      [Figure: bar charts of SNP, indel/complex substitution, and MNP counts for alignment-based and assembly-based call sets; call sets shown: BCM, BI1, LU, SI1, UM, BC, BI2, OX1, OX2, SI2]
      (Adrian Tan, Hyun Min Kang)
  48. A time-line
      • Data generation (incl. LC, exome, CG, SNP arrays) by end March
      • Final alignment index from DCC by start June
      • Contributing call sets (SNP, indel, MNP, complex, SV) by end July
      • Consensus and resolved site list with GLs by end August
      • Integrated haplotypes by ASHG 2013
      (Gil McVean)
  49. Acknowledgements
      BCM-HGSC
      • Yi Wang: SNPTOOLS
      • Jin Yu: Atlas-SNP
      • Danny Challis: Atlas-INDEL
      • Uday Evani: VCFPRINTER
      • Matthew Bainbridge
      • Donna Muzny
      • Jeffrey Reid
      • Richard Gibbs
      BCM-BRL
      • Andrew R. Jackson
      • Sameer Paithankar
      • Cristian Coarfa
      • Aleksandar Milosavljevic
      Boston College
      • Gabor Marth
      • Amit Indap
      • Wen Fung Leong
      • Alistair Ward
      Univ of Michigan
      • Goncalo Abecasis
      • Hyun Min Kang
      Broad Institute
      • Mark DePristo
      • Ryan Poplin
      • Eric Banks
      BlueBioU@Rice University
      • Kim Andrews
      • Roger Moye
      • Chandler Wilkerson
      Stanford University
      • Simon Gravel
      • Carlos Bustamante
  50. Postdoc positions available. Contact: Fuli Yu, fyu@bcm.edu
