Phase2 SNP calling: Low-coverage and exome
Jin Yu
2012/05/07
  
Overview
•  Two ideas in Phase2 SNP calling
   – Use exome off-target reads for whole genome SNP site discovery
   – Use exome genotype calls to improve overall genotype accuracy
•  Preliminary results and plan
  
Review of phase1 SNP pipeline
Low coverage (~4X) WGS BAMs -> Multi-sample calling -> Population SNP sites
High coverage (~50X) WES BAMs -> Single-sample calling -> Individual SNP sites and genotypes
Apply multi-center consensus strategy and merge SNP sites
Calculate genotype likelihood on all candidate sites
Impute genotype/haplotype
  
Two ideas in Phase2
Low coverage (~4X) WGS BAMs -> Multi-sample calling -> Population SNP sites
High coverage (~50X) WES BAMs -> Single-sample calling -> Individual SNP sites and genotypes
Apply multi-center consensus strategy and merge SNP sites
Calculate genotype likelihood on all candidate sites
Impute genotype/haplotype
  
The first idea
•  Use exome off-target reads in whole genome SNP calling
   – Exome off-target reads add significant coverage across the whole genome
   – Preliminary results showed higher SNP sensitivity and reasonable quality
  
Exome off-target reads are significant
[Figure: average coverage on off-target regions, exome vs. lowpass, for 45 samples (NA11832 through NA11994); axis ticks 0 to 20]
[Figure: % off-target reads in exome capture sequencing for the same 45 samples; axis ticks 0% to 100%]
•  ~50% of exome capture sequencing reads are off-target
•  Off-target reads add ~1X average coverage across the whole genome
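The ~1X figure is a depth-of-coverage calculation: sequenced bases divided by the size of the region they fall on. The sketch below shows that arithmetic. The read count, on-target count, read length, and genome size are hypothetical round numbers chosen for illustration; the slides do not give them.

```python
def off_target_fraction(total_reads: int, on_target_reads: int) -> float:
    """Fraction of reads that fall outside the capture targets."""
    return (total_reads - on_target_reads) / total_reads


def mean_coverage(n_reads: int, read_length: int, region_size: int) -> float:
    """Average depth = total sequenced bases / size of the region."""
    return n_reads * read_length / region_size


# Hypothetical per-sample numbers (assumptions, not taken from the slides)
total_reads = 60_000_000
on_target_reads = 30_000_000
read_length = 100                # bp
genome_size = 3_100_000_000      # approximate human genome size in bp

frac = off_target_fraction(total_reads, on_target_reads)
added_cov = mean_coverage(total_reads - on_target_reads, read_length, genome_size)

print(f"off-target fraction: {frac:.0%}")
print(f"added genome-wide coverage: {added_cov:.2f}X")
```

With these assumed inputs, half of the reads are off-target and they contribute just under 1X genome-wide, which is the order of magnitude the slide reports.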
  
Exome off-target reads are evenly distributed
•  Weighted read depths calculated using EBD in 5kb sliding windows across chr20
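A sliding-window depth profile of this kind can be sketched as follows. This is only an illustration of the windowing step: it computes a plain unweighted mean, it does not reproduce the EBD-based weighting named on the slide, and the 1 kb step size and the function name are assumptions.

```python
from itertools import accumulate


def sliding_window_mean_depth(depth, window=5000, step=1000):
    """Return (window_start, mean_depth) for every full window over a per-base depth list."""
    # prefix[i] holds sum(depth[:i]), so any window sum is one subtraction
    prefix = [0] + list(accumulate(depth))
    result = []
    for start in range(0, len(depth) - window + 1, step):
        total = prefix[start + window] - prefix[start]
        result.append((start, total / window))
    return result


# Toy per-base depths: 10 kb at depth 1 followed by 5 kb at depth 3
depth = [1] * 10_000 + [3] * 5_000
windows = sliding_window_mean_depth(depth)

for start, mean in windows:
    print(start, mean)
```

An evenly distributed sample would show window means that stay close to the chromosome-wide average; the toy input above deliberately contains a jump so the change is visible in the output.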
  
	
  
SNP calling experiment
•  Using all 1449 phase2 lowpass BAMs and 1182 exome Illumina BAMs
•  Calling model modified from SNPtools
   – Combining reads of the same sample to estimate the variance of true variant reads
   – Grouping reads of the same sequencing platform to estimate the variance of platform-specific bias
  
SNP calls comparison (chr20 off-target regions)
| | #SNP | Ti/Tv | # in Phase1 | Known Ti/Tv | % Rare (MAF<1%) | % Novel to Phase1 | Novel Ti/Tv | OMNI poly sensitivity | OMNI mono false discovery |
|---|---|---|---|---|---|---|---|---|---|
| BI phase2 baseline | 821,141 | 2.31 | 514,021 | 2.34 | 72.5% | 37.4% | 2.24 | 98.2% (50,195/51,126) | 0.9% (12/1265) |
| BCM phase2 baseline | 847,274 | 2.33 | 502,517 | 2.42 | 68.6% | 40.7% | 2.20 | 98.6% (50,406/51,126) | 1.9% (24/1265) |
| BCM Phase2 experimental | 911,602 | 2.32 | 521,189 | 2.42 | 69.7% | 42.8% | 2.19 | 98.8% (50,494/51,126) | 2.1% (27/1265) |
| Additional SNPs | 64,328 | 2.17 | 18,672 | 2.26 | 99.1% | 71.0% | 2.13 | 0.2% (88/51,126) | 0.2% (3/1265) |
  
• Called ~7% more SNPs on off-target regions by using exome reads
• Ti/Tv and OMNI metrics showed reasonable quality
• Additional SNPs are mostly rare in the Phase1 calls or novel
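The derived columns of the comparison can be recomputed from the raw counts, which is a useful check when rebuilding the table. The sketch below uses only counts that appear in the table; the variable and function names are illustrative. The gain is shown against both possible denominators because the slide does not say which one "~7%" refers to.

```python
# Counts copied from the comparison table (chr20 off-target regions)
baseline = {"snps": 847_274, "in_phase1": 502_517}       # BCM phase2 baseline
experimental = {"snps": 911_602, "in_phase1": 521_189}   # BCM Phase2 experimental


def pct_novel(call_set):
    """Percentage of SNPs in a call set that are not in the Phase1 calls."""
    return 100 * (1 - call_set["in_phase1"] / call_set["snps"])


# "Additional SNPs" row = experimental minus baseline
additional = {key: experimental[key] - baseline[key] for key in baseline}

gain_vs_baseline = 100 * additional["snps"] / baseline["snps"]
share_of_experimental = 100 * additional["snps"] / experimental["snps"]

# OMNI array metrics for the experimental call set
omni_poly_sensitivity = 100 * 50_494 / 51_126
omni_mono_false_discovery = 100 * 27 / 1_265

print(additional)
print(f"novel among additional SNPs: {pct_novel(additional):.1f}%")
print(f"gain vs baseline: {gain_vs_baseline:.1f}%, share of experimental: {share_of_experimental:.1f}%")
print(f"OMNI poly sensitivity: {omni_poly_sensitivity:.1f}%")
print(f"OMNI mono false discovery: {omni_mono_false_discovery:.1f}%")
```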
MAF distribution comparison (after imputation)
  
Both increasing sample size and adding exome reads increase the SNP discovery rate at the rare end (0.1% bin)
The second idea
•  Using exome calls to refine genotype imputation
   – Exome calls are of high quality and independent of allele frequency (AF)
   – Exome pipeline addressed platform/capture-specific errors
  
A snapshot of Phase1 exome SNP validation results
| | total | submitted | yield | validated | validated/yield |
|---|---|---|---|---|---|
| singleton | 5372 | 100 | 93 | 92 | 98.9% |
| <1% | 4430 | 50 | 49 | 47 | 95.9% |
| >1% | 1896 | 50 | 46 | 46 | 100% |
| SVM overall | 11698 | 200 | 188 | 185 | 98.4% |
  
Why does the <1% bin have the lowest validation rate?
•  Validation sample selection
•  Imputation artifacts
A closer look at imputation artifacts
| Chr | Pos | Site source | AC | Sample picked for validation | PCR-454 validation | Phase1 release v1 | GL in log10 scale (RR/RA/AA) | Exome calls (BCM) |
|---|---|---|---|---|---|---|---|---|
| 20 | 20033172 | EX_SOLID | singleton | NA19468 (SOLiD) | 0/0 | 0/1 | ./.:-5,-0.000391054,-3.04576 | 0/1 |
| 20 | 23667835 | EX_ILLUMINA | <1% | NA18510 (Illumina) | 0/0 | 0/1 | ./.:-5,-0.00020851,-3.31876 | 0/0 or ./. |
| 20 | 23667835 | EX_ILLUMINA | <1% | NA18858 (Illumina) | 0/0 | 0/1 | ./.:-2.72124,-0.000825952,-5 | 0/0 or ./. |
| 20 | 25478962 | EX_ILLUMINA | <1% | HG00104 (SOLiD) | 0/0 | 0/1 | ./.:-5,0,-5 | 0/0 or ./. |
| 20 | 25478962 | EX_ILLUMINA | <1% | HG00234 (SOLiD) | 0/0 | 0/1 | ./.:-3.1549,-0.000304111,-5 | 0/0 or ./. |
| 20 | 25478962 | EX_ILLUMINA | <1% | HG00364 (SOLiD) | 0/0 | 0/1 | ./.:-4.69838,-8.69777e-06,-5 | 0/0 or ./. |
| 20 | 25478962 | EX_ILLUMINA | <1% | HG00593 (SOLiD) | 0/0 | 0/1 | ./.:-3.1938,-0.000278053,-5 | 0/0 or ./. |
| 20 | 25478962 | EX_ILLUMINA | <1% | HG01271 (SOLiD) | 0/0 | 0/1 | ./.:-0.31142,-0.290883,-5 | 0/0 or ./. |
| 20 | 60885811 | EX_ILLUMINA | <1% | HG00134 (SOLiD) | 0/0 | 0/1 | ./.:-0.477139,-0.477113,-0.477113 | 0/0 or ./. |
| 20 | 60885811 | EX_ILLUMINA | <1% | HG00350 (SOLiD) | 0/0 | 0/1 | ./.:-0.123447,-0.61343,-2.41117 | 0/0 or ./. |
| 20 | 62326235 | EX_ILLUMINA | <1% | HG00128 (SOLiD) | 0/0 | 0/1 | ./.:-4.22169,-2.6068e-05,-5 | 0/0 or ./. |
| 20 | 62326235 | EX_ILLUMINA | <1% | HG00179 (SOLiD) | 0/0 | 0/1 | ./.:-3.22182,-0.000773747,-2.92812 | 0/0 or ./. |
  	
  
SNPs were called in one sample but incorrectly imputed in other samples
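To read the GL column, it can help to convert each log10 triple into relative probabilities. The sketch below normalizes the three likelihoods under a flat prior, which is an assumption made here for illustration; the actual imputation uses information from other samples and is not reproduced. The two example triples are taken from the table.

```python
def normalize_gl(log10_gl):
    """Convert log10 genotype likelihoods (RR, RA, AA) into probabilities summing to 1."""
    top = max(log10_gl)
    # Subtract the maximum before exponentiating to keep the numbers well scaled
    weights = [10 ** (value - top) for value in log10_gl]
    total = sum(weights)
    return [w / total for w in weights]


def most_likely_genotype(log10_gl, labels=("RR", "RA", "AA")):
    """Return the label and normalized probability of the best-supported genotype."""
    probs = normalize_gl(log10_gl)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# NA19468 at 20:20033172: the low-coverage reads clearly favor the heterozygote
print(most_likely_genotype((-5, -0.000391054, -3.04576)))

# HG00134 at 20:60885811: all three genotypes are about equally likely
print(normalize_gl((-0.477139, -0.477113, -0.477113)))
```

The second triple is close to log10(1/3) for every genotype, so the reads carry essentially no information about that sample at that site.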
Integrating exome genotypes with GL
Override generic GL by exome and SNP array genotypes
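One way such an override could look is sketched below. This is not the pipeline's actual implementation: the function name, the genotype encoding, and the -5 floor (chosen to match the smallest value seen in the GL table) are all assumptions, and phased or multi-allelic genotypes are not handled.

```python
# Index of each unphased biallelic genotype in the (RR, RA, AA) triple
GENOTYPE_INDEX = {"0/0": 0, "0/1": 1, "1/1": 2}


def override_gl(generic_gl, confident_gt, floor=-5.0):
    """Return a log10 GL triple; a called exome/array genotype replaces the generic GL.

    A missing call ("./.") or any unrecognized genotype leaves the generic GL unchanged.
    """
    idx = GENOTYPE_INDEX.get(confident_gt)
    if idx is None:
        return tuple(generic_gl)
    # Put all the weight on the called genotype, with a floor for the other two
    return tuple(0.0 if i == idx else floor for i in range(3))


# NA18510 at 20:23667835: generic GL favors RA, exome call is 0/0
print(override_gl((-5, -0.00020851, -3.31876), "0/0"))

# No exome call available: the generic GL passes through
print(override_gl((-5.0, 0.0, -5.0), "./."))
```

Using a finite floor rather than negative infinity keeps the overridden triple in the same range as the generic GLs, so downstream imputation can still overrule a call if the evidence from haplotypes is strong; whether that is desirable is a design choice the slides leave open.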
Future work
•  Use both Illumina and SOLiD exome data to assist whole genome SNP calling in the next experiment
•  Integrate exome genotype calls in whole genome imputation
  
Acknowledgements
HGSC-BCM
•  Fuli Yu
•  Danny Challis
•  Uday Evani
•  Matthew Bainbridge
•  Donna Muzny
•  Jeffrey Reid
•  Richard Gibbs
•  Yi Wang
BlueBioU@Rice
•  Research Computing group
  
	
  


presentation in 1000 Genomes Phase2 meeting
