09 apr2012 presentation
Published in: Technology
Transcript

  • 1. Copy-number Variations in Lymphoblastoid Cell Lines. Fei Yu, Carnegie Mellon University. April 4, 2012. Advisors: Bernie Devlin, Kathryn Roeder, Chad Schafer.
  • 2. Motivation
      • Advancement in DNA sequencing technology and rare genetic diseases such as autism.
      • Data collection rush. 100,000 samples in 15 years.
      • Money. Time. Logistics.
  • 3. Motivation (notes): A decade ago, there were few successes in finding genetic variants that cause rare diseases. One challenge was that researchers could only afford to look at small regions of the genome that they thought were linked to the disease. Today, as DNA sequencing technology develops, cheap and fast whole-genome sequencing is becoming available, and researchers can look at all the genes.
  • 5. Motivation (notes): The graph shows the cost of sequencing a genome over the past decade. In 2001, the cost was $100 million, which is prohibitively high. Today, a company called Illumina offers the service at $5000 per genome, with a 20% discount on orders of 50 genomes or more. The drastic drop in cost triggered a rush to collect as many DNA samples as possible. It is projected that in 15 years, we will have over 100,000 samples.
  • 7. Motivation (notes): Despite the relatively low cost per genome, it still costs hundreds of millions of dollars to gather so many samples. Building the infrastructure to store, maintain, and distribute the data can cost as much as the sequencing itself. Furthermore, because these experiments involve human subjects, the researchers also have to obtain permission from the patients and safeguard their privacy. All in all, it is a huge investment of our society's resources.
  • 8. Motivation: But there is one problem: most DNA sequencing projects use lymphoblastoid cell lines instead of peripheral blood.
      Cell line
      • Immortal(!)
      • Cultivated from peripheral blood
      Blood
      • Obtained from peripheral blood cells, consisting of red blood cells, white blood cells, and platelets
      • Best source of the DNA
      • Mortal
  • 9. Motivation (notes): But there is one problem: most DNA sequencing projects use lymphoblastoid cell lines instead of peripheral blood. Cell lines are immortal, so they are suitable for permanent storage. But they are products of peripheral blood cultivation. Blood data are obtained directly from peripheral blood cells, consisting of red blood cells, white blood cells, and platelets. They are the best source of the DNA. However, because they are mortal, it is not practical to store them and use them at a later time. That is why people use cell lines for sequencing.
  • 10. Motivation: Are cell line data truthful representations of the DNA? In other words, how close are cell line data to blood data?
  • 11. Motivation (notes): Our concern is whether cell line data are truthful representations of the DNA. In other words, we want to know how close cell line data are to blood data. If the cell lines are corrupted, any subsequent analyses lose their basis, and all the time, money, and effort invested in collecting these DNA samples will have gone to waste.
  • 12. Outline
      1 Motivation
          • A Strange Scenario
      2 Data
          • Pipeline
      3 How to detect CNVs
          • Filtering: Step I
          • Filtering: Step II
          • Filtering: Step III
          • Filtering: Step IV
          • Results
      4 Conclusions
  • 13. Inference from Blood and Cell: A Strange Scenario. For a diploid organism (human):
      Chromosome p2 \ Chromosome p1 |  A   |  B
      A                             |  AA  |  AB
      B                             |  BA  |  BB
      Homozygous if AA or BB. Heterozygous if AB or BA.
  • 14. Inference from Blood and Cell: A Strange Scenario (notes): For diploid organisms such as humans, chromosomes come in pairs. Each chromosome contains one copy of a gene. An allele is one of two or more forms of a gene. If both alleles on a pair of chromosomes are the same, we call the genetic locus homozygous; if the alleles are different, we call the genetic locus heterozygous.
  • 15. Inference from Blood and Cell: A Strange Scenario. 1 = Heterozygous, 0 = Homozygous. [Figure: calls along genomic locations; at location 150 the blood call is 1 and the cell line call is 0.]
  • 16. Inference from Blood and Cell: A Strange Scenario (notes): Denote a heterozygous locus by 1 and a homozygous locus by 0. The picture shows that at location 150, blood is heterozygous and the cell line is homozygous.
  • 17. Inference from Blood and Cell: A Strange Scenario. [Figure: blood and cell line calls at loci where blood is heterozygous.]
  • 18. Inference from Blood and Cell: A Strange Scenario (notes): If we only look at loci at which blood is heterozygous, we may encounter the situation depicted in this picture. There are consecutive loci that are homozygous in the cell line but heterozygous in the blood. This looks suspicious.
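The scan described in these notes can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function name `suspicious_runs`, the `min_len` threshold, and the toy call vectors are all assumptions made for the example. It keeps only loci where the blood call is 1 (heterozygous) and reports runs of consecutive cell line calls equal to 0 (homozygous).

```python
from itertools import groupby


def suspicious_runs(blood, cell, min_len=3):
    """Return (start_index, run_length) pairs for runs of homozygous
    cell line calls among loci where the blood call is heterozygous.

    blood, cell: equal-length sequences of 0/1 calls (1 = heterozygous).
    min_len: hypothetical minimum run length worth reporting.
    """
    # Keep only loci at which blood is heterozygous, remembering positions.
    kept = [(i, c) for i, (b, c) in enumerate(zip(blood, cell)) if b == 1]

    runs = []
    # Group consecutive kept loci by the cell line call.
    for value, group in groupby(kept, key=lambda pair: pair[1]):
        members = list(group)
        if value == 0 and len(members) >= min_len:
            runs.append((members[0][0], len(members)))
    return runs


# Toy data, invented for illustration only.
blood = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
cell = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
print(suspicious_runs(blood, cell))  # [(3, 3)]
```

The run starting at index 3 covers three blood-heterozygous loci that the cell line calls homozygous, which is the pattern the notes call suspicious.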
  • 19. Detour: What is Copy-number Variation? Copy-number variations (CNVs) correspond to relatively large regions of the genome that have been deleted on a chromosome. [Figure: a chromosome with a deleted region.]
  • 20. Detour: What is Copy-number Variation? (notes): Now we take a detour and define copy-number variation. Copy-number variations (CNVs) correspond to relatively large regions of the genome that have been deleted on a chromosome. The picture shows the black region deleted from the chromosome.
  • 21. Inference from Blood and Cell: A Strange Scenario. What a CNV in the cell line looks like. [Figure: chromosome pairs for blood and for cell line.]
  • 22. Inference from Blood and Cell: A Strange Scenario (notes): In this picture, the blood, which can be thought of as a representation of the DNA, is heterozygous. The cell line, on the other hand, has the red region deleted. When we sequence the samples, we look at both chromosomes. But in this case, because the red region in the cell line is deleted, we can only sequence the remaining chromosome. As a result of the deletion, the cell line will always tell us this genetic locus is homozygous even though the DNA is heterozygous.
  • 23. Inference from Blood and Cell: A Strange Scenario. This could be a CNV!
  • 24. Inference from Blood and Cell: A Strange Scenario (notes): Let's go back to this picture. This scenario fits the profile of a CNV. If this indeed happens in the cell line, we know the cell line is corrupted in that region.
  • 25. Goal: Having CNVs in the cell line means the cell line is locally corrupted. The goal of this project is to use the amount of CNVs to quantify how reliable the cell line is as a source of DNA.
  • 26. Data. The data we have:
      • 16 individuals' entire exomes sequenced by next-generation sequencing (NGS) technology.
      • Each individual is sequenced twice: once using blood samples and once using cell line samples.
  • 27. Data (notes): The data we have allow us to compare cell line data and blood data and answer the question of whether they are the same.
  • 28. Pipeline. [Diagram: NGS of blood and cell line samples → BAM files; BAM files → GATK → VCF files; BAM files → Samtools → additional locus-specific information; VCF files and locus-specific information → Python scripts → data ready for analysis.]
  • 29. Pipeline: NGS. [Figures: the pipeline diagram and an NGS illustration.]
  • 30. Pipeline: NGS. Next-generation sequencing (NGS) technology
      Advantages:
      • Fast
      • Cost-effective
      Disadvantages:
      • Short DNA read fragments are randomly located ⟹ great challenge for fragment assembly and mapping
  • 31. Pipeline: BAM files. Our raw data are BAM files. Their sizes are huge:
      • encode the whole genome's nucleotide alignments
      • also encode the quality of each read for a given locus (a locus can be covered by as many as 1000 reads)
                                               | Mt. Sinai | Vanderbilt
      # of subjects                            | 7         | 12
      # of subjects that have corrupted data   | 1         | 2
      Average file size                        | 7.4 GiB   | 17 GiB
      Total size                               | ≈ 85 GiB  | ≈ 340 GiB
  • 33. Pipeline: BAM files. [Figure]
  • 34. Pipeline: GATK, Samtools. [Pipeline diagram with the GATK and Samtools steps.]
      • Genome Analysis Toolkit (GATK):
          - makes inference from the BAM files and determines whether a locus is homozygous or heterozygous.
          - applies different filters to obtain the desired results.
      • Samtools: extracts read-level information such as sequencing quality, alignment quality, and read direction.
  • 36. Pipeline: GATK, Samtools. Processing time: ∼1 day. GATK outputs:
      [HEADER LINES]
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
      chr1 873762 . T G 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99
      chr1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:25
      chr1 899282 rs28548431 C T 71.77 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:10
      chr1 974165 rs9442391 T C 29.84 LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:
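To make the GATK output concrete, here is a hedged sketch of how a single VCF data line could be turned into the 0/1 zygosity coding used in the talk. The function name `zygosity_from_vcf_line` is hypothetical and this is not the author's script; it relies only on the standard VCF layout (ninth column FORMAT, tenth column the sample) and reads the `GT` field. The example lines are copied from the slide, where the sample column is truncated, which does not matter because `GT` comes first.

```python
def zygosity_from_vcf_line(line):
    """Return (chrom, pos, g) for one VCF data line with a single sample.

    g is 1 if the genotype is heterozygous, 0 if homozygous,
    and None if the genotype is missing (for example "./.").
    """
    fields = line.split()  # VCF columns are tab-separated; split() also handles spaces
    chrom, pos = fields[0], int(fields[1])

    # Column 9 names the per-sample keys, column 10 holds their values.
    format_keys = fields[8].split(":")
    sample_values = fields[9].split(":")
    genotype = sample_values[format_keys.index("GT")]

    # Genotypes look like "0/1" (unphased) or "0|1" (phased).
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return chrom, pos, None
    return chrom, pos, int(len(set(alleles)) > 1)


# Lines as shown on the slide (sample column truncated there).
het = "chr1 873762 . T G 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99"
hom = "chr1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:25"
print(zygosity_from_vcf_line(het))  # ('chr1', 873762, 1)
print(zygosity_from_vcf_line(hom))  # ('chr1', 877664, 0)
```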
  • 37. Pipeline: tidy up. [Pipeline diagram with the Python scripts step.] Python scripts:
      • extract useful information from GATK and Samtools' outputs
      • prepare data for analysis in R
  • 39. Notations. Let $T$ denote the zygosity of a genetic locus: $T = 1$ if the locus is heterozygous, $T = 0$ if the locus is homozygous. Let $G$ denote the zygosity called by GATK: $G = 1$ if the call is heterozygous, $G = 0$ if the call is homozygous. Let $f_+ = P(G = 1 \mid T = 0)$ [false positive] and $f_- = P(G = 0 \mid T = 1)$ [false negative].
  • 42. Distribution of $(G_B, G_C)$ I. We can describe the distribution of the observations $(G_B, G_C)$ in four cases:
      (I) $T_B = T_C = 0$
                       | Cell call 0          | Cell call 1
      Blood call 0     | $(1 - f_+)^2$        | $(1 - f_+) f_+$
      Blood call 1     | $f_+ (1 - f_+)$      | $f_+^2$
      (II) $T_B = 0$, $T_C = 1$ (i.e., a mutation)
                       | Cell call 0          | Cell call 1
      Blood call 0     | $(1 - f_+) f_-$      | $(1 - f_+)(1 - f_-)$
      Blood call 1     | $f_+ f_-$            | $f_+ (1 - f_-)$
  • 43. Distribution of $(G_B, G_C)$ II.
      (III) $T_B = 1$, $T_C = 0$ (i.e., a deletion)
                       | Cell call 0            | Cell call 1
      Blood call 0     | $f_- (1 - f_+)$        | $f_- f_+$
      Blood call 1     | $(1 - f_-)(1 - f_+)$   | $(1 - f_-) f_+$
      (IV) $T_B = T_C = 1$ (i.e., not a deletion)
                       | Cell call 0          | Cell call 1
      Blood call 0     | $f_-^2$              | $f_- (1 - f_-)$
      Blood call 1     | $(1 - f_-) f_-$      | $(1 - f_-)^2$
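Each of the four tables is a product of a blood-call probability and a cell-call probability, so all of them can be generated from the false positive rate f+ and the false negative rate f−. The sketch below does that; the function names and the numeric rates (0.05 and 0.01) are illustrative assumptions, not values taken from the analysis at this point in the talk.

```python
def call_probs(true_zygosity, f_pos, f_neg):
    """Return [P(G = 0 | T), P(G = 1 | T)] for a true zygosity T in {0, 1}."""
    if true_zygosity == 0:
        return [1 - f_pos, f_pos]   # a heterozygous call here is a false positive
    return [f_neg, 1 - f_neg]       # a homozygous call here is a false negative


def joint_table(t_blood, t_cell, f_pos, f_neg):
    """Return table[g_blood][g_cell] = P(G_B = g_blood, G_C = g_cell | T_B, T_C),
    treating the blood and cell line calls as independent given the truth."""
    pb = call_probs(t_blood, f_pos, f_neg)
    pc = call_probs(t_cell, f_pos, f_neg)
    return [[pb[gb] * pc[gc] for gc in (0, 1)] for gb in (0, 1)]


# (T_B, T_C) for the four cases on the slides.
cases = {"I": (0, 0), "II": (0, 1), "III": (1, 0), "IV": (1, 1)}

# Illustrative error rates only.
for name, (tb, tc) in cases.items():
    table = joint_table(tb, tc, f_pos=0.05, f_neg=0.01)
    print(f"Case {name}: P(G_B=1, G_C=0) = {table[1][0]:.4f}")
```

With a small false negative rate, the (blood = 1, cell = 0) entry is close to 1 only in Case III, which is why that observation pattern is the one worth filtering for.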
  • 44. Probability of observing $(G_B = 1, G_C = 0)$ in each of the four possible cases. [Figure: tree that splits $T_B = 0$ into Case I ($T_C = 0$) and Case II ($T_C = 1$), and $T_B = 1$ into Case III (deletion, $T_C = 0$) and Case IV (no deletion, $T_C = 1$).]
  • 45. How to detect CNVs (notes): Let's focus on the $(G_B = 1, G_C = 0)$ observations and find out which of them indeed come from CNVs.
  • 46. More on GATK. GATK takes into account the number of each type of nucleotide, the read quality, and the mapping quality at a genetic locus to make inference on its true zygosity. But the inference is not always accurate. Luckily, we can control how GATK makes mistakes, which I will explain in a moment.
  • 48. Filtering. Cases: Case I ($T_B = 0$, $T_C = 0$), Case II ($T_B = 0$, $T_C = 1$), Case III ($T_B = 1$, $T_C = 0$), Case IV ($T_B = 1$, $T_C = 1$). Outline:
      1 Use GATK to minimize Case II and Case IV by controlling threshold parameters that reduce $f_-$ at the expense of allowing a larger $f_+$.
      2 Filter the variants called in the previous step and eliminate calls with lower quality metrics. By reducing $f_+$, we can eliminate many variants in Case I.
      3 Use hypothesis tests to pick out Case III candidate loci.
      4 Fit the candidate loci to a hidden Markov model to pick out the most likely candidate loci.
  • 49. Filtering: Step I.
      • Run GATK with low threshold parameters to obtain a crude set of loci.
      • Effects: $f_- \approx 0$, increase $f_+$.
      • $f_- \approx 0 \implies$ minimize Case II and Case IV.
      • $f_+$ is bounded above by a small number, where $\#(a, b)$ counts loci with $(G_B, G_C) = (a, b)$:
        $\hat{f}_+ = \dfrac{\#(1, 0) + \#(0, 1)}{\#(1, 0) + \#(0, 1) + \#(G_B = 0, G_C = 0)} \approx 0.05$
      • Minimize Case II and Case IV. Retain Case I and Case III. Number of loci retained = 15,971.
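The bound on the false positive rate in Step I is a ratio of counts: discordant call pairs over discordant pairs plus pairs where both calls are homozygous. A minimal sketch follows; the function name `estimate_f_pos_bound` and the toy list of call pairs are assumptions for illustration and do not reproduce the study's data.

```python
from collections import Counter


def estimate_f_pos_bound(call_pairs):
    """Estimate the Step I upper bound on the false positive rate.

    call_pairs: iterable of (g_blood, g_cell) tuples with 0/1 entries.
    Returns (#(1,0) + #(0,1)) / (#(1,0) + #(0,1) + #(0,0)).
    """
    counts = Counter(call_pairs)
    discordant = counts[(1, 0)] + counts[(0, 1)]
    denominator = discordant + counts[(0, 0)]
    if denominator == 0:
        raise ValueError("no discordant or doubly homozygous loci to estimate from")
    return discordant / denominator


# Invented counts, chosen only so the ratio is easy to check by hand.
pairs = [(0, 0)] * 95 + [(1, 0)] * 3 + [(0, 1)] * 2 + [(1, 1)] * 10
print(estimate_f_pos_bound(pairs))  # 0.05
```

Loci where both calls are heterozygous do not enter the ratio, matching the formula on the slide.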
  • 54. Figure: KS tests for runs of 1s against the gamma distribution. Shape and scale parameters for the gamma distribution are estimated for each chromosome and for each individual. Cells with fewer than 20 runs are indicated by "-". Cells with p-value > 0.05 are colored grey.
      • Runs of 1s are interrupted randomly by short runs of 0s.
      • Many of the 0 calls are just random noise.
  • 55. Filtering: Step II. Remaining cases: Case I ($T_B = 0$, $T_C = 0$) and Case III ($T_B = 1$, $T_C = 0$).
      • Run GATK's Variant Quality Score Recalibration (VQSR) to filter out the false positive calls (loci in Case I).
      • VQSR: fit a Gaussian mixture model to known variants and novel variants; filter based on the score of the variants.
      • Effect: decrease $f_+$.
      • Eliminate most of Case I. Retain Case III. Number of loci retained = 380.
  • 57. An important covariate for VQSR is strand bias. DNA's double helix structure: forward and backward strands. [Figure] Definition: strand bias is the tendency to make more variant calls in one direction than in the other.
  • 58. Quantifying Strand Bias.
      • Fisher's exact test: $p = \binom{n_{1\cdot}}{n_{11}} \binom{n_{2\cdot}}{n_{21}} \Big/ \binom{n_{\cdot\cdot}}{n_{\cdot 1}}$
                       | Forward        | Backward       | Total
      Reference        | $n_{11}$       | $n_{12}$       | $n_{1\cdot}$
      Alternative      | $n_{21}$       | $n_{22}$       | $n_{2\cdot}$
      Total            | $n_{\cdot 1}$  | $n_{\cdot 2}$  | $n_{\cdot\cdot}$
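The formula on this slide is the hypergeometric probability of one 2×2 table with fixed margins. The sketch below computes that probability with `math.comb` and, as one common convention, sums the probabilities of all tables no more likely than the observed one to get a two-sided p-value. The function names and the read counts are made up for illustration; how GATK itself turns this test into a strand bias annotation is not shown here.

```python
from math import comb


def table_probability(n11, n12, n21, n22):
    """Hypergeometric probability of a 2x2 table with fixed margins:
    C(n1., n11) * C(n2., n21) / C(n.., n.1)."""
    row1, row2 = n11 + n12, n21 + n22
    col1 = n11 + n21
    return comb(row1, n11) * comb(row2, n21) / comb(row1 + row2, col1)


def fisher_two_sided(n11, n12, n21, n22):
    """Two-sided Fisher exact p-value: total probability of all tables with the
    same margins that are no more likely than the observed table."""
    row1, row2 = n11 + n12, n21 + n22
    col1 = n11 + n21
    p_observed = table_probability(n11, n12, n21, n22)

    total = 0.0
    # k is the top-left cell; the margins determine the other three cells.
    for k in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = table_probability(k, row1 - k, col1 - k, row2 - (col1 - k))
        if p <= p_observed * (1 + 1e-9):  # small tolerance for float ties
            total += p
    return total


# Hypothetical read counts: rows = reference/alternative, columns = forward/backward.
print(fisher_two_sided(20, 22, 15, 1))
```

A small p-value for a table like this one would indicate that alternative-allele reads are concentrated on one strand.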
  • 61. Filtering: Step III
    Cases: TB=0, TC=0 (Case I); TB=1, TC=0 (Case III).
      • For each locus, do a hypothesis test: H0: TB = TC versus H1: TB ≠ TC.
      • Logistic regression: I_{G=1} ∼ I_{isblood} + strand direction + base quality + mapping direction
      • Find loci for which I_{isblood} is significant at the 10% level. Number of Case III candidates = 126.
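The per-locus test in Step III can be illustrated with a small, dependency-free sketch. It fits a logistic regression by Newton-Raphson and tests the `is_blood` term with a likelihood-ratio test. The slides do not say which test or software was used, so the likelihood-ratio test, all function names, and the toy data are assumptions. Only the intercept and `is_blood` columns are included here; strand direction and the quality covariates would be extra columns of `X`.

```python
import math

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * p for a, p in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_logistic(X, y, iters=25):
    """Maximum-likelihood coefficients via Newton-Raphson; rows of X start with 1."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)
            for j in range(k):
                grad[j] += (yi - mu) * xi[j]
                for m in range(k):
                    hess[j][m] += w * xi[j] * xi[m]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

def log_lik(X, y, beta):
    """Bernoulli log-likelihood of the fitted model."""
    ll = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        ll += yi * eta - math.log1p(math.exp(eta))
    return ll

def lrt_p_value(X, y, drop):
    """Likelihood-ratio test (1 degree of freedom) for the column at index `drop`."""
    X0 = [[v for j, v in enumerate(row) if j != drop] for row in X]
    stat = 2.0 * (log_lik(X, y, fit_logistic(X, y)) - log_lik(X0, y, fit_logistic(X0, y)))
    return math.erfc(math.sqrt(max(0.0, stat) / 2.0))  # chi-square(1) upper tail

# Toy data for one locus (hypothetical): columns are [intercept, is_blood].
X = [[1, 1]] * 10 + [[1, 0]] * 10
y = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8
beta = fit_logistic(X, y)
p = lrt_p_value(X, y, drop=1)
is_candidate = p < 0.10   # the 10% level used on the slide
```

For this toy locus the p-value falls between 0.05 and 0.10, so it would pass the 10% threshold but not a 5% one.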
  • 63. Features of the Data
      • from blood or cell line
      • strand direction (forward or backward)
      • sequencing quality
  • 65. Sequencing Quality
    Quality is inversely related to P(error).
      • base quality: quality of a read at a genetic locus; determined by the sequencing equipment.
      • mapping quality: alignment quality of a read; calculated from base qualities and the reference sequence.
    base quality + mapping quality ⟹ genotype likelihood, i.e., the likelihood of a locus being homozygous or heterozygous.
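The slide states only that quality is inversely related to P(error). Assuming the qualities are Phred-scaled (the usual convention for base and mapping qualities, though not stated on the slide), the relation is Q = −10·log10 P(error). A minimal sketch with hypothetical function names:

```python
import math

def phred_to_error_prob(q):
    """Error probability implied by a Phred-scaled quality score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred-scaled quality score for a given error probability."""
    return -10 * math.log10(p)

# Under this assumption, a quality of 30 corresponds to a 1-in-1000 chance of error.
p_err = phred_to_error_prob(30)
q = error_prob_to_phred(0.01)
```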
  • 67. Logistic Regression
    I_{G=1} ∼ I_{isblood} + strand direction + base quality + mapping direction
      • Each locus is fit to a logistic regression model.
      • We perform the deviance χ² goodness-of-fit test for each model; only 2.4% of the tests are significant at the 5% level.
    [Figure: Histogram of p-values from the χ² tests of the residual deviance; x-axis: p-values (0 to 1), y-axis: frequency.]
  • 69. Filtering: Step IV
    Did the Case III candidates come from CNVs?
    Define the length of a run of 0s as the number of consecutive (GB, GC) = (1, 0) calls.
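The run-length definition above can be sketched in a few lines. The encoding is an assumption taken from the later slides: each locus is written as "0" when the blood and cell-line calls are mismatched, (GB, GC) = (1, 0), and "1" when they match. The function name is hypothetical, and the example string is one of the sequences from slide 72 with the "|" separators removed.

```python
from itertools import groupby

def zero_run_lengths(calls):
    """Lengths of every maximal run of consecutive '0' characters in `calls`."""
    return [sum(1 for _ in group) for key, group in groupby(calls) if key == "0"]

# One sequence from slide 72, separators removed.
calls = "1011111001" + "000" + "1111111010"
runs = zero_run_lengths(calls)
longest = max(runs)
```

Collecting these lengths over all loci would give the sample from which a density estimate and its 95% quantile can be computed.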
  • 71. Filtering: Step IV
    [Figure: Density estimate of the lengths of runs of (G_B, G_C) = (1, 0) calls; N = 3286, bandwidth = 0.127; the region above the 95% quantile is marked.]
  • 72. Filtering: Step IV
    10 loci come from runs of 0s of length at least 3:
      1101111111|000|1111111111
      1010111011|000|1111111111
      1111111011|000|1111110111
      0011111011|000|1101111111
      1111111111|000|1111111111
      1101110111|000|1111111111
      1011111001|000|1111111010
      1111111111|000|1111011110
      1111111111|000|1111111111
      1111111111|000|1101111011
    Notice the short runs of 1s. Are they errors?
  • 73. (Future Work) Filtering: Step IV
    Find the probability of <1011111001|000|1111111010> using a hidden Markov model.
    [Figure: HMM diagram with hidden states "CNV" and "not CNV" and observed symbols "mismatched (0)" and "matched (1)".]
    Transition matrix, with rows and columns ordered (CNV, not CNV):
    $P_{i,i+1} = \begin{pmatrix} \gamma & 1-\gamma \\ 1-\lambda & \lambda \end{pmatrix}$
    where γ and λ are big.
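The probability of an observed 0/1 sequence under a two-state HMM like this one can be computed with the forward algorithm. The sketch below is illustrative only. The slide gives the transition structure (γ, λ) but no emission probabilities or initial distribution, so the values of `emit`, `init`, `gamma`, and `lam` are hypothetical placeholders, as is the function name.

```python
def sequence_probability(obs, gamma, lam, emit, init):
    """Forward algorithm for a 2-state HMM.

    States: 0 = CNV, 1 = not CNV. Observations: 0 = mismatched, 1 = matched.
    emit[s][o] is P(observe o | state s); init[s] is P(first state = s).
    """
    trans = [[gamma, 1 - gamma],
             [1 - lam, lam]]
    # alpha[s] = P(observations so far, current state = s)
    alpha = [init[s] * emit[s][obs[0]] for s in range(2)]
    for o in obs[1:]:
        alpha = [sum(alpha[r] * trans[r][s] for r in range(2)) * emit[s][o]
                 for s in range(2)]
    return sum(alpha)

# Placeholder parameters (assumptions, not taken from the slides).
gamma, lam = 0.9, 0.95
emit = [[0.9, 0.1],    # in a CNV: mostly mismatched (0)
        [0.05, 0.95]]  # not in a CNV: mostly matched (1)
init = [0.1, 0.9]

obs = [int(c) for c in "1011111001" + "000" + "1111111010"]
prob = sequence_probability(obs, gamma, lam, emit, init)
```

For long sequences the forward probabilities underflow, so a practical version would work in log space or rescale `alpha` at each step.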
  • 74. Results
      • After a series of filtering steps, only 10 loci in the pool of 16 individuals are found to be CNV candidates.
      • Those 10 loci fall into short runs of 0s. They are unlikely to be CNVs.
      • We will fit the HMM when there are more reliable signals.
  • 76. Conclusions
      • No CNV is good news. We now know that a great amount of time, money, and effort has not gone to waste.
      • This is a useful assessment procedure when labs create cell lines.
      • In separate work, we extended this procedure to finding mutations in the cell line, i.e., TB = 0, TC = 1.
