Ashg sedlazeck grc_share


  1. 1. Structural Variation Characterization Across the Human Genome and Populations. Fritz Sedlazeck, October 17, 2017
  2. 2. Scientific interests. Detection of variants: Sniffles (in bioRxiv); SURVIVOR, Jeffares et al. (2017); BOD-Score, Sedlazeck et al. (2013). Mapping/assembly of reads: NextGenMap-LR (in bioRxiv); Falcon Unzip, Chin et al. (2016); NextGenMap, Sedlazeck et al. (2013). Benchmarking/biases: DangerTrack, Dolgalev et al. (2017); Teaser, Smolka et al. (2015); Sequencing, Jünemann et al. (2013). Applications, model organisms: cancer (SKBR3) (in bioRxiv); miRNA editing (Vesely et al. 2012). Applications, non-model organisms: Cottus transposons (Dennenmoser et al. 2017); Clunio (Kaiser et al. 2016); Seabass (Vij et al. 2016); Pineapple (Ming et al. 2015). “moonlight”
  3. 3. Structural Variations: Genomic Disorders; Evolution; Impact on regulation; Impact on phenotypes. [Figure: regulatory state by cell line. Features: CTCF binding site, enhancer, open chromatin region, promoter, promoter flanking region, TF binding site. States: ACTIVE, INACTIVE, POISED, REPRESSED, NA. Scale: # affected, 0 to 2000]
  4. 4. Diploid genome • Impact on Regulation • Variability of genes • Need to understand the full structure
  5. 5. Challenges: Pursuing the diploid genome 1. Accurate prediction of SVs 2. Comparison of SVs 3. Annotation and interpretation of SVs 4. Population analysis 5. Diploid Genome. Layer et al. (2014)
  6. 6. 1.1 How to detect Structural Variations (SVs)
  7. 7. 1.1 Long Read Technologies • (+) SVs in repetitive regions • (+) Span SVs • (+) Uniform coverage • (+) Can identify more complex SVs • (-) Higher seq. error rate • (-) Hard to align
  8. 8. 1.1 Accurate mapping and SV calling NextGenMap-LR (NGMLR): • Long read mapper • Convex gap costs • Faster than BWA-MEM Sniffles: • SV caller for long reads • All types of SVs • Phasing of SVs
  9. 9. 1.1 NA12878: SV calling
    Tech.                    Coverage  Avg read len  Method               SVs     TRA
    PacBio                   55x       4,334         Sniffles             22,877  119
    Oxford Nanopore @Baylor  34x       4,982         Sniffles             12,596  46
    Illumina                 50x       2 x 101       Manta, Delly, Lumpy  7,275   2,247
    Sedlazeck et al. (2017)
  10. 10. 1.1 NA12878: SV calling
    Tech.                    Coverage  Avg read len  Method               SVs     TRA    DEL    INS
    PacBio                   55x       4,334         Sniffles             22,877  119    9,933  12,052
    Oxford Nanopore @Baylor  34x       4,982         Sniffles             12,596  46     7,102  5,166
    Illumina                 50x       2 x 101       Manta, Delly, Lumpy  7,275   2,247  3,744  0
    Sedlazeck et al. (2017)
  11. 11. 1.1 NA12878: check 2,247 vs 119 TRA
    [Figure: read alignments. Illumina data: translocation. PacBio data and ONT data: truncated reads, insertion in rep. region]
    Overlap          Illumina TRA (%)
    Insertions       53.05
    Deletions        12.06
    Duplications     0.57
    Nested           0.31
    High coverage    1.87
    Low complexity   9.79
    Explained        77.65
    Sedlazeck et al. (2017)
  12. 12. 1.1 NA12878: check 2,247 vs 119 TRA [Figure panels: ONT data; PacBio data; Illumina data. Labels: insertion in rep. region; inversion; translocation; truncated reads: insertion in rep. region] Sedlazeck et al. (2017)
  13. 13. 1.2 More complex SVs Inverted tandem duplication: • Pelizaeus-Merzbacher disease • MECP2 • VIPR2 Sedlazeck et al. (2017) [Figure panels: PacBio data; Illumina data]
  14. 14. 1.2 More complex SVs Inversion flanked by deletions: • Haemophilia A • Only found via long-range PCR! (2007) Sedlazeck et al. (2017) [Figure panels: Illumina data; PacBio data]
  15. 15. Challenges 1. Accurate prediction of SVs: Sniffles (talk on Thursday!) 2. Comparison of SVs 3. Annotation and interpretation of SVs 4. Population analysis 5. Diploid Genome. Layer et al. (2014)
  16. 16. 2. Comparison of SVs SURVIVOR Framework: • Compare SVs • GiaB: 95 VCF files: 1 minute • Simulate SVs • Simulate long reads • Summarize SV results Jeffares et al. (2017) [Figure labels: New SVs; Observed SVs]
  17. 17. 2. Genome in a Bottle: merging 95 VCFs (1 min) [Venn diagram: 10x Genomics, BioNano, Complete Genomics, Illumina, PacBio; minimum 2 callers] SV caller comparison: using PCR + Sanger to validate SVs from multiple categories. Join CSHL + Baylor to help with validations!
  18. 18. Challenges 1. Accurate prediction of SVs: Sniffles (talk on Thursday!) 2. Comparison of SVs: SURVIVOR 3. Annotation and interpretation of SVs 4. Population analysis 5. Diploid Genome
  19. 19. 3. Annotation: SURVIVOR_ant Annotating SVs with: • Multiple GTF, BED, VCF Genome in a Bottle: • 63,677 genes (GTF) • 1,733,686 regions (3 BED files) • 22 seconds • 8,314 genes impacted Sedlazeck et al. (2017) [Figure: histogram of genes impacted by SVs; x-axis: # SVs hitting a gene; y-axis: # genes]
  20. 20. Challenges 1. Accurate prediction of SVs: Sniffles (talk on Thursday!) 2. Comparison of SVs: SURVIVOR 3. Annotation and interpretation of SVs: SURVIVOR_ANT 4. Population analysis 5. Diploid Genome
  21. 21. 4. SVs in Population: SURVIVOR • Birth defect study (Karyn Meltz Steinberg, WashU: Wed. 9am: Room 310A) • 4 callers, 114 samples • CCDG (William Salerno, HGSC: poster on Friday, #1281) • 5 callers, 22,600 samples • Non-human: • S. pombe: 3 callers, 161 samples • Tomato: 3 callers, 846 samples
  22. 22. 4. SVs in 22,600 Individuals We need large SV studies: • Common vs. rare SVs • Inform GWAS studies • Ethnicity specific SVs • Catalog variability of regions • MHC, LPA, etc. [Figure: CHR6: average SV allele frequency per 100 kb; x-axis: position; y-axis: allele frequency; MHC and LPA marked; # SVs shared across individuals]
  23. 23. Challenges 1. Accurate prediction of SVs: Sniffles (talk on Thursday!) 2. Comparison of SVs: SURVIVOR 3. Annotation and interpretation of SVs: SURVIVOR_ANT 4. Population analysis: SURVIVOR 5. Diploid Genome
  24. 24. 5.1 Diploid Genome Challenges: • Sequencing technology • Computational methods • Money HGSC Approach: GADGET 1. Sequence 100 individuals: PacBio + 10x Genomics 2. SV detection/genotyping 3. Phasing of SVs + SNPs 4. Population-based genotyping of SVs with short reads.
  25. 25. 5.2 Diploid Genome Selecting 100 samples • We want to maximize the outcome/ $ spent • Selection of samples (red) • Select top 100 (red) • Random selection of samples (boxplot) [Figure: histogram of # SVs per patient] [Figure: Random vs. informed choice of samples (CCDG); x-axis: number of chosen samples; y-axis: SVs in population (%); series: Informed, Top100, Random]
  26. 26. 5.3 Diploid Genome (Prototype)
  27. 27. Challenges/ Summary 1. Accurate prediction of SVs: Sniffles (Talk on Thursday!) 2. Comparison of SVs: SURVIVOR 3. Annotation and interpretation of SVs: SURVIVOR_ANT 4. Population analysis: SURVIVOR 5. Diploid Genome: GADGET All methods are available: https://github.com/fritzsedlazeck https://fritzsedlazeck.github.io/ [Figure: Random vs. informed choice of samples (CCDG); x-axis: number of chosen samples; y-axis: SVs in population (%); series: Informed, Top100, Random]
  28. 28. Acknowledgments: William Salerno, Stephen Richards, Richard Gibbs, Michael Schatz, Schatz lab, Daniel Jeffares, Jürg Bähler, Christophe Dessimoz, Justin Zook, GiaB consortium
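
The consensus step from slides 16 and 17 (merge calls from several callers, keep SVs reported by at least 2 of them) can be sketched in Python. This is a minimal illustration and not SURVIVOR's actual implementation: the `SV` record, the `merge_calls` function, the caller names, and the 1,000 bp breakpoint tolerance are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical minimal SV record; real VCF records carry far more fields.
SV = namedtuple("SV", ["chrom", "start", "end", "svtype", "caller"])


def merge_calls(calls, max_dist=1000, min_callers=2):
    """Group calls of the same type whose breakpoints lie within max_dist bp,
    then keep groups supported by at least min_callers distinct callers."""
    groups = []
    for sv in sorted(calls, key=lambda s: (s.chrom, s.svtype, s.start)):
        for group in groups:
            rep = group[0]  # first call in a group acts as its representative
            if (rep.chrom == sv.chrom and rep.svtype == sv.svtype
                    and abs(rep.start - sv.start) <= max_dist
                    and abs(rep.end - sv.end) <= max_dist):
                group.append(sv)
                break
        else:
            # No existing group matched, so this call starts a new one.
            groups.append([sv])
    return [g for g in groups if len({sv.caller for sv in g}) >= min_callers]


# Toy input (made up): two callers agree on one deletion, the rest are singletons.
calls = [
    SV("chr1", 10_000, 12_000, "DEL", "caller_a"),
    SV("chr1", 10_150, 12_100, "DEL", "caller_b"),
    SV("chr1", 50_000, 50_300, "INS", "caller_a"),
    SV("chr2", 10_000, 12_000, "DEL", "caller_c"),
]
consensus = merge_calls(calls)
```

The pairwise scan over groups is quadratic and only suitable for small inputs; a production tool would index calls by position.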

Editor's Notes

  • Welcome everyone. My name is Fritz Sedlazeck and I am currently working at the Human Genome Sequencing Center @ Baylor in Houston.
    Today I am going to talk about challenges in SV calling and our pursuit of the diploid genome. Before I dive into that, let me briefly introduce myself and my scientific interests.
  • I am a computational biologist mainly focusing on method development for mapping and assembly of short and long reads.
    Detection of genomic variations, focusing on SVs.
    Benchmarking and detecting biases in methods and sequencing technologies.
    And applying all of these to obtain more insights into molecular biology.
    The focus of the talk today is structural variations.

  • Only when we account for all variations will we be able to obtain deeper insights.
  • However, there are certain challenges to get there.
  • Look at the Venn again. Probably each caller has some fraction of true and false positives. This reflects the complexity of calling SVs.
  • Structural variations are generally loosely defined as 50 bp+ …..

    Short-read-based calling is often said to lack sensitivity and to have a large FDR!
  • Avg. read length for PacBio is much higher nowadays!
    Many deletions on Nanopore -> 11,394 (96.19%) were deletions, and the majority (89.72%) were within a homopolymer.
    Check indels for significance.


    @ONT INS: probably missing repetitive regions due to caller??

  • Sequenced 3 times in different labs!
  • This highlights a huge bias in short reads and explains why Illumina is not enough!
  • Look at the Venn again. Probably each caller has some fraction of true and false positives.
    So one possible solution could be to combine (make a consensus call).
  • So now that we can call SVs across different callers, how can we annotate and rank these calls?
  • AC131097.3: Long non-coding
    SMYD3: Histone methyltransferase
  • Now we have the calls and annotation. Are these SVs common in the population or unique to my sample??
  • CCDG: 2 hours 30 min.

    Tomato: 6 min.


    CCDG: NHGRI program; 30x WGS
  • #Singletons/stats..
    Fraction of rare vs. common.

    Patients -> Individuals.
  • Now that we have methods to identify SVs, reduce FDR, annotate, and have a mechanism to know if they are rare or common, we need to understand their context -> Diploid genome.
  • #SV singletons, #SV two samples?

    Put greedy curve without crossing out.


    Interesting:
    Do the curve for 3 sd.
    More time!
  • Display slide during questions.

    Check with Will!!
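
The sample-selection notes above refer to a greedy curve compared against random selection (slide 25). The talk does not say how the informed choice was computed; one common way to approach it is a greedy set-cover heuristic, sketched here in Python. The function names and the toy data are hypothetical and serve only to illustrate the idea.

```python
import random


def greedy_select(sample_svs, k):
    """Pick up to k samples, each time taking the one that adds the most
    SVs not yet covered (a standard greedy set-cover heuristic)."""
    covered, chosen = set(), []
    remaining = dict(sample_svs)
    for _ in range(min(k, len(remaining))):
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


def random_select(sample_svs, k, seed=0):
    """Baseline: pick k samples uniformly at random."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(sample_svs), min(k, len(sample_svs)))
    covered = set().union(*(sample_svs[s] for s in chosen))
    return chosen, covered


# Toy data: sample name -> set of SV identifiers (entirely made up).
sample_svs = {
    "s1": {"sv1", "sv2", "sv3"},
    "s2": {"sv1", "sv2"},
    "s3": {"sv4", "sv5"},
    "s4": {"sv3"},
}
all_svs = set().union(*sample_svs.values())
chosen, covered = greedy_select(sample_svs, k=2)
fraction = len(covered) / len(all_svs)  # share of population SVs captured
```

Plotting `fraction` for increasing `k`, next to the spread of `random_select` over many seeds, would give a curve of the same shape as the one on the slide.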
