Forsharing cshl2011 sequencing

Short overview talk on exome and genome sequencing and DNAse-seq.

Published in: Technology

  1. High-Resolution Views of Cancer Genomes
  2. The Central Dogma
  3. +
  4. Your Nature Paper
  5. Our First Experiment
  6. Overview of BAC in the Genome
  7. Sequencing a BAC
  8. Sequence Coverage
  9. Repeats
  10. Repeats
  11. Repeats are not created equal
  12. Genomic Sequencing: Targeting the Exome
  13.
      • Long oligos synthesized on arrays (DNA)
      • RNA baits synthesized from DNA oligo template
      • RNA baits hybridized to DNA sequencing library
      • Targets captured using beads and biotin-labeled baits
      • RNA bait degraded, leaving sequencing library enriched for target regions
  14. Data Flow
      • FASTQ files generated by Illumina pipeline
      • Aligned to reference genome (hg18, excluding _random, unmapped, and hap) using Novoalign
      • SAM/BAM used extensively
      • Follow Broad Institute GATK pipeline for exome capture
      • Use Picard Java library for quality assessment
      • Processed BAM files available via local HTTP for browsing
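
The reference filtering mentioned on this slide can be sketched in a few lines. This is a minimal illustration, not the pipeline's own code: the function name `keep_contig` and the substring patterns are assumptions about how such contigs are named in a UCSC-style hg18 FASTA.

```python
def keep_contig(name: str) -> bool:
    """Return True for primary contigs, False for the excluded classes.

    The marker substrings are assumptions about contig naming,
    not values taken from the slides.
    """
    lowered = name.lower()
    excluded_markers = ("_random", "_hap", "unmapped")
    return not any(marker in lowered for marker in excluded_markers)


# Toy list of contig names, for illustration only
contigs = ["chr1", "chr1_random", "chr6_cox_hap1", "chrX", "chrM"]
primary = [c for c in contigs if keep_contig(c)]
print(primary)  # ['chr1', 'chrX', 'chrM']
```

Filtering the reference once, before alignment, means every downstream BAM file shares the same contig list.
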
  15. Data Pipeline
      • Samtools import
      • Samtools sort
      • Picard MarkDuplicates
      • GATK Indel Realignment
      • GATK Quality Recalibration
      • Picard QC metrics
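
To make the duplicate-marking step concrete, here is a deliberately simplified sketch of the idea. It is not Picard's actual algorithm: it assumes single-end reads, groups them by chromosome, start position, and strand, and keeps the read with the highest summed base quality. The `Read` type and `mark_duplicates` function are hypothetical names introduced for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    chrom: str
    start: int          # leftmost aligned position
    reverse: bool       # True if aligned to the reverse strand
    base_quals: tuple   # per-base Phred qualities


def mark_duplicates(reads):
    """Return the set of read names flagged as duplicates (simplified)."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.start, read.reverse)].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the read with the highest total base quality as the representative
        best = max(group, key=lambda r: sum(r.base_quals))
        duplicates.update(r.name for r in group if r.name != best.name)
    return duplicates


# Toy reads, for illustration only
reads = [
    Read("r1", "chr1", 100, False, (30, 30, 30)),
    Read("r2", "chr1", 100, False, (35, 35, 35)),
    Read("r3", "chr1", 100, True, (20, 20, 20)),
    Read("r4", "chr2", 100, False, (30, 30, 30)),
]
print(mark_duplicates(reads))  # {'r1'}
```

Only `r1` is flagged: it shares position and strand with `r2` but has lower total quality, while `r3` is on the opposite strand and `r4` is on another chromosome.
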
  16. Realignment around Indels
      • The problem
          • Aligners align each read independently
          • Potentially leads to increased error rates around indels
      • A potential solution
          • Locally realign reads in regions that might harbor an indel
          • Goal is to align reads overlying indels more accurately, reducing errors in each read and, in turn, reducing SNV call error rates
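
The first half of this approach, finding "regions that might harbor an indel", can be sketched from CIGAR strings alone. The code below only locates candidate regions; it does not perform the realignment itself, and it is not GATK's implementation. The function names, the 10-base padding, and the single-chromosome simplification are assumptions made for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_positions(start, cigar):
    """Yield reference positions where a read has an insertion or deletion."""
    ref_pos = start
    for length, op in CIGAR_OP.findall(cigar):
        if op in "ID":
            yield ref_pos
        if op in "MDN=X":  # operations that consume reference bases
            ref_pos += int(length)


def candidate_intervals(alignments, pad=10):
    """Merge padded indel positions into (start, end) target intervals.

    alignments: iterable of (start, cigar) pairs on one chromosome.
    """
    positions = sorted(
        pos for start, cigar in alignments for pos in indel_positions(start, cigar)
    )
    intervals = []
    for pos in positions:
        lo, hi = max(0, pos - pad), pos + pad
        if intervals and lo <= intervals[-1][1]:
            # Overlaps the previous interval: extend it
            intervals[-1] = (intervals[-1][0], max(intervals[-1][1], hi))
        else:
            intervals.append((lo, hi))
    return intervals


# Toy alignments, for illustration only
alignments = [(100, "20M2D30M"), (105, "15M1I34M"), (500, "50M"), (480, "10M3D40M")]
print(candidate_intervals(alignments))  # [(110, 130), (480, 500)]
```

The first two reads disagree about whether the event at position 120 is a deletion or an insertion, which is exactly the kind of region a local realigner would revisit.
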
  17. Quality Recalibration
      • Since most SNV callers rely on quality scores to estimate error probabilities, having the best possible estimates of error rates is important
      • Reported error rates from the Illumina sequencer generally reflect technical parameters of the base-calling process, but not other systematic biases
      • Quality recalibration can include covariates to account for systematic biases
          • Cycle count, dinucleotide context, original quality, and sample/library variables
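
The core idea of recalibration can be sketched as follows: bin base calls by covariates, count how often each bin disagrees with the reference, and convert the observed error rate back to a Phred score using Q = -10 log10(p). This is a minimal sketch that uses only two covariates (reported quality and cycle) and add-one smoothing; it is not the GATK implementation, and it assumes mismatches at known polymorphic sites have already been excluded.

```python
import math
from collections import defaultdict


def phred_to_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def prob_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def empirical_qualities(observations):
    """Estimate empirical quality per (reported_q, cycle) bin.

    observations: iterable of (reported_q, cycle, is_mismatch) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # [mismatches, total bases]
    for reported_q, cycle, is_mismatch in observations:
        bin_counts = counts[(reported_q, cycle)]
        bin_counts[0] += int(is_mismatch)
        bin_counts[1] += 1
    # Add-one smoothing keeps the error rate above zero in bins with no mismatches
    return {
        key: prob_to_phred((mismatches + 1) / (total + 2))
        for key, (mismatches, total) in counts.items()
    }


# Toy data: 98 bases reported as Q30 at cycle 1, none mismatching
table = empirical_qualities([(30, 1, False)] * 98)
print(round(table[(30, 1)], 1))  # 20.0
```

With only 98 observations the smoothed estimate is Q20 rather than the reported Q30, which shows how sparsely populated bins are pulled toward a conservative value.
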
  18. Variant Calling and Evaluation: A developing art
  19. Sequencing Tumor/Normal Pairs
  20. Good SNP
  21. Suspect Variant
  22. Somatic (tumor only) Variant
  23. Likely False Positive (normal only)
  24. LOH
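
The categories named on the preceding slides can be illustrated with a toy classifier based on the fraction of reads carrying the alternate allele in each sample. The slides do not give the rules that were actually used, so the function `classify_site` and every threshold below are assumptions for illustration only.

```python
def classify_site(normal_alt_frac, tumor_alt_frac,
                  present=0.1, het_low=0.3, het_high=0.7, hom=0.9):
    """Label a site from alternate-allele fractions in a tumor/normal pair.

    All thresholds are hypothetical defaults, not values from the talk.
    """
    in_normal = normal_alt_frac >= present
    in_tumor = tumor_alt_frac >= present

    if not in_normal and not in_tumor:
        return "no variant"
    if in_tumor and not in_normal:
        return "somatic (tumor only)"
    if in_normal and not in_tumor:
        return "likely false positive (normal only)"
    # Variant seen in both samples: heterozygous in normal but
    # near-homozygous in tumor suggests loss of heterozygosity
    if het_low <= normal_alt_frac <= het_high and tumor_alt_frac >= hom:
        return "LOH"
    return "germline (both samples)"


print(classify_site(0.0, 0.4))   # somatic (tumor only)
print(classify_site(0.5, 0.95))  # LOH
```

A real caller would also weigh coverage, base and mapping quality, and strand bias before trusting any of these labels.
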
  25. NCI60 Exome Sequencing: No Normals Available!
  26. Variants by Genomic Location
  27. All Coding Variants
  28. Type 1: in dbSNP, Type 2: not in dbSNP
  29. Coding, novel (no dbSNP)
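
When no matched normal is available, splitting variants by dbSNP membership is one way to separate likely germline polymorphisms from novel calls. The Type 1 / Type 2 split can be sketched as a set lookup; the variant key format `(chrom, pos, ref, alt)` and the function name are assumptions, and the records are made-up toy values.

```python
def split_by_dbsnp(variants, dbsnp_keys):
    """Return (type1, type2): variants found in dbSNP and variants that are not."""
    type1, type2 = [], []
    for variant in variants:
        (type1 if variant in dbsnp_keys else type2).append(variant)
    return type1, type2


# Toy records, for illustration only
dbsnp_keys = {("chr1", 1000, "A", "G")}
variants = [("chr1", 1000, "A", "G"), ("chr2", 2000, "C", "T")]

type1, type2 = split_by_dbsnp(variants, dbsnp_keys)
print(len(type1), len(type2))  # 1 1
```
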
  30. Copy Number from Exomes
  31. Complete Genome Sequencing: Complete Genomics Data
  32. Data
      • Delivery
          • Results via USB
      • Storage
          • Sizes are LARGE
          • 400 GB per sample as delivered with raw reads included
          • Should use 2-location backed-up storage
          • Not trivial to find such storage, so might resort to multiple USB drives
          • Minimize:
              • Data movement
              • Keeping multiple copies indefinitely
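
A quick back-of-the-envelope calculation shows why storage planning matters here. The 400 GB figure and the two storage locations come from the slide; the sample count of 10 is a hypothetical project size.

```python
GB_PER_SAMPLE = 400  # from the slide: size as delivered, raw reads included
COPIES = 2           # two-location backed-up storage


def storage_tb(n_samples, gb_per_sample=GB_PER_SAMPLE, copies=COPIES):
    """Total storage in decimal terabytes for n_samples genomes."""
    return n_samples * gb_per_sample * copies / 1000


print(storage_tb(10))  # 8.0
```

Ten genomes already need 8 TB once both copies are counted, before any derived files are written.
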
  33. Breakdown of Data Sizes
  34. Data
      • Delivery
      • Storage
      • Processing
          • Data are typically tab-delimited text files, so Excel can be useful for examining individual small files
          • Generally, command-line tools are needed
          • MacOS and Linux are the only supported operating systems, but Windows might work
          • Some analyses (snpdiff) require large memory
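
For files too large for Excel, tab-delimited data can be streamed row by row with Python's standard `csv` module. This is a small sketch: the column names in the example are hypothetical, and real delivered files may carry extra header or comment lines that would need to be skipped first.

```python
import csv
import io


def count_by_column(handle, column):
    """Stream a tab-delimited file and tally the values in one column."""
    counts = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


# In-memory stand-in for a file on disk; column names are made up
example = io.StringIO(
    "chromosome\tbegin\tend\tvarType\n"
    "chr1\t100\t101\tsnp\n"
    "chr1\t200\t202\tdel\n"
    "chr2\t50\t51\tsnp\n"
)
print(count_by_column(example, "varType"))  # {'snp': 2, 'del': 1}
```

Because the reader yields one row at a time, memory use stays flat regardless of file size.
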
  35. Directory Structure
  36. Workflows
      • Tumor/Normal
          • Copy Number
          • Structural Variation
          • Annotated Somatic Variants
      • Germline
          • List of annotated genotypes per individual, summarized into a single file that can be used for filtering
  37. Germline Workflow
  38. Germline Workflow
      • Output
      • Future directions
          • Be “smarter” about inheritance framework
          • Further refinements of comparison to other data types (exomes, SNP arrays, RNA-seq)
  39. Tumor/Normal Workflow
  40. Medvedev et al., Nature 2009
  41. Frequent genetic alterations in three critical signalling pathways. The Cancer Genome Atlas Research Network, Nature 000, 1-8 (2008), doi:10.1038/nature07385
  42. Chromatin
      • Chromatin is the complex of protein and DNA that makes up the chromosomes.
      • It is not a static structure.
  43.
      • DNAse is an enzyme that cuts DNA at locations where DNA is accessible
      • These “accessible” regions have been associated with open chromatin
      • Regions of open chromatin are necessary for transcriptional and regulatory machinery to have access to gene neighborhoods and facilitate transcription
  44. DNAse Hypersensitivity
      • Method for finding regions of “open” chromatin
      • In data published with the ENCODE consortium, DNAse hypersensitive (HS) sites were shown to be correlated with:
          • Histone modification
          • Transcription start sites
          • Early replicating regions
          • Transcription factor binding sites (experimentally determined by ChIP/chip, etc.)
      Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. The ENCODE Consortium. Nature, 2007.
  45. DNAse-chip Method
      Crawford, G.E., Davis, S., Scacheri, P.C., Renaud, G., Halawi, M.J., Erdos, M.R., Green, R., Meltzer, P.S., Wolfsberg, T.G., and Collins, F.S. Nat Methods, 2006
  46. DNAse-Seq Method
      Crawford, G.E., Davis, S., Scacheri, P.C., Renaud, G., Halawi, M.J., Erdos, M.R., Green, R., Meltzer, P.S., Wolfsberg, T.G., and Collins, F.S. Nat Methods, 2006
  47. DNAse Sites Relative to Genes
  48. DNAse HS Sites and Gene Expression
      • DNAse HS sites near transcription start sites are associated with actively transcribed genes.
  49. Nucleosome Positioning
      • Distances between sequences in non-DNAse HS regions have an oscillating pattern whose period corresponds to a single turn of the double helix
      • DNAse is known to cut preferentially in the minor groove, which is exposed every 10.4 bases when wrapped around a nucleosome
      • A nucleosome is wrapped by 147 base pairs of DNA
      • Implication: Nucleosomes are positioned in a highly organized, precise manner
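
The oscillating pattern described on this slide comes from a histogram of distances between sequenced cut sites. The sketch below shows how such a histogram could be built; it is not the analysis code behind the slide, and the cut-site positions are synthetic. The 10.4-base and 147-bp constants are taken from the slide.

```python
from collections import Counter

HELICAL_PERIOD = 10.4  # bases between minor-groove exposures (from the slide)
NUCLEOSOME_BP = 147    # base pairs wrapped around one nucleosome (from the slide)


def distance_histogram(positions, max_distance=50):
    """Count pairwise distances (up to max_distance) between cut sites."""
    ordered = sorted(positions)
    hist = Counter()
    for i, left in enumerate(ordered):
        for right in ordered[i + 1:]:
            gap = right - left
            if gap > max_distance:
                break  # later sites are even farther away
            hist[gap] += 1
    return hist


# Synthetic cut sites spaced roughly one helical turn apart
sites = [0, 10, 21, 31, 42]
hist = distance_histogram(sites)
print(hist[21])  # 3

# Number of helical turns in the DNA wrapped around one nucleosome
print(round(NUCLEOSOME_BP / HELICAL_PERIOD, 1))  # 14.1
```

On real data, peaks near multiples of about 10.4 bases in this histogram would be the signature of cuts landing on the exposed face of nucleosome-bound DNA.
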
  50. The Last Mile