0
The	
  Future	
  of	
  DNA	
  Sequencing	
  
Technology	
  
Graham	
  Taylor	
  
Melbourne	
  University,	
  Human	
  Vari...
Context	
  and	
  Topics	
  
1.  Technology	
  review	
  &	
  IdiosyncraFc	
  selecFon	
  
of	
  noteworthy	
  development...
An	
  UlFmate	
  Goal	
  for	
  Sequence	
  analysis?	
  
For	
  sequencing	
  
–  Chromosome-­‐length	
  reads	
  
–  Per...
Research,	
  translaFon	
  and	
  service	
  
•  Original	
  
•  Surprising	
  
•  >80%	
  accurate	
  
•  Numerator-­‐dri...
Cost	
  and	
  performance	
  
cost	
  per	
  base	
   Illumina	
  share	
  price	
  
Now	
  is	
  the	
  winter	
  of	
  ...
The	
  case	
  for	
  disease-­‐centric	
  analysis	
  
•  $1,000	
  dollar	
  genomes	
  or	
  1,000	
  x	
  $1	
  intere...
Sequence	
  performance	
  and	
  clinical	
  needs	
  
number'of'reads
length'of'reads
Genetics Tumor-Analysis Microbiolo...
How	
  many	
  variants	
  per	
  exome?	
  
SNP	
  count	
   Study	
  
20,000	
   Choi	
  et	
  al.	
  PNAS	
  2009	
  
1...
Low	
  concordance	
  of	
  mulFple	
  variant-­‐calling	
  pipelines	
  	
  
O’Rawe	
  et	
  al.	
  Genome	
  Medicine	
 ...
Venn	
  diagrams	
  of	
  selected	
  CNV	
  detecFon	
  
methods	
  in	
  real	
  data	
  processing	
  
Duan	
  J,	
  Zh...
De	
  novo	
  Assembly	
  
(the	
  unfinished	
  genome)	
  
•  Genome	
  Res.	
  2014.	
  24:	
  688-­‐696	
  2014	
  Hudd...
Recent	
  past	
  and	
  future	
  
RIP	
   Coming	
  soon?	
  
SBS	
  
•  GnuBIO/BioRad	
  :emulsion	
  microfluidics	
  for	
  targeted	
  
sequencing	
  and	
  hotspot	
  analysis	
  o...
Currently	
  SBS	
  are	
  Market	
  Leaders	
  
•  Illumina	
  
•  Proton	
  Torrent	
  
•  PacBio	
  
PacBio	
  
•  English	
  et	
  al.	
  (2012)	
  Mind	
  the	
  Gap:	
  Upgrading	
  
Genomes	
  with	
  Pacific	
  Bioscien...
Nanopores	
  
•  Electronic	
  BioSciences:	
  developing	
  a	
  system	
  with	
  a	
  single/few	
  pores	
  
with	
  a...
Real	
  long	
  reads	
  Nanopore	
  sequencing	
  
8,476	
  base	
  single	
  read	
  
Not	
  producFon	
  ready	
  
30
40
50
60
70
30
40
50
60
70
30
40
50
60
70
30
40
50
60
70
30
40
50
60
70
30
40
50
60
70
30...
Electron	
  Microscopy	
  
•  ZS	
  GeneFcs:	
  directly	
  visualizes	
  the	
  sequence	
  of	
  DNA	
  molecules	
  
us...
Electron	
  Microscopy	
  
Progress	
  toward	
  an	
  aberraFon-­‐corrected	
  low	
  energy	
  
electron	
  microscope	
...
Hardware	
  Trends	
  
•  Clonal	
  sequencing	
  
– Increasing	
  accuracy	
  
– Increasing	
  read	
  lengths	
  
– Incr...
Increasing	
  read	
  counts	
  via	
  palerned	
  flow	
  cells	
  
•  Palerned	
  flow-­‐cells	
  useful	
  for	
  nucleic...
Palerned	
  flow	
  cells,	
  super	
  Poisson	
  kineFcs	
  
Pseudo-­‐long	
  reads	
  via	
  “molceculo”	
  
Genome	
  informaFcs	
  example..	
  
•  Does	
  Moleculo’s	
  technology	
  have	
  both	
  a	
  wet	
  lab	
  and	
  a	
...
Reducing	
  assembly	
  complexity	
  of	
  microbial	
  
genomes	
  with	
  single-­‐molecule	
  sequencing	
  identifyin...
Clinical	
  Drivers	
  
•  Manageable	
  workflow	
  
•  Cost	
  efficiency	
  
•  SensiFvity	
  and	
  specificity	
  
•  Ref...
AdapFng	
  NGS	
  to	
  purpose	
  
control'
expansion'
0'
200'
400'
600'
800'
1000'
1200'
(GGCCCC)4'
(GGCCCC)5'
(GGCCCC)6...
Library	
  ConstrucFon	
  
Covaris	
  optimising
lane Sample
M E-­‐gel	
  Quant	
  ladder
1 gDNA
2 A
3 B
4 C
5 50bp	
  lad...
NextEra	
  method	
  of	
  Library	
  ConstrucFon	
  
Higher	
  coverage	
  greater	
  reproducibility	
  
Coverage	
   Coefficient	
  of	
  variaFon	
  
Can	
  	
  we	
  capture	
  coverage	
  report	
  dosage	
  to	
  
diagnosFc	
  standards?	
  samples	
  
targets	
  
samp...
XX	
  vs.	
  XY	
  
8	
  Female	
  cases	
  and	
  16	
  Male	
  cases	
  showing	
  reproducibility	
  of	
  coverage	
  ...
EPCAM	
  exon	
  9	
  to	
  MSH2	
  exon	
  8	
  and	
  EPCAM	
  
exon	
  9	
  to	
  MSH2	
  exon	
  1	
  deleFons	
  
0"
...
IntegraFng	
  data	
  handling	
  with	
  sequencing	
  operaFon	
  
Partly	
  auto-­‐fills	
  metadata	
  file	
  
(based	
...
AutomaFc	
  data	
  populaFon	
  for	
  sample	
  sheets	
  
HiSeq	
   MiSeq	
  
ProducFon:	
  
Taffy	
  
Centos	
  	
  
11TB	
  
Development:	
  
S’Box	
  
Centos	
  
1	
  TB	
  Syste...
NGS-­‐DB	
  
Projects	
  
Project_ID	
  
Project_DescripFon	
  
User	
  
Experiments	
  
Exp_ID	
  
Project_ID	
  
Exp_Des...
AlternaFves	
  to	
  read	
  mapping	
  and	
  
alignment	
  
•  Grouped	
  read	
  tesFng	
  
– Amplivar,	
  AmbiVert	
  ...
coverage
statistics
each amplicon read
sorted by primers
grouped amplicon
variants
grab amplicons
sort by locus
group ampl...
Amplivar	
  vs.	
  Alignment	
  
Amplivar	
   Alignment	
  (e.g.	
  BWA)	
  
Groups	
  reads	
   Uses	
  individual	
  rea...
quality_filter	
  
Hard	
  coded	
  quality	
  filters	
   Output	
  FASTA	
  and	
  qcore	
  files	
  
FASTA	
  files	
  (.fn...
fastq	
  and	
  fasta	
  
>MISEQ-2:20:000000000-A61NM:1:1101:12299:1738 1:N:0:some_name!
TGCGTCATCATCTTTGTCATCGTGTACTACGCC...
group_fasta_reads	
  
•  @file_list	
  =	
  glob	
  "$merged_dir/*fna"	
  ;	
  
Grouped	
  reads	
  
3 !AGACAACTGTTCAAACTGATGGGACCCACTCCATCGAGATTTCACTGTAGCTAGACCAAAATAG!
!
1 !ACCACTTTTGGAGGGAGATTTCGCTCC...
Usual	
  suspects	
  file	
  
4	
  column	
  tab	
  separated	
  text	
  file	
  with	
  Unix	
  line	
  endings	
  
•  Colu...
Genotype	
  table	
  
library
BRAF+600+
wt+V600
BRAF+600+
c.1799T>
A+V600E
KRAS+12+
&+13+wt+
G12/G13
KRAS+12+
c.34G>A+
G12...
Amplicon	
  flank	
  file	
  
4	
  column	
  tab	
  separated	
  text	
  file	
  with	
  Unix	
  line	
  endings	
  
•  Colum...
Grouping	
  Reads	
  
amplivar –i * –j * –k *!
!
-i /storage/local/sandbox/working_directory!
-j usual_suspects.txt !
-k f...
AMPLIVAR	
  
SeqPrep:	
  	
  
Remove	
  adapters	
  &	
  
Merge	
  reads	
  
Filter	
  reads	
  by	
  quality	
  
Convert	...
AMPLIVAR	
  WRAPPER	
  
AMPLIVAR	
  
SeqPrep:	
  	
  
Remove	
  adapters	
  &	
  
Merge	
  reads	
  
Filter	
  reads	
  by...
AmpliVar	
  Required	
  tools	
  
•  SeqPrep	
  (C)	
  
•  Blat	
  (C)	
  
•  Samtools	
  (C)	
  
•  BamlePalign	
  (C++)	...
>1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTC734
>1_MPL1...
Sample Amplicon Read %/of/major/allele
MS0318_1164/Blood 1030_BRCA1_exon14/175/chr17:41226334C41226449 AAGAGGAGCTCATTAAGGT...
Clustal	
  alignment	
  &	
  
phylogeny	
  of	
  errors	
  
Merged	
  forward	
  and	
  reverse	
  reads	
  
PandaSeq	
  and	
  SeqPrep	
  can	
  merge	
  overlapping	
  read	
  pair...
Orthogonal	
  validaFon	
  without	
  Sanger	
  Scale
:chr17
--->
RefSeq Genes
RepeatMasker
10 baseshg19
41,223,07041,223,...
Downstream	
  processing	
  of	
  FASTA	
  files	
  
BWA,	
  BLAT,	
  annotaFon	
  
VEP	
  
MIST	
  
A schematic of the workflow used by MiST.
Subramanian S et al. Nucl. Acids Res. 2013;41:e154
Identification of potentially paralogous read pairs.
Subramanian S et al. Nucl. Acids Res. 2013;41:e154
Motivation for the use of Geoseq in variant calling.
Subramanian S et al. Nucl. Acids Res. 2013;41:e154
Comparison of MiST and GATK. Each box has three sets of numbers, from left to right they
are variant calls, (i) unique to ...
DIAMUND:	
  Direct	
  Comparison	
  of	
  
Genomes	
  to	
  Detect	
  Muta2ons	
  	
  
Figure 1. Outline of initial steps ...
Filtering	
  staFsFcs	
  
Table 1. Illustration of the Data Reduction at Each Step from Raw Reads to a Final Set of Mutate...
Panagopoulos	
  et	
  al.	
  
Plosone	
  2014	
  	
  Volume	
  9	
  (6)	
  e99439	
  
The	
  ‘‘Grep’’	
  Command	
  But	
 ...
Fig. 1. Sensing and locating short tandem repeats (STRs) in short reads. (A) An original short read. (B) An approximate ST...
Standards	
  
Human	
  Variome	
  Project	
  
Prototype	
  NGS	
  database	
  
Report	
  
Sharing	
  Experience	
  with	
  TruSight	
  One	
  
•  In	
  partnership	
  with	
  Illumina,	
  RCPA	
  and	
  the	
  HG...
Next	
  Steps	
  
•  Robust	
  standards	
  for	
  genomic	
  medicine	
  
•  Databases	
  and	
  data	
  content	
  
–  A...
Data	
  quality	
  classes	
  
DifferenFate	
  between	
  three	
  classes	
  of	
  data:	
  
	
  
The	
  Clinically	
  Rep...
Standards	
  for	
  AccreditaFon	
  of	
  DNA	
  
Sequence	
  VariaFon	
  Databases	
  
Quality	
  Use	
  of	
  Pathology	...
Pathogenicity	
  
1.  “Deleterious-­‐	
  and	
  Disease-­‐Allele	
  Prevalence	
  in	
  Healthy	
  Individuals:	
  
Insigh...
Table	
  1.	
  List	
  of	
  selected	
  CNV	
  detec2on	
  methods.	
  
Duan	
  J,	
  Zhang	
  J-­‐G,	
  Deng	
  H-­‐W,	
...
Summary	
  
•  Current	
  sequencing	
  technology	
  has	
  plenty	
  of	
  
room	
  for	
  improvement	
  w.r.t.	
  read...
Acknowledgments	
  
•  Genomic	
  Medicine	
  &	
  Centre	
  for	
  TranslaFonal	
  
Pathology,	
  University	
  of	
  Mel...
Targeted	
  Tumour	
  Sequencing:	
  	
  
© 2014 Illumina, Inc. All rights reserved.
BWA Enrichment, Version 1.0.0.1
Page ...
Called	
  Variants	
  
FFPE	
  Frozen	
  
Blood	
  
16,711	
  
2,095	
   154	
  
979	
  
3,641	
  
9,368	
   2,455	
  
Gene	
  List	
  
GFR
city.
rect
t of
es a
ittle
has
atly
is.78
is
iza-
pre-
fied
ain- Figure 3 Frequency of major driver m...
Filtering	
  Variants	
  
All	
  variants	
   None	
   Qual	
   Not	
  in	
  Blood	
  
Blood	
   9828	
   8551	
   NA	
  
...
EGFR	
  p.L858R	
  
EGFR	
  p.T790M	
  
ConfirmaFon	
  by	
  PCR	
  
0.0#
50.0#
100.0#
150.0#
200.0#
250.0#
EGFR_NM
_005228.3#T790#T790#W
T#
EGFR_NM
_005228.3#784#...
Gene	
  list	
  summary	
  
gene locus
covered,by,
capture,
panel? Observations Notes
AKT1 chr14:105,235,7611105,262,116 y...
Graham Taylor - The future of DNA sequencing technology
Graham Taylor - The future of DNA sequencing technology
Upcoming SlideShare
Loading in...5
×

Graham Taylor - The future of DNA sequencing technology

1,272

Published on

I will review the recent history of “Post‐Sanger” sequencing technology, and then make some wild and unjustified extrapolations into the future based on too few data points.
I will review some of the technologies on the horizon and ask how we can appraise them.
For example, if we can define sequence read quality as a composite of read length and base‐calling accuracy, recent trends have overwhelmingly been in the direction of quantity at the expense of quality. As a consequence a great deal of informatics effort has been expended in managing rather poor quality data. Of course the human
genome, along with many other genomes, is not particularly amenable to analysis, contain entities such as pseudogenes, non‐coding regions (sometimes referred to as “junk”, sometimes claimed to be functionally important) and short repeats. So how does the collision of a relatively refractory analyte like the human genome and an imperfect sequencing method result in a “genomics revolution”? What have we gained and what are the current limitations that need to be addressed in future technologies?
I will look at two examples of the impact of current and pending sequencing technology: tumour analysis in fixed and fresh tissue and the identification of allele expansions.

First presented at the 2014 Winter School in Mathematical and Computational Biology http://bioinformatics.org.au/ws14/program/

Published in: Science
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,272
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
205
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide

Transcript of "Graham Taylor - The future of DNA sequencing technology"

  1. 1. The  Future  of  DNA  Sequencing   Technology   Graham  Taylor   Melbourne  University,  Human  Variome   Project  (Australia),  Victorian  Clinical   GeneFcs  Laboratories  
  2. 2. Context  and  Topics   1.  Technology  review  &  IdiosyncraFc  selecFon   of  noteworthy  developments  and  trends  in   NGS  hardware  and  soPware  from  the   perspecFve  of  Genomic  Medicine  (so  mostly   human  genome)  in  the  context  of  meeFng   clinical  needs   Sadly,  not  covering  Transcriptomics,  ChIP-­‐seq,  non-­‐human   genomes   2.  ApplicaFons  and  implicaFons  for  diagnosFcs  
  3. 3. An  UlFmate  Goal  for  Sequence  analysis?   For  sequencing   –  Chromosome-­‐length  reads   –  Perfect  base  calling  accuracy   –  Each  molecule  is  read   –  Highly  parallel   For  analysis     –  De  novo  assembly   –  Well  curated  reference  resources   –  Data  integrated  with  other  biological  and  medical   resources  
  4. 4. Research,  translaFon  and  service   •  Original   •  Surprising   •  >80%  accurate   •  Numerator-­‐driven:  get   publicaFons   •  Bespoke   •  Proven   •  Predictable   •  >99.99%  accurate   •  Denominator-­‐driven  (cost   sensiFve)   •   Standardised  
  5. 5. Cost  and  performance   cost  per  base   Illumina  share  price   Now  is  the  winter  of  our  discount  tests  (unless  you  are  Illumina)  
  6. 6. The  case  for  disease-­‐centric  analysis   •  $1,000  dollar  genomes  or  1,000  x  $1  interesFng  regions?   •  How  to  validate  3.5x  109  tests   •  Sequencing  costs  are  not  limiFng   •  Quality  and  accuracy  are  incomplete   •  Perform  tests  for  a  (clinical)  reason    
  7. 7. Sequence  performance  and  clinical  needs   number'of'reads length'of'reads Genetics Tumor-Analysis Microbiology Sample/library,preparation 3 4 4 Base,calling,accuracy 5 5 3 De,novo,assembly 3 5 4 Detect,Rare,Events 3 5 5 Portability 2 3 4
  8. 8. How  many  variants  per  exome?   SNP  count   Study   20,000   Choi  et  al.  PNAS  2009   142,000   Mullikin  NIH,  unpublished  2010   50,000   Clark  et  al.  Nature  biotechnology  2011   125,000   Smith  et  al.  Genome  Biology  2011   100,000     Johnston  &  Biesecker  Human  Molecular  GeneFcs  2013   200,000  to  400,000   Yang  et  al.N  Engl  J  Med  2013   •  20-­‐fold  range   •  Exome  designs  vary   •  Likely  to  be  higher  variant  count  in  African  populaFons  as  the   reference  sequence  is  non-­‐African  
  9. 9. Low  concordance  of  mulFple  variant-­‐calling  pipelines     O’Rawe  et  al.  Genome  Medicine  2013,  5:28     SNV  concordance:  57.4%   Indel  concordance  26.8%  
  10. 10. Venn  diagrams  of  selected  CNV  detecFon   methods  in  real  data  processing   Duan  J,  Zhang  J-­‐G,  Deng  H-­‐W,  Wang  Y-­‐P  (2013)  ComparaFve  Studies  of  Copy  Number  VariaFon  DetecFon  Methods  for  Next-­‐GeneraFon  Sequencing   Technologies.  PLoS  ONE  8(3):  e59128.  doi:10.1371/journal.pone.0059128   hlp://www.plosone.org/arFcle/info:doi/10.1371/journal.pone.0059128  
  11. 11. De  novo  Assembly   (the  unfinished  genome)   •  Genome  Res.  2014.  24:  688-­‐696  2014  Huddleston  et  al.     –  Within  the  human  genome,  there  are  >900  annotated  genes   mapping  to  large  segmental  duplicaFons.  Such  genes  are   typically  missing  or  misassembled  in  working  draP  assemblies  of   genomes   –  The  widespread  adopFon  of  next-­‐generaFon  sequencing   methods  for  de  novo  genome  assemblies  has  complicated  the   assembly  of  repeFFve  sequences  and  their  organizaFon   –  resolved  regions  that  are  complex  in  a  genome-­‐wide  context   but  simple  in  isolaFon  for  a  fracFon  of  the  Fme  and  cost  of   tradiFonal  methods  using  long-­‐read  single  molecule,  real-­‐Fme   (SMRT)  sequencing  and  assembly  technology   –  SMRT  sequencing  of  large-­‐insert  clones  can  significantly   improve  sequence  assembly  within  complex  repeFFve  regions   of  genomes  
  12. 12. Recent  past  and  future   RIP   Coming  soon?  
  13. 13. SBS   •  GnuBIO/BioRad  :emulsion  microfluidics  for  targeted   sequencing  and  hotspot  analysis  of  rare  variants   •  LaserGen:  Lightning  Terminators™;  increased  accuracy,   longer  reads  and  faster  cycle-­‐Fmes  Nucleic  Acids  Res.  Oct   2007;  35(19):  6339–6349.TerminaEon  of  DNA  synthesis  by   N6-­‐alkylated,  not  3ʹ′-­‐O-­‐alkylated,  photocleavable  2ʹ′-­‐ deoxyadenosine  triphosphate  Weidong  Wu  et  al.   •  Qiagen/Intelligent  Biosystems   •  QuantuMDx:  Nat  Biotechnol.  2005  Oct;23(10):1294-­‐301.   MulFplexed  electrical  detecFon  of  cancer  markers  with   nanowire  sensor  arrays  Zheng  G,  Patolsky  F,  Cui  Y,  Wang   WU,  Lieber  CM  
  14. 14. Currently  SBS  are  Market  Leaders   •  Illumina   •  Proton  Torrent   •  PacBio  
  15. 15. PacBio   •  English  et  al.  (2012)  Mind  the  Gap:  Upgrading   Genomes  with  Pacific  Biosciences  RS  Long-­‐ Read  Sequencing  Technology.  PLoS  ONE  7(11):   e47768   •  Loomis  et  al  (Sequencing  the  unsequenceable:   Expanded  CGG-­‐repeat  alleles  of  the  fragile  X   gene  Genome  Research  (2012)  
  16. 16. Nanopores   •  Electronic  BioSciences:  developing  a  system  with  a  single/few  pores   with  a  very  fast  rate  of  sequencing  of  ~50kb/second   •  Genia:  DNA  polymerase  to  incorporate  nucleoFdes  with  PEG-­‐based   NanoTags.  As  the  bases  are  incorporated  the  NanoTags  are  cleaved,   allowing  them  to  travel  through  the  pore  where  they  can  be   measured,  generaFng  sequence-­‐specific  informaFon   •  IBM:  a  solid  state  nanopore  using  alternaFng  layers  of  metal  and   dielectric  material  to  control  the  rate  of  passage  through  the   nanopore   •   NABsys:  Modified  DNA  (e.g.  by  SBH)  read  via  Nanopore.    Not  yet   sequencing,  but  very  long  reads   •  NobleGen:  combinaFon  of  opFcal  detecFon  on  nanopores   •  Oxford  Nanopore:  exonuclease  and  strand-­‐based  nanopore   methods  
  17. 17. Real  long  reads  Nanopore  sequencing   8,476  base  single  read  
  18. 18. Not  producFon  ready   30 40 50 60 70 30 40 50 60 70 30 40 50 60 70 30 40 50 60 70 30 40 50 60 70 30 40 50 60 70 30 40 50 60 70 30 40 50 60 70 30 40 50 60 70 30 40 50 60 70 30 40 50 60 70 total time 273 seconds meansignal(picoamps) Wiggle  plot   Viterbi  algorithm  for  all  trinucloFdes    
  19. 19. Electron  Microscopy   •  ZS  GeneFcs:  directly  visualizes  the  sequence  of  DNA  molecules   using  electron  microscopy.  Proof  of  principle  by  the  use  of  a  dUTP   nucleoFdewith  a  single  mercury  atom  alached  to  the  nitrogenous   base.  This  modificaFon  is  small  enough  to  allow  very  long   molecules  with  labels  at  each  A-­‐U  to  be  seen  using  annular  dark-­‐ field  scanning  transmission  electron  microscopy  (ADF-­‐STEM)   Microsc  Microanal.  2012  Oct;18(5):1049-­‐53  DNA  base  idenFficaFon   by  electron  microscopy  Bell  DC,  Thomas  WK,  Murtagh  KM,  Dionne   CA,  Graham  AC,  Anderson  JE,  Glover  WR.   •  Reveo:  atomic  force  microscopy  called  the  Omni  Molecular   Recognizer  ApplicaFon  (OmniMoRA),  will  use  arrays  of  nano-­‐knife   edge  probes  to  measure  the  vibraFonal  characterisFcs  of  individual   bases  on  DNA  molecules  that  have  been  stretched  and  immobilized   on  a  surface  
  20. 20. Electron  Microscopy   Progress  toward  an  aberraFon-­‐corrected  low  energy   electron  microscope  for  DNA  sequencing  and  surface   analysis.  Mankos  M,  Shadman  K,  N'diaye  AT,Schmid  AK,   Persson  HH,  Davis  RW.  Vac  Sci  Technol  B  Nanotechnol   Microelectron.  2012  Nov;30(6):6F402   Imaging  of  reduced  5ʹ′-­‐/5DTPA/C-­‐20mer  on  Au  substrate:  (a)  (b)  AFM  images  at  two   magnificaFons,  (c)  height  profile  along  line  shown  in  (a),  (d)  height  profile  along  line   shown  in  (b),  and  (e)  LEEM  images  at  three  different  landing  energies.   Aiming  for  50  megabase  reads  with  phred  60  
  21. 21. Hardware  Trends   •  Clonal  sequencing   – Increasing  accuracy   – Increasing  read  lengths   – Increasing  read  counts   •  Single  molecule  sequencing   – PacBio   – Oxford  Nanopore  
  22. 22. Increasing  read  counts  via  palerned  flow  cells   •  Palerned  flow-­‐cells  useful  for  nucleic  acid   analysis  US  20120316086  A1   •  KineFc  exclusion  amplificaFon  of  nucleic  acid   libraries  WO  2013188582  A1   –  (i)  capturing  the  different  target  nucleic  acids  at  the   amplificaFon  sites  at  an  average  capture  rate,  and     –  (ii)  amplifying  the  target  nucleic  acids  captured  at  the   amplificaFon  sites  at  an  average  amplificaFon  rate,   wherein  the  average  amplificaFon  rate  exceeds  the   average  capture  rate.  
  23. 23. Palerned  flow  cells,  super  Poisson  kineFcs  
  24. 24. Pseudo-­‐long  reads  via  “molceculo”  
  25. 25. Genome  informaFcs  example..   •  Does  Moleculo’s  technology  have  both  a  wet  lab  and  a   bioinformaFcs  aspect?     •  Yes,  it’s  about  50:50.  One  doesn’t  make  sense  without  the   other.  There  are  two  components:  first,  there  is  a   molecular  biology  kit  and  protocol  that  takes  in  genomic   DNA  and  turns  it  into  a  sequencer-­‐compaFble  library.  APer   modifying  and  tagging  the  DNA,  this  allows  the  second   component,  the  algorithmic  part,  to  take  the  short  reads   and  reconstructs  long  reads  using  those  tags.  Those  are   two  separate  parts.  We  developed  both  on  campus,  and   improved  upon  them  aPer  we  started  the  company  last   year.    
  26. 26. Reducing  assembly  complexity  of  microbial   genomes  with  single-­‐molecule  sequencing  identifying DNA modification, such as methylation pat- terns, directly from the single-molecule sequencing data [15]. While adoption of this technology was initially slowed by the low accuracy of the single-pass sequences, recent advancements have demonstrated that this drawback can be algorithmically managed to produce assemblies of un- matched continuity [7,8,16]. Steady improvements to the PacBio technology continue to increase read lengths and yield [17], while future technologies promise to combine accuracy with length using either nanopores [11] or ad- vanced sample preparation [18]. Improved microbial gen- ome assembly is an obvious application of these recent developments in long-read sequencing. Genome assembly is the process of reconstructing a genome from many shorter sequencing reads [19-21]. It is typically formulated as finding a traversal of a properly defined graph of reads, with the ultimate goal of reconstructing the original genome as faithfully as pos- sible. Repeated sequence in the genome induces com- plexity in the graph and poses the greatest challenge to all assembly algorithms [22]. In addition, repeats are often the focus of analysis [23-25], making their correct assembly critical for subsequent studies. However, re- peats can only be resolved by a spanning read or read pair that is uniquely anchored on both sides. Read pairs are typically used due to their length potential (tens of kilobase pairs), but introduce additional complexity because they cannot be precisely sized. Alternatively, long-read sequencing promises to more accurately re- solve repeats and directly assemble genomes into their constituent replicons. Figure 1 shows the benefit of in- creasing read length when assembling Escherichia coli K12 MG1655. This genome can only be assembled into a single contig when the read length exceeds the size of the longest repeat in the genome, a multi-copy rDNA operon. The rDNA operon, sized around 5 to 7 kbp, is A B C Figure 1 Genome assembly graph complexity is reduced as sequence length increases. Three de Bruijn graphs for E. coli K12 are shown for k of 50, 1,000, and 5,000. The graphs are constructed from the reference and are error-free following the methodology of Kingsford et al. [27]. Non-branching paths have been collapsed, so each node can be thought of as a contig with edges indicating adjacency relationships that cannot be resolved, leaving a repeat- induced gap in the assembly. (A) At k = 50, the graph is tangled with hundreds of contigs. (B) Increasing the k-mer size to k = 1,000 significantly simplifies the graph, but unresolved repeats remain. (C) At k = 5,000, the graph is fully resolved into a single contig. The single contig is self-adjacent, reflecting the circular chromosome of the bacterium. Koren et al. Genome Biology 2013, 14:R101 Page 2 of 16 http://genomebiology.com/2013/14/9/R101 Long,  single-­‐molecule  reads  are   sufficient  for  the  complete  assembly   of  most  known  microbial  genomes.   The  assemblies  presented  here  have   good  likelihood  and  finished-­‐grade   consensus  accuracy  exceeding   99.9999%.   Koren  et  al.  Genome  Biology  2013,  14:R101  
  27. 27. Clinical  Drivers   •  Manageable  workflow   •  Cost  efficiency   •  SensiFvity  and  specificity   •  Referring  to  the  clinical  quesFon   •  Depth  vs.  breadth  of  coverage  
  28. 28. AdapFng  NGS  to  purpose   control' expansion' 0' 200' 400' 600' 800' 1000' 1200' (GGCCCC)4' (GGCCCC)5' (GGCCCC)6' Merging'forward'and'reverse'reads' Use  a  HiFi  polymerase   0" 200" 400" 600" 800" 1000" 1200" 1400" 1600" CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTTTATGTGATCAAGAAATCGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGTATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGTA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATTAAGAAATCGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA TAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATCTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA CAGAAAAAGTAGAAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA CAGAAAAAGTAGGAAATGGAAGTCTATGTGATCAAGAAATTGATAGCATTTGCA CAGAAAGAGTAGAAAATGGAAGTCTATGTGATCAAGAAATCGATAGCATTTGCA Discarding  rare  (wrong)  reads   DetecFng  allele  expansion   0" 1000" 2000" 3000" 4000" 5000" 6000" 7000" 8000" 110"111"112"113"114"115"116"117"118"119"120"121"122"123"124"125"126"127"128" Count& Nucleo+de&base&pair&length& "Control"" CG11"(MI095)" Sample CASE Gene Genomic/Coordinates RefSeq VARIANT/c Variant/p Allele/%/ (Amplivar) Allele/%/ (MiSeq/ Reporter) 1 22029 BRCA2 chr13_329116280G>T NM_000059.3 c.3136G>T p.Glu1046Ter 26.03% 22.37% 2 22814 BRCA1 chr17_412464430insGA NM_007300.3 c.1105_1106insTC p.Asp369ValfsTer6 76.27% 75.65% 3 23074 BRCA2 chr13_329144380delT NM_000059.3 c.5946delT p.Ser1982ArgfsTer22 74.36% 74.27% 4 23162 BRCA1 chr17_412760430delCT NM_007300.3 c.68_69delAG p.Glu23ValfsTer17 76.04% 78.69% 5 23165 BRCA2 chr13_329140660delAATT NM_000059.3 c.5574_5577delAATT p.Ile1859LysfsTer3 100.00% 95.53% 5 23165 BRCA1 chr17_412444380del75 NM_007300.3 c.3005_3079del75 p.Asn1002_Ile1027del 48.71% Not0found 6 23179 BRCA1 chr17_41215948_G>A NM_007300.3 c.5095C>T p.Arg1699Trp 85.24% 87.14% 7 23210 BRCA2 chr13_329688360insT NM_000059.3 c.9266_9267insT p.Val3091ArgfsTer20 60.06% 59.64% 8 23815 BRCA1 chr17_412562060insA NM_007300.3 c.374dupT p.Gln126ProfsTer16 100.00% 94.23% 9 23824 BRCA1 chr17_4125691550delA NM_007300.3 c.271delT p.Cys91ValfsTer28 81.62% 83.32% 10 23828 BRCA2 chr13_329128870delATTAC NM_000059.3 c.4395_4399delATTAC p.Leu1466PhefsTer2 93.20% 91.76% MSI  by  NGS   Genotyping  
  29. 29. Library  ConstrucFon   Covaris  optimising lane Sample M E-­‐gel  Quant  ladder 1 gDNA 2 A 3 B 4 C 5 50bp  ladder Peak  at  200bp 50  bp ladder,  350bp  bright   band B:  213.4  bp B:  208.8  bp C  :  201.4  bp
  30. 30. NextEra  method  of  Library  ConstrucFon  
  31. 31. Higher  coverage  greater  reproducibility   Coverage   Coefficient  of  variaFon  
  32. 32. Can    we  capture  coverage  report  dosage  to   diagnosFc  standards?  samples   targets   samples   autosomal  targets  chrX  targets   Inter-­‐sample   variaFon  is  low,   But  low  coverage   prevents  dosage   esFmaFon   Chr  X  is  a  good  first  pass  test  for  dosage  
  33. 33. XX  vs.  XY   8  Female  cases  and  16  Male  cases  showing  reproducibility  of  coverage  of  X  loci   within  each  group.    Loci  with  higher  SDs  were  associated  with  reduced  coverage.   0" 0.2" 0.4" 0.6" 0.8" 1" 1.2" 1.4" 1.6" 1.8" 2" 0" 10" 20" 30" 40" 50" 60" 70" 80" Average"XX" Average"XY" !0.5% 0% 0.5% 1% 1.5% 2% 2.5% 3% 0% 10% 20% 30% 40% 50% 60% 70% 80% AVGE%XX% AVGE%XY% 870   160  
  34. 34. EPCAM  exon  9  to  MSH2  exon  8  and  EPCAM   exon  9  to  MSH2  exon  1  deleFons   0" 0.2" 0.4" 0.6" 0.8" 1" 1.2" 1.4" 1.6" chr2:"47596595047596770"(EPCAM )" chr2:"47600552047600759"(EPCAM )" chr2:"47600897047601237"(EPCAM )" chr2:"47602323047602488"(EPCAM )" chr2:"47604103047604266"(EPCAM )" chr2:"47606042047606243"(EPCAM )" chr2:"47606858047607158"(EPCAM )" chr2:"47612255047612399"(EPCAM )" chr2:"47613661047613802"(EPCAM )" chr2:"47630281047630591"(M SH2)" chr2:"47635490047635744"(M SH2)" chr2:"47637183047637561"(M SH2)" chr2:"47639503047639749"(M SH2)" chr2:"47641358047641607"(M SH2)" chr2:"47643385047643618"(M SH2)" chr2:"47656831047657130"(M SH2)" chr2:"47672637047672846"(M SH2)" chr2:"47690120047690343"(M SH2)" chr2:"47693747047693997"(M SH2)" chr2:"47698054047698251"(M SH2)" chr2:"47702114047702459"(M SH2)" chr2:"47703456047703760"(M SH2)" chr2:"47705361047705708"(M SH2)" chr2:"47707785047708060"(M SH2)" chr2:"47709868047710138"(M SH2)" chr2:"48010323048010682"(M SH6)" chr2:"48018016048018312"(M SH6)" chr2:"48022983048023252"(M SH6)" chr2:"48025700048028344"(M SH6)" chr2:"48030509048030874"(M SH6)" chr2:"48031999048032216"(M SH6)" chr2:"48032707048032896"(M SH6)" chr2:"48033293048033547"(M SH6)" chr2:"48033541048033840"(M SH6)" chr2:"48033868048034049"(M SH6)"
  35. 35. IntegraFng  data  handling  with  sequencing  operaFon   Partly  auto-­‐fills  metadata  file   (based  on  samplesheet   metadata  file  name)   Checks  if  the  run  finishes   with  FASTQs  generated  on   board   Checks  if  the  analysis  is  complete,  then   if  metadata  contains  sufficient   informaFon  for  FASTQ  renaming.  Also   checks    and  starts  analysis  workflow  in   samplesheet     Renames  FASTQ  files   Monitor  data   monitor_data.sh   Monitor  run  status   monitor_run_status_miseq.sh   Monitor  metadata   monitor_metadata.sh   Rename  FASTQs   rename_files.py   Syncing   MiSeq   miseq_rsync.sh   Copies  miseq  run   directory  to  server     (every  hour)   Workflow   /storage/local/sw/ system_automagic/ workflows   Depending  on  the   workflow  defined   in  samplesheet   (  Project)  
  36. 36. AutomaFc  data  populaFon  for  sample  sheets  
  37. 37. HiSeq   MiSeq   ProducFon:   Taffy   Centos     11TB   Development:   S’Box   Centos   1  TB  Systems   3TB  Sandbox   ProducFon:   MiSeq  Reporter   PC   Windows   Backup:   Biocube   Centos   RAID6  20TB   Image  systems  at   install  and  aPer   update   Snapshots:   14  12-­‐hourly   4  weekly   12  monthly   3  yearly   Snapshots  (System,   Scripts  +  SoPware   only)   Transfer   Data  as   needed   Transfer  new   runs  and   backup   analysis   Backup:   ITS  Tape   (2  Copies)   Most  recent   Snapshot  every   6  month  for  3   years  rolling   Required  addi2onal  Resources:   1.  MiSeq  Reporter  Desktop  PC   2.  New  Desktop  PC  for  Seb   3.  ITS  Account   4.  (NAS  to  expand  Biocube)     Costs  Es2mate:   1.  2x  Desktop  PCs:  3  –  4k   2.  ITS  tapes:  max  11k/3years  for  20TB   3.  (NAS:  approx  10k  for  28TB)   Run  Raw  Data  to  keep:   1.  MiSeq:  Complete  Run  Folder   2.  HiSeq:  Run  Folder  excl.  Bcl,  image  files  etc.  (fastq  =  raw  data)   3.  Compress  data  aPer  1  month  (excl.  fastq)   4.  Discard  external  data  aPer  2  month  
  38. 38. NGS-­‐DB   Projects   Project_ID   Project_DescripFon   User   Experiments   Exp_ID   Project_ID   Exp_DescripFon   Prep_Method   Kit_ID  (if  applicable)   Pipeline(s)   LibOperator   Samples  [LIMS?]   Sample_ID   PaEent_ID   Family_ID   Sample_Type   Sample_Source   Sample_Conc   QC_Score   Libraries   Library_ID   Sample_ID   Exp_ID   Barcode_1   Barcode_2   LibConc   QC_P/F   Pool_IDs   MiSeqRun_IDs   HiSeqRun_Ids  (FC+Lane)   MiSeqRuns   MiSeqRun_ID   MiCartridge_ID   MiSeq_LoadConc   SeqOperator   MiFC_ID   MiSeq_RunDate   MiSeq_ClustDens   MiSeq_PFClustDens   MiSeq_Reads   MiSeq_PFReads   ...   HiSeqRuns   HiSeqRun_ID   HiFC_ID   SeqOperator   HiSeq_LoadConc   SBS_RGT   Cluster_RGT   cBot_RGT   HiSeq_RunDate   HiSeq_ClustDens   HiSeq_PFClustDens   HiSeq_Reads   HiSeq_PFReads   ...   HiSeq_Flowcells   HiFC_ID   HiFC_CatNo   HiFC_LotNo   HiFC_Type   HiFC_expiry   HiSeqRun_ID   MiSeq_Flowcells   MiFC_ID   MiFC_CatNo   MiFC_LotNo   MiFC_Type   MiFC_expiry   MiSeqRun_ID   HiSeq_SBS   (clust,cBot)   SBS_RGT   SBS_expiry   SBS_CatNo   SBS_LotNo   SBS_Type   HiSeqRun_ID   MiSeq_Cartridges   Cart_ID   Cart_CatNo   Cart_LotNo   Cart_TypeNo   Cart_expiry   MiSeqRun_ID   LibraryPools   Pool_ID   Library_IDs   Exp_IDs   Barcodes_1   Barcodes_2   PoolConc   QC_P/F   MiSeqRun_ID(s)   HiSeqRun_ID(s)   HiSeqRun_Ids  (FC+Lane)   User   Name   InsFtuFon   Contact  Info   Billing  Info   ...   Others   Pipelines   LibOperators   SeqOperators   Kits   PrepMethods  
  39. 39. AlternaFves  to  read  mapping  and   alignment   •  Grouped  read  tesFng   – Amplivar,  AmbiVert   •  Tiled  matching   – MIST   •  kmer  subtracFon   – Diamund   •  DetecFon  of  allele  expansions  
  40. 40. coverage statistics each amplicon read sorted by primers grouped amplicon variants grab amplicons sort by locus group amplicons Edit Disitance Read Counts, Read Distribution and Analysis Using the amplimer sequence to grab each amplicon is an alternative to querying the entire sequence output with the advantage that each set of reads should be more homogenous and amenable to grouping. Groups above a certain abundance (corresponding to the detection limit selected) and then be compared in detail with the canonical sequence using string comparison tools such as the Levenshtein (edit) distance, or by Smith-Waterman alignment. Using this approach we have confirmed that variants can be identified de novo, but with more interference from sequence errors than by grouped read typing. We have also shown that the current TruSeq Cancer Panel kit co-amplifies a region of chromosome 22 containing a perfect match to the pathogenic KIT Exon 11 c.1669T>A mutation. Artifactual data from the duplicated region risks the reporting of specious variants as false positive results.0" 1000" 2000" 3000" 4000" 5000" 6000" NRAS1_7_2"chr1"1 1525652831 15256531" NRAS8_13_3"chr1"1 1525873031 15258748" PIK 3CA1_20"chr3"1 7891687631 78916876" PIK 3CA2_21"chr3"1 7892155331 78921553" PIK 3CA3_22"chr3"1 7892798031 78927980" PIK 3CA4_11_23"chr3"1 7893607431 78936095" PIK 3CA12_24"chr3"1 7893886031 78938860" PIK 3CA13_20_25"chr3"1 7895200731 78952150" PIK 3CA13_20_26"chr3"1 7895200731 78952150" KIT 1_36"chr4"5 556176435 5561764" KIT 2_37"chr4"5 559218535 5592186" KIT 3_19_38"chr4"5 559346435 5593689" KIT 3_19_39"chr4"5 559346435 5593689" KIT 3_19_40"chr4"5 559346435 5593689" KIT 20_21_41"chr4"5 559422135 5594258" KIT 22_42"chr4"5 559551935 5595519" KIT 23_43"chr4"5 559749535 5597497" KIT 24_28_44"chr4"5 559932035 5599348" KIT 29_45"chr4"5 560269435 5602694" EGFR1_74"chr7"5 521108035 5211080" EGFR2_75"chr7"5 522182235 5221822" EGFR3_76"chr7"5 523304335 5233043" EGFR4_77"chr7"5 524167735 5241708" EGFR9_78"chr7"5 524241835 5242511" EGFR44_79"chr7"5 524900535 5249131" EGFR44_80"chr7"5 524900535 5249131" EGFR54_81"chr7"5 525951435 5259524" BRAF1_92"chr7"1 4045312131 40453193" BRAF28_93"chr7"1 4048139731 40481478" PTEN1_110"chr10"8 962424238 9624244" PTEN3_111"chr10"8 968530738 9685307" PTEN4_112"chr10"8 971189338 9711900" PTEN7_113"chr10"8 971761538 9717772" PTEN7_114"chr10"8 971761538 9717772" PTEN13_115"chr10"8 972071638 9720852" PTEN13_116"chr10"8 972071638 9720852" KRAS1_140"chr12"2 537856232 5378562" KRAS2_141"chr12"2 538027532 5380283" KRAS7_142"chr12"2 539825532 5398285" Average'Read'Count'per'Amplicon'+/6'SEM' Smith-Waterman BLAST 363#reads;#common#mutation:#p.G12A;#chr12:25,398,290C>G###c.35G>C TGTATCGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA 220#reads;#wildtype TGTATCGTCAAGGCACTCTTGCCTACGCCACCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA 64#reads;#nonEcoding#chr12:295,398,329C>T TGTATCGTCAAGGCACTCTTGCCTACGCCACCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCTTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA 39#reads;#two#errors,#non#adjacent,#one#corresponding#to#c.35G>C TGTATTGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA 27#reads;#two#errors,#non#adjacent,#one#corresponding#to#c.35G>C TGTATCGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGATTATATTAGAACATGTCACACATAAGGTTA 26#reads#two#errors,#non#adjacent,#one#corresponding#to#c.35G>C TGTATCGTCAAGGCACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGTAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA 17#reads;##two#nonEcoding#errors,#non#adjacent TGTATCGTCAAGGCACTCTTGCCTACGCCACCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCCGCAGGCTTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA 17#reads;#corresponding#to#c.35G>A TGTATCGTCAAGGTACTCTTGCCTACGCCAGCAGCTCCAACTACCACAAGTTTATATTCAGTCATTTTCAGCAGGCCTTATAATAAAAATAATGAAAATGTGACTATATTAGAACATGTCACACATAAGGTTA >1141_>EGFR9_78 chr7 55242418-55242511 1141 GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT >646_>EGFR9_78 chr7 55242418-55242511 646 GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT >60_>EGFR9_78 chr7 55242418-55242511 60 GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTTCATGGCT >57_>EGFR9_78 chr7 55242418-55242511 57 GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGTTTTGCTGTGTGGGGGTCCATGGCT >54_>EGFR9_78 chr7 55242418-55242511 54 GACTTTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT >51_>EGFR9_78 chr7 55242418-55242511 51 GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCTTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAGGAATTAAGAGAAGCAACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT GACTCTGGATCCCAGAAGGTGAGAAAGTTAAAATTCCCGTCGCTATCAAG---------------ACATCTCCGAAAGCCAACAAGGAAATCCTCGATGTGAGTTTCTGCTTTGCTGTGTGGGGGTCCATGGCT KRAS point mutation EGFR 15 base deletion EGFR 15 base deletion with Smith-Waterman alignment Amplicon  Sequencing:     treaFng  reads  as  groups  
  41. 41. Amplivar  vs.  Alignment   Amplivar   Alignment  (e.g.  BWA)   Groups  reads   Uses  individual  reads   Designed  for  amplicons   Designed  for  randomly   sheared  fragments   Works  with  FASTA  aPer   filtering   Works  with  FASTQ   Matches  against  target  list   Aligns  against  whole  genome   Alignment  is  an  opFonal  late   stage   Alignment  is  a  required  early   stage  
  42. 42. quality_filter   Hard  coded  quality  filters   Output  FASTA  and  qcore  files   FASTA  files  (.fna)  and  quality   files  (.csv)  wrilen  to  merged   folder  
  43. 43. fastq  and  fasta   >MISEQ-2:20:000000000-A61NM:1:1101:12299:1738 1:N:0:some_name! TGCGTCATCATCTTTGTCATCGTGTACTACGCCCTGATGGCTGGTGTGGTTTGGTTTGTGGTC! @MISEQ-2:20:000000000-A61NM:1:1101:12299:1738 1:N:0:some_name! TGCGTCATCATCTTTGTCATCGTGTACTACGCCCTGATGGCTGGTGTGGTTTGGTTTGTGGTC! +! AAAAADAFFFFFGGGFGGFGGFHFGFHHFGAEGIIIIIIIIIIIIIIIIIIIIIIIIIIIIII!   FASTQ   FASTA  
  44. 44. group_fasta_reads   •  @file_list  =  glob  "$merged_dir/*fna"  ;  
  45. 45. Grouped  reads   3 !AGACAACTGTTCAAACTGATGGGACCCACTCCATCGAGATTTCACTGTAGCTAGACCAAAATAG! ! 1 !ACCACTTTTGGAGGGAGATTTCGCTCCTGAAGAAAATTCGACAGCTTTGTGCCTGGCTAATTCT! ! 527!AGTGTATCCATTTTCTTCTCTCTGACCTTTGGCCCCCTACATCGACCATTCTGCAAGGTTAACA! ! 1 !CTCACCCCCAGACTGGGTTTTTAGGTCTCGGTTTACAAGTTTCTTATGCTGATGCTGAAAAAAA!
  46. 46. Usual  suspects  file   4  column  tab  separated  text  file  with  Unix  line  endings   •  Column  1:  RefSeq  idenFfier     •  Column  2:  cDNA  HGVS  nomenclature   •  Column  3:  codon  change  HGVS  nomenclature   •  Sequence  to  match   •  Usual  suspects  files  available  for  TruSeq  cancer  panel  and  for   PCRbrary   RefSeq cDNA*description codon*change sequence BRAF_NM_004333.4 c.1798G V600 CTCCATCGAGATTTCACTGTAGCTAGACCAAA BRAF_NM_004333.4 c.1798_1799delinsAA V600K CTCCATCGAGATTTCTTTGTAGCTAGACCAAA BRAF_NM_004333.4 c.1798_1799delinsAG V600R CTCCATCGAGATTTCCTTGTAGCTAGACCAAA BRAF_NM_004333.4 c.1798G>A V600K CTCCATCGAGATTTCATTGTAGCTAGACCAAA BRAF_NM_004333.4 c.1799T V600 CTCCATCGAGATTTCACTGTAGCTAGACCAAA BRAF_NM_004333.4 c.1799_1800delinsAA V600E CTCCATCGAGATTTTTCTGTAGCTAGACCAAA BRAF_NM_004333.4 c.1799_1800delinsAT V600D CTCCATCGAGATTTTACTGTAGCTAGACCAAA BRAF_NM_004333.4 c.1799T>A V600E CTCCATCGAGATTTCTCTGTAGCTAGACCAAA
  47. 47. Genotype  table   library BRAF+600+ wt+V600 BRAF+600+ c.1799T> A+V600E KRAS+12+ &+13+wt+ G12/G13 KRAS+12+ c.34G>A+ G12S KRAS+12+ c.34G>C+ G12R KRAS+12+ c.34G>T+ G12C KRAS+12+ c.35G>A+ G12D KRAS+12+ c.35G>C+ G12A KRAS+12+ c.35G>T+ G12V KRAS+13+ c.38G>A+ G13D DL130016FTGx120036 100.0 0.0 100.0 0.0 0.0 0.0 0.0 74.2 0.0 0.0 DL130028FTGx120036 100.0 0.0 100.0 0.0 0.0 0.0 0.0 30.6 0.0 0.0 DL130040FTGx120036 100.0 0.0 100.0 0.0 0.0 0.0 0.0 50.0 0.0 0.0 DL130052FTGx120036 100.0 0.0 100.0 0.0 0.0 0.0 0.0 130.8 0.0 0.0 DL130064FTGx120036 100.0 0.4 100.0 0.0 0.0 0.0 0.0 300.0 0.0 0.0 DL130076FTGx120036 100.0 0.0 100.0 0.0 0.0 0.0 0.0 126.5 0.0 0.0 DL130018FTGx120041 100.0 0.0 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 DL130030FTGx120041 100.0 0.0 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 DL130042FTGx120041 100.0 0.0 100.0 1.5 0.0 0.0 0.0 0.0 0.0 0.0 DL130054FTGx120041 100.0 0.0 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 DL130066FTGx120041 100.0 0.2 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 DL130078FTGx120041 100.0 0.0 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 DL130015FTGx120044 100.0 49.9 100.0 0.0 0.0 0.0 0.0 0.0 0.0 3.4 DL130027FTGx120044 100.0 45.9 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 DL130039FTGx120044 100.0 54.9 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 DL130051FTGx120044 100.0 32.5 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 DL130063FTGx120044 100.0 45.3 100.0 0.0 0.0 0.0 0.7 0.0 0.0 0.0 DL130075FTGx120044 100.0 43.2 100.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
  48. 48. Amplicon  flank  file   4  column  tab  separated  text  file  with  Unix  line  endings   •  Column  1:  amplicon  idenFfier  with  unique  number     •  Column  2:  size  (not  currently  calculated)   •  Column  3:  co-­‐ords   •  Flanks  with  (.*)  to  capture  the  sequence  between  amplimers   •  Flank  files  available  for  PCRbrary,  TruSeq  Cancer  Panel  and   Olga’s  TruSeq  panel   ID Size Co)ords Sequence 1_MPL1_2 175 chr1:43815006-43815137 GCCGTAGGTGCGCACG(.*)TCAGCAGCAGCAGG 2_NRAS1_7 175 chr1:115256526-115256653 GCATTCCCTGTGGTTTT(.*)AGAGTACAGTGCCATG
  49. 49. Grouping  Reads   amplivar –i * –j * –k *! ! -i /storage/local/sandbox/working_directory! -j usual_suspects.txt ! -k flanking_primers! Merges  read  pairs,  quality  filters,  converts  to  fasta,  groups  by  sequence   Counts  reads  corresponding  to  each  amplicon   Genotypes  according  to  the  usual  suspects  table  with  read  counts   Groups  reads  by  amplicon  for  mutaFon  scanning    
  50. 50. AMPLIVAR   SeqPrep:     Remove  adapters  &   Merge  reads   Filter  reads  by  quality   Convert  fastq2fasta   Group  fasta  reads   Genotype  grouped   reads   Grab  reads  by  flanks   Sort  reads  by  locus  
  51. 51. AMPLIVAR  WRAPPER   AMPLIVAR   SeqPrep:     Remove  adapters  &   Merge  reads   Filter  reads  by  quality   Convert  fastq2fasta   Group  fasta  reads   Genotype  grouped   reads   Grab  reads  by  flanks   Sort  reads  by  locus   Create  symbolic  links   for  fastqs   Create  subdirectories   for  each  fastq  file  pair   (R1  &R2)   Run  amplivar   Run  sort  amplicons   Run  blat  on  grouped,   sorted  reads   Convert  blat  psl2sam,   sam2bam   Run  bamleYalign   Inflate  bam   Run  VarScan   Run  VEP  
  52. 52. AmpliVar  Required  tools   •  SeqPrep  (C)   •  Blat  (C)   •  Samtools  (C)   •  BamlePalign  (C++)   •  VarScan  (java)   •  Bash   •  Perl   •  Python  
  53. 53. >1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTC734 >1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACAAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTC4 >1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGCTTACAGGCCTTCGGCTCCACCTGGTC4 >1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTC4 >1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTC4 >1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTC4 >1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAGCCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTC4 >1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGGTCTGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTC4 >1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCAGGGGTCACAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTC3 >1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCAAAGAGCGAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTC3 >1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCAAACCAAGAATGCCTGTTTACAGGCCTTCGGCTCCACCTGGTC3 >1_MPL1_2(175(chr1:43815006443815137 GGCGGTGGACGGAGATCTGGGGTCACAGAGCGAACCAAGAATGCCTGTCTACAGGCCTTCGGCTCCACCTGGTC3 Sorted,  locus  (amplicon)-­‐based  files  
  54. 54. Sample Amplicon Read %/of/major/allele MS0318_1164/Blood 1030_BRCA1_exon14/175/chr17:41226334C41226449 AAGAGGAGCTCATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATCTAGGTAATATTTCATCT 100% MS0323_1164/Frozen 1030_BRCA1_exon14/175/chr17:41226334C41226449 AAGAGGAGCTCATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATCTAGGTAATATTTCATCT 100% MS0313_1164/FFPE 1030_BRCA1_exon14/175/chr17:41226334C41226449 AAGAGGAGCTCATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATCTAGGTAATATTTCATCT 100% MS0313_1164/FFPE 1030_BRCA1_exon14/175/chr17:41226334C41226449 AAGAGGAGCTTATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATCTAGGTAATATTTCATCT 28.40% MS0313_1164/FFPE 1030_BRCA1_exon14/175/chr17:41226334C41226449 AAGAGGAGCTCATTAAGGTTGTTGATGTGGAGGAGCAACAGCTGGAAGAGTCTGGGCCACACGATTTGACGGAAACATCTTACTTGCCAAGGCAAGATTTAGGTAATATTTCATCT 26.20% Most  reads  are  error  free   FFPE  contaminates  the  evidence  
  55. 55. Clustal  alignment  &   phylogeny  of  errors  
  56. 56. Merged  forward  and  reverse  reads   PandaSeq  and  SeqPrep  can  merge  overlapping  read  pairs  to  make  them  even  longer   and  more  accurate.  In  an  unselected  100  base  read  pair  run  enriched  for  hereditary   cancer  genes  over  20%  of  the  reads  could  be  merged.  With  longer  reads  and  suitable   experimental  design  this  fracFon  could  be  increased.   Pairs&Processed: 18,456,760 Pairs&Merged: 4,029,383 Pairs&With&Adapters: 32,899 Pairs&Discarded: 646 percent&merged 21.83 locus 1'total 2'total chr17'41223060'41223115 2631 2777 chr17'41223076'41223130 2452 2674 chr17'41223101'41223146 2223 2501 0' 500' 1000' 1500' 2000' 2500' 3000' chr17'41223060'41223115' chr17'41223076'41223130' chr17'41223101'41223146' 1'total' 2'total' Probability  of    a  given  length  read  as  a  subset  of  a  longer  read    in  a  normal   distribuFon  of  longer  reads:  the  “minimum  substring  problem”   This  approach  might  make  sense  with    longer  reads   Average  read  length  from  101  to  150  bases  
  57. 57. Orthogonal  validaFon  without  Sanger  Scale :chr17 ---> RefSeq Genes RepeatMasker 10 baseshg19 41,223,07041,223,07541,223,08041,223,08541,223,09041,223,09541,223,10041,223,105 AGTCATCATACTCGTCGTCGACCTGAGACCCGTCTAAGA flanks Your Sequence from Blat Search RefSeq Genes Simple Nucleotide Polymorphisms (dbSNP 137) Found in >= 1% of Samples Duplications of >1000 Bases of Non-RepeatMasked Sequence Repeating Elements by RepeatMasker CCAGCAGTATCAGTA(.*) TATGAGCAGCAGCTG(.*)CAATTGGGGAACTTT AGATTCTGCAACTTT(.*)CAATGCAGAGGTTGA G YourSeq rs1799966 0" 50" 100" 150" 200" 250" 300" 350" 400" 450" 500" """GCCCAGAGTCCAGCTGCTGCTCATAC"""GCCCAG*G*GTCCAGCTGCTGCTCATAC" Forward  reads  inc  rs1799966  
  58. 58. Downstream  processing  of  FASTA  files   BWA,  BLAT,  annotaFon   VEP  
  59. 59. MIST  
  60. 60. A schematic of the workflow used by MiST. Subramanian S et al. Nucl. Acids Res. 2013;41:e154
  61. 61. Identification of potentially paralogous read pairs. Subramanian S et al. Nucl. Acids Res. 2013;41:e154
  62. 62. Motivation for the use of Geoseq in variant calling. Subramanian S et al. Nucl. Acids Res. 2013;41:e154
  63. 63. Comparison of MiST and GATK. Each box has three sets of numbers, from left to right they are variant calls, (i) unique to MiST, (ii) common to both platforms and (iii) unique to GATK. Filters are applied to remove calls occurring in public databases like dbSNP (17), 1000 Genomes (18) and a collection of already known private variants. Subramanian S et al. Nucl. Acids Res. 2013;41:e154
  64. 64. DIAMUND:  Direct  Comparison  of   Genomes  to  Detect  Muta2ons     Figure 1. Outline of initial steps in the Diamund algorithm, which identifies all k-mers unique to an affected proband and missing from both unaffected parents. The first step identifies k-mers, after which the proband data are filtered to remove k-mers resulting from sequencing errors. Intersecting all three sets identifies k-mers that are unique to the proband. sequencing, where the number of true but clinically irrelevant vari- ants will be 50 times greater. Here, we introduce a new method, DIAMUND (direct alignment for mutation discovery), which takes a different approach to exome and whole-genome analysis, and as a result produces dramatically smaller sets of candidate mutations. Rather than aligning all samples to the reference genome, we align the sequences directly to one another.Thismethodisdesignedprimarilyfortwotypesofanalyses: with X-linked diseases. Of these, 83% of the autosomal dominant and 40% of the X-linked mutations occurred de novo. In addition to generating fewer false positives, direct comparison between samples within a family, or between affected and unaf- fected tissue, allows for detection of mutations in regions that are entirely missing from the reference genome. It has already been shown that some human populations have large shared genomic regions, often spanning many megabases [Li et al., 2010], which are
  65. 65. Filtering  staFsFcs   Table 1. Illustration of the Data Reduction at Each Step from Raw Reads to a Final Set of Mutated Loci Data remaining at the end of step Filtering step Disease/normal pair Family trio BH1019 Family trio BH2041 Family trio BH2688 Number of reads from proband/diseased tissue 118,414,556 84,201,820 75,877,750 103,527,644 Number of 27-mers in proband/diseased tissue 911,738,627 795,477,167 517,272,851 1,088,610,020 Number of k-mers with count >10 77,903,885 61,805,320 64,719,150 113,066,951 Remove vector sequence 77,898,848 61,800,798 64,713,995 113,062,417 Eliminate k-mers found in reference GRC37 exome 17,821,359 9,385,347 10,730,208 50,535,681 Eliminate k-mers found in parent exomes/normal tissue 10,568 65,352 20,130 2,006 Identify reads containing k-mers 32,829 reads 148,496 46,454 4,404 Remove reads containing vector 15,260 125,648 38,799 2,760 Number of contigs after assembly 2,147 13,189 3,755 359 Number of contigs with >3 reads after merging contigs 279 contigs 1,437 701 71 Identify variants covered by reads from normal tissue 55 contigs 5 6 2 Keep variants with >5% coverage 42 variants 5 6 2 Find variants in coding regions 14 variants 3 3 1 Remove synonymous SNPs 10 variants 2 3 1 Step 1: We utilize an efficient parallel algorithm, Jellyfish [Marcais and Kingsford, 2011], for the k-mer counting step. This first step converts the reads for each exome (or genome) to a set of k-mers, which should in theory be a much smaller data set: the number of k-mers in an exome is equivalent to the length of the exome, 50–60 Mbp using current exome capture kits. However, the initial set is dramatically larger, due primarily to sequencing errors, which we address below. We sort each set of k-mers to allow for efficient intersection operations in subsequent steps. Sorting the diseased tissue, in the case of normal vs. diseased tissue comparisons). We also observe that any k-mer that occurs in the reference genome is probably not the cause of disease. We precompute all k-mers from the targeted regions of the GRC37 genome, and re- move these “normal” k-mers from the proband’s data. Note that this set can easily be expanded to include a larger set of variants known to be harmless.
  66. 66. Panagopoulos  et  al.   Plosone  2014    Volume  9  (6)  e99439   The  ‘‘Grep’’  Command  But  Not  FusionMap,  FusionFinder  or  ChimeraScan   Captures  the  CIC-­‐DUX4  Fusion  Gene  from  Whole  Transcriptome  Sequencing   Data  on  a  Small  Round  Cell  Tumor  with  t(4;19)(q35;q13)   Three  fusion-­‐finder  programs  FusionMap,  Fusion  Finder,  and  ChimeraScan   generated  a  plethora  of  fusion  transcripts  but  not  the  biologically  important   and  cancer-­‐specific  fusion  gene,  the  CIC-­‐  DUX4  chimeric  transcript.  It  was   necessary  to  use  the  ‘‘grep’’  command-­‐line  uFlity  to  siP  out  the  laler  from  the   many  data  produced  by  the  automated  algorithms.  CytogeneFc,  FISH,  and   clinico-­‐pathologic  tumor  features  hinted  at  the  presence  of  the  said  fusion,  but   it  was  eventually  found  only  aPer  the  manual  ‘‘grep’’-­‐  funcFon  had  been  used.   Simple  is  good  
  67. 67. Fig. 1. Sensing and locating short tandem repeats (STRs) in short reads. (A) An original short read. (B) An approximate STR (AGAGGC)n (n=6) in the short read. The central four copies of AGAGGC are an exact STR with no mutations, while the flanking copies contain the mutations shown in bold letters. If one of the regions (black) surrounding the STR aligns in a unique position, the STR can be located in the genome. (C) A read occupied by an approximate STR. (D) Sensing STRs from frequency distributions of (AGAGCC)n in NA12877 (father of the HapMap CEU trio), NA12878 (mother), and NA18507 (an African male). The x-axis is the lengths of STR occurrences detected in a read, and the y-axis is the frequency of reads containing STR occurrences of the length indicated on the x-axis. Note that 100-bp-long STR occurrences are frequent in NA12877, while no STR occurrences of length >70 bp are observed byguesthttp://bioinformatics.oxfordjournals.org/Downloadedfrom Rapid  detecFon  of  expanded  short  tandem  repeats     in  personal    genomics  using  hybrid  sequencing     Koichiro  Doi,  Taku  Monjo,  Pham  H.  Hoang,  Jun  Yoshimura,  Hideaki  Yurino,  Jun  Mitsui,  Hiroyuki  Ishiura,  Yuji  takahashi,  Yaeko   Ichikawa,  Jun  Goto,  Shoji  Tsuji  and  Shinichi  Morishita     University  of  Tokyo  
  68. 68. Standards  
  69. 69. Human  Variome  Project  
  70. 70. Prototype  NGS  database  
  71. 71. Report  
  72. 72. Sharing  Experience  with  TruSight  One   •  In  partnership  with  Illumina,  RCPA  and  the  HGSA   Kim  Flintoff  (Wellington  Regional  GeneFcs   Laboratory)  is  leading  an  evaluaFon  of  exon   sequencing  using  Illumina’s  True  Sight  One  panel.     Two  Coriell    family  trios  will  be  sequenced   by  New  Zealand  Genomics  Limited  and  the  data   will  be  shared  on  a  HVPA  database   •  The  VCF  file  will  be  available  on  the  HVPA  LOVD   database  and  performance  stats  will  also  be   made  available.  
  73. 73. Next  Steps   •  Robust  standards  for  genomic  medicine   •  Databases  and  data  content   –  Access  to  idenFfied  and  de-­‐idenFfied  data  (consent   and  confidenFality)   –  Database  accreditaFon  process  in  prep  with  RCPA   –  Defining  the  performance  of  various  aligners,  variant   callers  and  annotaFon  programs   –  Clinical  grade  Variant  Call  Format  (VCF)   –  Metafile  covering  data  trail:  what  was  tested,  what   was  not  tested    
  74. 74. Data  quality  classes   DifferenFate  between  three  classes  of  data:     The  Clinically  Reported  data  label  would  denote  the  class  of  data  that  the  HVP   Australian  Node  was  originally  designed  to  collect  and  share:  data  that  has  been   generated  in  a  NATA  accredited  Australian  diagnosFc  laboratory  and  is  able  to  be   included  in  a  clinical  report.     Unreported  Clinical  quality  data  would  denote  data  that  has  been  generated  in  a   NATA  accredited  diagnosFc  laboratory,  but  is  not  capable  of  being  included  in  a   clinical  report.  This  class  would  comprise,  primarily,  of  next-­‐generaFon   sequencing  (NGS)  type  data.     Unaccredited  data  would  be  used  to  denote  data  that  has  been  generated  by  an   Australian  laboratory  that  has  not  been  NATA  accredited     A  new  filtering  opFon  would  be  made  available  to  allow  users  to  view  only  data   of  a  certain  class  
  75. 75. Standards  for  AccreditaFon  of  DNA   Sequence  VariaFon  Databases   Quality  Use  of  Pathology  Program  (QUPP),  a  naFonal  project  for  the  Development  of  Standards  for   AccreditaFon  of  DNA  Sequence  VariaFon  Data  Bases  has  been  jointly  iniFated  by  the  Royal  College  of   Pathologists  of  Australasia  (RCPA),  and  the  Human  Variome  Project  (HVP).   Background   •  There  is  a  rapidly  increasing  volume,  spectrum,  and  complexity  of  geneFc  tests  emerging  within   diagnosFc  pathology  laboratories.  In  parFcular,  high  throughput  sequencing  methods  such  as   targeted  panel,  exome  (WES),  and  whole  genome  sequencing  (WGS),  are  producing  an  increasing   quanFty  of  geneFc  data  requiring  analysis  and  interpretaFon,  forming  a  substanFal  proporFon  of   the  workload.   •   Currently,  there  is  a  plethora  of  online  mutaFon  databases  to  refer  to,  however  there  is  a  disFnct   lack  of  such  databases  that  meet  the  stringent  accuracy  and  reproducibility  that  the  clinical   diagnosFc  environment  demands.  AddiFonally,  The  current  databases  are  “Fractured”,  with  varied   access  and  sharing  of  the  data  within;  and  variable  quality  due  to  errors  /  inaccurate  data  posFng,   all  of  which  is  a  clear  risk  to  the  quality  of  paFent  care.  With  more  widespread,  secure  sharing  of   variants  and  associated  phenotypes,  the  value  of  cumulaFve  variant  informaFon  will  accelerate  the   delivery  of  accurate,  acFonable,  and  efficient  clinical  reports.   •  There  are  currently  no  standards  or  equivalent  mechanisms  for  accreditaFon  of  databases  to   ensure  the  accuracy  and  quality  of  uploaded  data  into  any  central  repository  to  meet  the  needs  of   the  clinical  diagnosFcs  environment.  
  76. 76. Pathogenicity   1.  “Deleterious-­‐  and  Disease-­‐Allele  Prevalence  in  Healthy  Individuals:   Insights  from  Current  PredicFons,  MutaFon  Databases,  and  PopulaFon-­‐ Scale  Resequencing”  Yali  Xue,  Yuan  Chen,  Qasim  Ayub,  Ni  Huang,   Edward  V.  Ball,  Malhew  Mort,  Andrew  D.  Phillips,  Katy  Shaw,  Peter  D.   Stenson,  David  N.  Cooper,  Chris  Tyler-­‐Smith,  and  the  1000  Genomes   Project  ConsorFum  Am  J  Hum  Genet  91,  1022–1032  2012   2.  “Amino  Acid  Changes  in  Disease-­‐Associated  Variants  Differ  Radically   from  Variants  Observed  in  the  1000  Genomes  Project  Dataset”  Tjaart  A.  P.   de  Beer*,  Roman  A.  Laskowski,  Sarah  L.  Parks,  Botond  Sipos,  Nick  Goldman,  Janet  M.   Thornton  PLOS  Comp  Biol,  9  1-­‐15  2013   3.  “Large  Numbers  of  GeneFc  Variants  Considered  to  be  Pathogenic  are   Common  in  AsymptomaFc  Individuals”  Christopher  A.  Cassa,  Mark  Y.  Tong,  and   Daniel  M.  Jordan  HuMu  34.  9  1216–1220,  2013   4.  “Integrated  sequence  analysis  pipeline  provides  one-­‐stop  soluFon  for   idenFfying  disease-­‐causing  mutaFons”  Hao  Hu  ,  Thomas  F  Wienker,  Luciana   Musante,  Vera  MM  Kalscheuer,  Peter  N  Robinson,  H  Hilger  Ropers  HuMu  under  review    
  77. 77. Table  1.  List  of  selected  CNV  detec2on  methods.   Duan  J,  Zhang  J-­‐G,  Deng  H-­‐W,  Wang  Y-­‐P  (2013)  ComparaFve  Studies  of  Copy  Number  VariaFon  DetecFon  Methods  for  Next-­‐GeneraFon   Sequencing  Technologies.  PLoS  ONE  8(3):  e59128.  doi:10.1371/journal.pone.0059128   hlp://www.plosone.org/arFcle/info:doi/10.1371/journal.pone.0059128  
  78. 78. Summary   •  Current  sequencing  technology  has  plenty  of   room  for  improvement  w.r.t.  read  length  and   accuracy   •  Many  informaFcs  challenges  relate  to  managing   poor  quality  data  or  technological  limitaFons  and   will  go  away  with  longer,  more  accurate  reads   •  AnnotaFon,  data  sharing  and  integraFng  variant   data  with  clinical  and  phenotypic  data  are  the   high  value  healthcare  deliverables  
  79. 79. Acknowledgments   •  Genomic  Medicine  &  Centre  for  TranslaFonal   Pathology,  University  of  Melbourne:  Arthur  Hsu,   Olga  Kondrashova,  SebasFan  Lunke,  Clare  Love,   Renate  Marquis-­‐Nicholson,  Kym  Pham,  Paul  Waring   •  Human  Variome  Project:  Tim  Smith,  Alan  Lo,  Dick   Colon   •  Melbourne  Genomics  Health  Alliance:  Clara  Gaff,   Kathryn  North,  Doug  Hilton,  Stephen  Smith  
  80. 80. Targeted  Tumour  Sequencing:     © 2014 Illumina, Inc. All rights reserved. BWA Enrichment, Version 1.0.0.1 Page 2 Sample Information Sample ID: TL140380 Sample Name: TL140380 Total PF Reads: 77,538,750 Percent Q30: 78.6% Adapters Trimmed: Yes Median Read Length: 151 bp Enrichment Summary Target Manifest Total Length of Targeted Reference Padding Size TruSight One v1.0 11,946,514 bp 150 bp Note: All enrichment values are calculated without padding (sequence immediately upstream and downstream) unless otherwise stated. Read Level Enrichment Total Aligned Reads Percent Aligned Reads Targeted Aligned Reads Read Enrichment Padded Target Aligned Reads Padded Read Enrichment 75,690,682 97.6% 49,355,753 65.2% 57,751,970 76.3% Base Level Enrichment Total Aligned Bases Targeted Aligned Bases Base Enrichment Padded Target Aligned Bases Padded Base Enrichment 10,449,595,970 4,976,953,404 47.6% 7,636,013,223 73.1% Enrichment Sequencing Report Coverage Summary Mean Region Coverage Depth Uniformity of Coverage (Pct > 0.2*mean) Target Coverage at 1X Target Coverage at 10X Target Coverage at 20X Target Coverage at 50X 416.6X 96.1% 99.4% 99.1% 98.9% 97.9% Enrichment Sequencing Report Coverage Summary Mean Region Coverage Depth Uniformity of Coverage (Pct > 0.2*mean) Target Coverage at 1X Target Coverage at 10X Target Coverage at 20X Target Coverage at 50X 416.6X 96.1% 99.4% 99.1% 98.9% 97.9% ConsFtuFonal   Frozen   Enrichment Sequencing Report Small Variants Summary SNVs Insertions Deletions Total Passing 8,113 192 230 Percent Found in dbSNP 98.8% 87.5% 74.3% Het/Hom Ratio 1.7 1.8 2.5 Ts/Tv Ratio 3.1 - - Variants by Sequence Context Enrichment Sequencing Report Coverage Summary Mean Region Coverage Depth Uniformity of Coverage (Pct > 0.2*mean) Target Coverage at 1X Target Coverage at 10X Target Coverage at 20X Target Coverage at 50X 2555.9X 94.6% 99.5% 99.3% 99.3% 99.1% Enrichment Sequencing Report Coverage Summary Mean Region Coverage Depth Uniformity of Coverage (Pct > 0.2*mean) Target Coverage at 1X Target Coverage at 10X Target Coverage at 20X Target Coverage at 50X 2555.9X 94.6% 99.5% 99.3% 99.3% 99.1% Enrichment Sequencing Report Small Variants Summary SNVs Insertions Deletions Total Passing 8,244 184 255 Percent Found in dbSNP 98.8% 88.6% 70.6% Het/Hom Ratio 1.7 1.5 2.9 Ts/Tv Ratio 3.0 - - Variants by Sequence Context © 2014 Illumina, Inc. All rights reserved. BWA Enrichment, Version 1.0.0.1 Page 2 Sample Information Sample ID: WES001FR1 Sample Name: WES001FR1 Total PF Reads: 454,879,338 Percent Q30: 81.9% Adapters Trimmed: Yes Median Read Length: 151 bp Enrichment Summary Target Manifest Total Length of Targeted Reference Padding Size TruSight One v1.0 11,946,514 bp 150 bp Note: All enrichment values are calculated without padding (sequence immediately upstream and downstream) unless otherwise stated. Read Level Enrichment Total Aligned Reads Percent Aligned Reads Targeted Aligned Reads Read Enrichment Padded Target Aligned Reads Padded Read Enrichment 445,404,466 97.9% 305,571,144 68.6% 341,517,122 76.7% Base Level Enrichment Total Aligned Bases Targeted Aligned Bases Base Enrichment Padded Target Aligned Bases Padded Base Enrichment 60,040,931,299 30,533,612,135 50.9% 44,910,193,400 74.8%
  81. 81. Called  Variants   FFPE  Frozen   Blood   16,711   2,095   154   979   3,641   9,368   2,455  
  82. 82. Gene  List   GFR city. rect t of es a ittle has atly is.78 is iza- pre- fied ain- Figure 3 Frequency of major driver mutations in signaling ancer g et al
  83. 83. Filtering  Variants   All  variants   None   Qual   Not  in  Blood   Blood   9828   8551   NA   Frozen   9920   8736   126   FFPE   9709   8163   199   Variants  in  Gene  List   None   Qual   Not  in  Blood   Blood   27   18   NA   Frozen   27   23   2  (EGFR)   FFPE   25   19   3  (EGFR,  ROS)  
  84. 84. EGFR  p.L858R  
  85. 85. EGFR  p.T790M  
  86. 86. ConfirmaFon  by  PCR   0.0# 50.0# 100.0# 150.0# 200.0# 250.0# EGFR_NM _005228.3#T790#T790#W T# EGFR_NM _005228.3#784#"c.2350T>C,#p.S784P"# EGFR_NM _005228.3#784#"c.2351C>T,#p.S784F"# EGFR_NM _005228.3#785#"c.2354C>T,#p.T785I"# EGFR_NM _005228.3#786#"c.2356G>A,#p.V786M "# EGFR_NM _005228.3#790#"c.2368A>G,#p.T790A"# EGFR_NM _005228.3#790#"c.2369C>T,#p.T790M "# EGFR_NM _005228.3#828#&#861#"828#&#861,#w t"# EGFR_NM _005228.3#858#"c.2572C>A,#p.L858M "# EGFR_NM _005228.3#858#"c.2573_2574delinsGT,# EGFR_NM _005228.3#858#"c.2573T>A,#p.L858Q"# EGFR_NM _005228.3#858#"c.2573T>G,#p.L858R"# EGFR_NM _005228.3#860#"c.2579A>T,#p.K860I"# EGFR_NM _005228.3#861#"c.2582T>A,#p.L861Q"# EGFR_NM _005228.3#861#"c.2582T>G,#p.L861R"# EGFR%normalised%% 0.0# 0.2# 0.4# 0.6# 0.8# 1.0# 1.2# 1.4# 1.6# 1.8# 2.0# KRAS_NM _033360.2#12#"c.34G>A,#p.G12S"# KRAS_NM _033360.2#12#"c.34G>C,#p.G12R"# KRAS_NM _033360.2#12#"c.34G>T,#p.G12C"# KRAS_NM _033360.2#12#"c.35G>A,#p.G12D"# KRAS_NM _033360.2#12#"c.35G>C,#p.G12A"# KRAS_NM _033360.2#12#"c.35G>T,#p.G12V"# KRAS_NM _033360.2#13#"c.37G>A,#p.G13S"# KRAS_NM _033360.2#13#"c.37G>C,#p.G13R"# KRAS_NM _033360.2#13#"c.37G>T,#p.G13C"# KRAS_NM _033360.2#13#"c.38G>A,#p.G13D"# KRAS_NM _033360.2#13#"c.38G>C,#p.G13A"# KRAS_NM _033360.2#13#"c.38G>T,#p.G13V"# KRAS%normalised%%
  87. 87. Gene  list  summary   gene locus covered,by, capture, panel? Observations Notes AKT1 chr14:105,235,7611105,262,116 y wt activating:point ALK chr2:29,415,410130,146,821 y wt rearrangements:and:secondary:resistance:mutations BRAF chr7:140,433,8131140,624,564 y wt activating:point EGFR chr7:55,248,979155,259,567 Y L858R;:T790M activating:points,:indels,:ressitance:point HER2 chr17:37,844,393137,884,915 Y wt ampliication,:activating:point KRAS chr12:25,386,768125,403,863 Y ? activating:point MAP2K1 chr15:66,679,211166,783,882 Y changed:allele: ratio MET chr7:116,312,4591116,409,963 Y changed:allele: ratio mutation:and:amplification NRAS chr1:115,247,0851115,259,515 Y wt activating:point PI3KCA chr3:178,866,3111178,952,497 Y wt activating:point ROS1 chr6:117,609,5301117,747,018 Y wt rearrangements
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×