Learn from Practice
- What Tells You about a Problematic NGS
Dongyan
Postdoctoral Research Associate
Buell Lab/Jiang Lab
2015.4.8
  
Sources affecting NGS
1.  Systematic variation in quality scores across the sequence read
2.  Quality trimming and cleaning of raw reads
3.  Biases in sequence generation driven by base composition
4.  Contamination from known and unknown species other than the sequencing target
5.  Effect of NGS libraries on assembly quality
6.  Others
7.  ...
  
BASE SEQUENCING QUALITY
  
Per base quality score
[Figure: per-base quality score plots of forward reads and reverse reads for Sample A and Sample B]
  
Library QC using Bioanalyzer
[Figure: Bioanalyzer traces for Sample A and Sample B; size labels at 200, 300, 400, 500, 700, and 800 bp]
Adapted from the report generated by Emily Crisovan (Buell lab)
  	
  
Cause for the poor base quality for Sample B
Illumina flowcells may not handle longer fragments well (Bronner et al., 2009)
  
Cluster Generation: Amplification
[Figure: amplification on the flowcell surface (diol-linked oligos), showing denaturation, annealing, and extension for the 1st and 2nd cycles; n=35 total]
Adapted from Robin’s slides
  
Per base sequence quality
[Figure: per-base sequence quality before cleaning and after cleaning]
Good base quality is the start.
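
The per-base quality plots summarize the Phred scores observed at each read position. A minimal sketch of that summary, assuming Phred+33 encoded quality strings (the function name and the toy quality strings are illustrative, not taken from the slides):

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred score at each read position.

    quality_strings: iterable of FASTQ quality lines, one per read.
    offset: ASCII offset of the quality encoding (33 is assumed here).
    """
    totals = []  # sum of scores seen at each position
    counts = []  # number of reads that cover each position
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Toy data: 'I' encodes Q40 and '5' encodes Q20 under Phred+33,
# so quality drops toward the 3' end of these made-up reads.
toy_reads = ["III5", "II55", "I555"]
per_base = mean_quality_per_position(toy_reads)
```

A falling mean toward the end of the read, as in the toy data, is the pattern the before/after cleaning plots are meant to expose.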
  
QUALITY TRIMMING AND ADAPTER REMOVAL OF RAW READS
  
k-mer content (residual adapter sequences): paired-end reads
[Figure: k-mer content before cleaning and after cleaning]
•  Only happened in paired-end libraries with small insert size (<400 bp).
•  Did not happen in paired-end libraries with insert size greater than 400 bp.
  
k-mer content
•  This is due to 'reading through' a short fragment into the adapter sequence on the other end.
•  Is the default clip threshold too high?
•  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10
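
In Trimmomatic's ILLUMINACLIP step, the three numbers after the adapter file are, in order, the seed mismatches, the palindrome clip threshold, and the simple clip threshold. The read-through effect itself can be illustrated with a much simpler sketch. The code below is not Trimmomatic's algorithm; the function name, the adapter prefix, the fragment, and the overlap and mismatch settings are all assumptions chosen for the example:

```python
def trim_adapter(read, adapter, min_overlap=5, max_mismatches=1):
    """Cut a read at the first position where the adapter (or its prefix,
    if the read ends first) matches with at most max_mismatches."""
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(adapter), len(read) - start)
        window = read[start:start + overlap]
        mismatches = sum(1 for a, b in zip(window, adapter[:overlap]) if a != b)
        if mismatches <= max_mismatches:
            return read[:start]
    return read  # no adapter found, leave the read untouched


# Hypothetical 20 bp fragment sequenced for 30 cycles: the last 10 bases
# of the read come from the adapter on the other end of the fragment.
adapter = "AGATCGGAAGAGC"
fragment = "ACGTTGCAACGTTAGCCTGA"
read = fragment + adapter[:10]
trimmed = trim_adapter(read, adapter)
```

When the fragment is longer than the read, no adapter bases are sequenced, which is consistent with the problem appearing only in the small-insert libraries.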
  
k-mer content (residual adapter sequences): mate pair reads
[Figure: k-mer content after cleaning and grouping reads into categories using NextClip]
•  Those k-mers are from the junction adapter
  
k-mer content
•  Didn’t want to lower the threshold in case it clips more than necessary
•  Used cutadapt with its default settings to remove the residual adapter sequences after Trimmomatic and/or NextClip cleaning
  
Residual adapter on assembly
[Figure: assembly with residual adapter compared to assembly without residual adapter]
Never rush to assembly before you are sure you have high-quality and ‘clean’ read sets!
  
BIASES IN SEQUENCE GENERATION DRIVEN BY BASE COMPOSITION
  
Biases in sequence generation
Paired end reads    GC%: 33%
Mate pair reads     GC%: 40%
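
The GC% values compared here are simple base counts over the read set. A minimal sketch of the calculation, assuming ambiguous bases such as N are left out of the denominator (the function name and toy reads are illustrative):

```python
def gc_percent(sequences):
    """Return the GC percentage over all A/C/G/T bases in the sequences."""
    gc = 0
    total = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        # Only unambiguous bases count toward the total (an assumption).
        total += sum(seq.count(base) for base in "ACGT")
    return 100.0 * gc / total if total else 0.0


# Toy reads: 6 G/C bases out of 16 unambiguous bases.
toy_reads = ["ATGCAT", "GGCCAA", "ATATNN"]
gc = gc_percent(toy_reads)
```

Running the same function on each library separately gives the kind of per-library comparison shown on the slide.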
  
CONTAMINATIONS	
  
Per sequence GC content
SGA preQC
[Figure: per-sequence GC content for Sample A and Sample B]
Contaminations?
  
QC
•  Map reads back to the assembly
•  Taxon-Annotated Gene Content
•  MAKER annotation of the assembly
•  OrthoMCL analysis
  
Mapping reads to the assemblies
•  Assembled reads using ABySS
•  Map reads back to the assembly using Bowtie/1.0.0 in single-end mode, allowing 1 mismatch

Sample B assembly
reads       mapped    unmapped
Sample A    73.37%    26.63%
Sample B    60.94%    39.06%

Contaminations?
  
TAGC
[Figure: TAGC plots for Sample A and Sample B]
https://github.com/blaxterlab/blobology
https://github.com/mojones/blobsplorer
  	
  
TAGC: a phylum highlighted
Sample A
[Figure: TAGC plot with Streptophyta highlighted]
  
TAGC: a phylum highlighted
Sample B
[Figure: TAGC plot with Proteobacteria and Streptophyta highlighted]
  
MAKER annotation
Data used            # contigs > 1000 bp
Sample A assembly    75,417
Sample B assembly    92,833
•  EST evidence
    •  caa_assembly.fasta (Elsa)
•  Protein homology evidence:
    •  uniprot_sprot_plants.fasta
    •  TAIR10_pep_20110103_representative_gene_model
•  Repeat masking: default

                           Sample A      Sample B
Num_of_transcripts         31,234        45,791
Max_len_trans              14,796        29,577
Min_len_trans              28            33
N50                        17,253,963    27,945,180
N50 transcript size        1,409         1,498
Average transcript size    1,105         1,221
With help from Kevin Childs
  
OrthoMCL analysis
•  OrthoMCL DB (web-based)
    –  http://www.orthomcl.org/orthomcl/
    –  search against predefined sets of orthologous groups from a set of organisms
  
OrthoMCL analysis
Steps:
1. All-vs-all BLASTP of the proteins
2. Compute percent match length
   - Select whichever is shorter, the query or subject sequence. Call that sequence S.
   - Count all amino acids in S that participate in any HSP.
   - Divide that count by the length of S and multiply by 100.
3. Apply thresholds to the BLAST result. Keep matches with E-value < 1e-5 and percent match length >= 50%.
4. Find potential inparalog, ortholog and co-ortholog pairs using the OrthoMCL Pairs program (these are the pairs that are counted to form the Average % Connectivity statistic per group).
5. Use the MCL program to cluster the pairs into groups.
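
Steps 2 and 3 above can be sketched as follows. This is an illustration of the stated rules, not OrthoMCL's own code; the function names, the HSP tuple layout (1-based inclusive coordinates), and the choice to use the query when both sequences have equal length are assumptions:

```python
def percent_match_length(query_len, subject_len, hsps):
    """Percent of the shorter sequence S covered by at least one HSP.

    hsps: list of (q_start, q_end, s_start, s_end), 1-based inclusive.
    """
    use_query = query_len <= subject_len  # tie goes to the query (assumption)
    s_len = query_len if use_query else subject_len
    covered = set()
    for q_start, q_end, s_start, s_end in hsps:
        start, end = (q_start, q_end) if use_query else (s_start, s_end)
        # A residue counts once even if several HSPs overlap it.
        covered.update(range(start, end + 1))
    return 100.0 * len(covered) / s_len


def passes_thresholds(evalue, pml, max_evalue=1e-5, min_pml=50.0):
    """Step 3: keep matches with E-value < 1e-5 and match length >= 50%."""
    return evalue < max_evalue and pml >= min_pml


# Toy match: the 100 aa query is shorter, and two overlapping HSPs
# cover query residues 1-60.
pml = percent_match_length(100, 200, [(1, 40, 10, 49), (31, 60, 100, 129)])
kept = passes_thresholds(1e-20, pml)
```

Using a set of covered positions keeps overlapping HSPs from being counted twice, which matches the "participate in any HSP" wording.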
  
	
  
	
  
orthomclResults/
1.  orthologGroups    a map between your proteins and OrthoMCL groups.
2.  paralogPairs      reciprocal best hits among those proteins in your genome that were not mapped to OrthoMCL groups
3.  paralogGroups     the proteins in paralogPairs clustered into groups by the mcl program

OrthoMCL analysis
orthologGroups
your_protein, orthomcl_group, seq_id_of_best_hit, evalue_mantissa, evalue_exponent, percent_identity, percent_match
•  Downloaded the “category”, “species name”, and “abbreviation” info from the website
•  Used Perl scripts to add the corresponding species name and category to the orthologGroups file
•  Calculated # of orthologous groups in each category
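
The counting step could look roughly like the sketch below. The original work used Perl scripts that are not shown, so this is only an illustration under several assumptions: the orthologGroups file is tab-delimited with the columns listed above, the best-hit ID has the form `species|protein`, and the species-to-category lookup has already been built from the downloaded table. All IDs, species codes, and names in the example are hypothetical:

```python
import csv
import io
from collections import defaultdict


def count_groups_per_category(rows, species_to_category):
    """Count distinct OrthoMCL groups per category.

    rows: parsed orthologGroups rows (column 1 = orthomcl_group,
          column 2 = seq_id_of_best_hit).
    species_to_category: hypothetical lookup built from the website table.
    """
    groups = defaultdict(set)
    for row in rows:
        group, best_hit = row[1], row[2]
        species = best_hit.split("|", 1)[0]  # assumes 'species|protein' IDs
        category = species_to_category.get(species, "unknown")
        groups[category].add(group)
    return {category: len(ids) for category, ids in groups.items()}


# Made-up rows standing in for an orthologGroups file.
toy_table = io.StringIO(
    "prot1\tGROUP_A\tsp1|x1\t1.0\t-50\t80.0\t90.0\n"
    "prot2\tGROUP_A\tsp1|x2\t2.0\t-40\t75.0\t88.0\n"
    "prot3\tGROUP_B\tsp2|y1\t1.0\t-30\t60.0\t70.0\n"
    "prot4\tGROUP_C\tsp2|y2\t3.0\t-20\t55.0\t65.0\n"
)
lookup = {"sp1": "VIRI", "sp2": "PROT"}
counts = count_groups_per_category(csv.reader(toy_table, delimiter="\t"), lookup)
```

Counting distinct group IDs rather than rows keeps a group with many member proteins from inflating its category.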
  
OrthoMCL analysis
category           abbreviation
Archaea            ARCH
Bacteria           FIRM
Bacteria           OBAC
Bacteria           PROT
Fungi              FUNG
Metazoa            META
other Eukaryota    OEUK
Protist            ALVE
Protist            AMOE
Protist            EUGL
Viridiplantae      VIRI
  
Orthologous groups
category           abbreviation
Archaea            ARCH
Bacteria           FIRM
Bacteria           OBAC
Bacteria           PROT
Fungi              FUNG
Metazoa            META
other Eukaryota    OEUK
Protist            ALVE
Protist            AMOE
Protist            EUGL
Viridiplantae      VIRI
FIRM: Firmicutes
OBAC: Other Bacteria
PROT: Proteobacteria
[Figure: orthologous groups per category, with Bacteria and Protist panels, for Sample A, Sample B, and Sample A+B]
  
SEQUENCING LIBRARIES ON GENOME ASSEMBLY
  
Assembly Using ABySS
•  MP libraries improved the assembly

Libraries                                  k-mer  total # of contigs  # contigs >= 500 bp  # contigs > N50  N50     max      sum          #N
Paired-end reads only
New PE (4 libraries)                       75     500,224             75,165               9,009            11,748  106,374  367,800,000  281,121
New PE (6 libraries)                       75     504,588             74,671               8,886            11,911  106,374  367,800,000  297,396
Paired-end and mate pair reads
New PE (4 libraries) + MP (2 libraries)    75     168,163             31,320               3,088            37,026  289,426  401,000,000  281,121
New PE (6 libraries) + MP (2 libraries)    75     171,733             29,974               3,000            38,350  274,863  401,200,000  297,396
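
The N50 column is the usual contiguity statistic: the contig length at which the running total, taken from the longest contig downward, reaches half of the summed length. A minimal sketch of that definition (the function name and toy lengths are illustrative; whether the table's values were computed over all contigs or only those >= 500 bp is not stated in the slides):

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig where the running sum covers half the total.
        if running * 2 >= total:
            return length
    return 0


# Toy contigs summing to 54: 10 + 9 + 8 = 27 reaches half, so N50 is 8.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = n50(toy_lengths)
```

Fewer, longer contigs raise this value, which is the change visible in the table when the mate pair libraries are added.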

OTHER THINGS
  
SRA
•  Reads from DRR004446.sra and DRR004447.sra are exactly the same

#     Run          # of Spots    # of Bases    Size
1.    DRR004446    14,841,025    2.7G          1.5Gb
1.    DRR004447    14,841,025    2.7G          1.5Gb
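
One way to confirm that two runs carry identical reads is to hash the sequence and quality lines while ignoring the headers, which differ by run accession. The sketch below assumes the .sra files have already been converted to four-line FASTQ records; the function name and the toy records are illustrative:

```python
import hashlib


def reads_digest(fastq_lines):
    """SHA-256 over sequence and quality lines of a four-line FASTQ stream.

    Header and '+' lines are skipped so that run-specific read names
    do not affect the result.
    """
    digest = hashlib.sha256()
    for i, line in enumerate(fastq_lines):
        if i % 4 in (1, 3):  # line 1 = sequence, line 3 = quality
            digest.update(line.strip().encode("ascii"))
            digest.update(b"\n")
    return digest.hexdigest()


# Made-up records: same read content under two different run names.
run_a = ["@DRR004446.1", "ACGTACGT", "+", "IIIIIIII"]
run_b = ["@DRR004447.1", "ACGTACGT", "+", "IIIIIIII"]
run_c = ["@DRR004447.1", "ACGTACGA", "+", "IIIIIIII"]
same = reads_digest(run_a) == reads_digest(run_b)
different = reads_digest(run_a) != reads_digest(run_c)
```

Because the digest depends on read order, this check only reports a match when the two files list the same reads in the same order.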
  
Take home message
•  You can’t be overcautious with NGS data!
•  Always do QC before further analysis!
http://en.wikipedia.org/wiki/DNA_sequencing
  
