2015.04.08-Next-generation-sequencing-issues

Learn from Practice: What Tells You about a Problematic NGS

Dongyan
Postdoctoral Research Associate
Buell Lab/Jiang Lab
2015.4.8

Sources affecting NGS

1.  Systematic variation in quality scores across the sequence read
2.  Quality trimming and cleaning of raw reads
3.  Biases in sequence generation driven by base composition
4.  Contamination from known and unknown species other than the sequencing target
5.  Effect of NGS libraries on assembly quality
6.  Others
7.  ………………………………….

BASE SEQUENCING QUALITY

Per base quality score
[Figure: per-base quality score plots for forward and reverse reads of Sample A and Sample B; size labels: 200, 300, 400, 500, 700, 800 bp]

Library QC using Bioanalyzer
[Figure: Bioanalyzer traces for Sample A and Sample B]

Adapted from the report generated by Emily Crisovan (Buell lab)

Cause for the poor base quality for Sample B

Illumina flowcells may not handle longer fragments well (Bronner et al., 2009).

Cluster Generation: Amplification
[Figure: cluster amplification on the flowcell with diol-labeled primers, showing denaturation, annealing, and extension for the 1st and 2nd cycles; n=35 total]
Adapted from Robin's slides

Per base sequence quality
[Figure: per-base sequence quality before cleaning and after cleaning]

Good base quality is the start.
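
The per-base quality plots summarize Phred scores by read position. As a minimal sketch of that summary, assuming Phred+33 encoded quality strings (the function name is hypothetical and not from the talk), the mean score per position could be computed as:

```python
from statistics import mean


def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred score at each read position.

    quality_strings: iterable of FASTQ quality lines (Phred+33 assumed).
    Reads may differ in length; each position is averaged over the
    reads that are long enough to cover it.
    """
    columns = []  # columns[i] collects the scores seen at position i
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(columns):
                columns.append([])
            columns[i].append(ord(ch) - offset)
    return [mean(col) for col in columns]


# Two toy reads: quality drops toward the 3' end, as in the plots.
print(mean_quality_per_position(["IIII#", "II5#"]))
```

A steady decline of these means toward the end of the read is the pattern that quality trimming is meant to remove.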

QUALITY TRIMMING AND ADAPTER REMOVAL OF RAW READS

k-mer content (residual adapter sequences): paired-end reads
[Figure: k-mer content before cleaning and after cleaning]

•  Only happened to paired-end libraries with small insert size (<400 bp).
•  Did not happen to paired-end libraries with insert size greater than 400 bp.

k-mer content

•  This is due to 'reading through' a short fragment into the adapter sequence on the other end.
•  Is the default clip threshold too high?
•  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10
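
To illustrate the read-through effect described above, here is a toy sketch. It is not how Trimmomatic or cutadapt work internally: it uses exact matching only, and the adapter string, function names, and `min_overlap` value are assumptions chosen for illustration.

```python
ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter prefix (assumption)


def simulate_read(insert, read_length, adapter=ADAPTER):
    """Return what is sequenced: the insert followed by the adapter,
    truncated to the read length."""
    return (insert + adapter)[:read_length]


def clip_adapter(read, adapter=ADAPTER, min_overlap=3):
    """Remove a full adapter, or a partial adapter prefix at the 3' end.

    Exact matches only; real tools also tolerate mismatches.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Look for the longest adapter prefix that ends the read.
    for k in range(min(len(adapter), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


short_insert = "ACGTACGTAC"            # 10 bp, shorter than the read
long_insert = "ACGTACGTACGTACGTACGT"   # 20 bp, longer than the read

read = simulate_read(short_insert, read_length=15)
print(read, "->", clip_adapter(read))  # the adapter tail is removed
read = simulate_read(long_insert, read_length=15)
print(read, "->", clip_adapter(read))  # unchanged: no read-through
```

Only the read from the short insert carries adapter bases, which matches the observation that residual adapter k-mers show up in the small-insert libraries.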

k-mer content (residual adapter sequences): mate pair reads
[Figure: k-mer content after cleaning and grouping reads into categories using NextClip]

•  Those k-mers are from the junction adapter

k-mer content

•  Didn't want to lower the threshold in case it clipped more than necessary
•  Used cutadapt with its default settings to remove the residual adapter sequences after Trimmomatic and/or NextClip cleaning

Residual adapter on assembly
[Figure: assemblies with (w/) and without (w/o) residual adapter]

Never rush to assembly before you are sure you have high-quality and 'clean' read sets!

BIASES IN SEQUENCE GENERATION DRIVEN BY BASE COMPOSITION

Biases in sequence generation

Paired end reads    GC%: 33%
Mate pair reads     GC%: 40%

Per sequence GC content
[Figure: SGA preQC per-sequence GC content plots for Sample A and Sample B]
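
The GC% figures above are simple base-composition ratios. A minimal sketch of the per-sequence calculation follows; the function name is hypothetical, and excluding ambiguous bases such as N from the denominator is an assumption that may differ from what a given QC tool reports.

```python
def gc_percent(seq):
    """Percent of G and C among the called bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0  # nothing but ambiguous bases
    return 100.0 * (seq.count("G") + seq.count("C")) / called


print(gc_percent("GGCCAATT"))  # 50.0
```

Computing this for every read and plotting the distribution gives a per-sequence GC content curve; a second peak or a shifted mean between libraries is a hint of bias or contamination.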

Contaminations?

QC

•  Map reads back to the assembly
•  Taxon-Annotated GC-Coverage (TAGC) plots
•  MAKER annotation of the assembly
•  OrthoMCL analysis

Mapping reads to the assemblies

•  Assembled reads using ABySS
•  Mapped reads back to the assembly using Bowtie/1.0.0 in single-end mode, allowing 1 mismatch

Sample B assembly

reads       mapped    unmapped
Sample A    73.37%    26.63%
Sample B    60.94%    39.06%

Contaminations?

TAGC
[Figure: TAGC plots for Sample A and Sample B]

https://github.com/blaxterlab/blobology
https://github.com/mojones/blobsplorer

TAGC with a phylum highlighted
[Figure: Sample A, highlighted phylum: Streptophyta]

TAGC with a phylum highlighted
[Figure: Sample B, highlighted phyla: Proteobacteria and Streptophyta]

MAKER annotation

Data used            #contigs >1000 bp
Sample A assembly    75,417
Sample B assembly    92,833

•  EST evidence
–  caa_assembly.fasta (Elsa)
•  Protein homology evidence:
–  uniprot_sprot_plants.fasta
–  TAIR10_pep_20110103_representative_gene_model
•  Repeat masking: default

                           Sample A      Sample B
Num_of_transcripts         31,234        45,791
Max_len_trans              14,796        29,577
Min_len_trans              28            33
N50                        17,253,963    27,945,180
N50 transcript size        1,409         1,498
Average transcript size    1,105         1,221

With help from Kevin Childs

OrthoMCL analysis

•  OrthoMCL DB (web-based)
–  http://www.orthomcl.org/orthomcl/
–  search against predefined sets of orthologous groups from a set of organisms

OrthoMCL analysis

Steps:

1.  All-vs-all BLASTP of the proteins
2.  Compute percent match length
-  Select whichever is shorter, the query or subject sequence. Call that sequence S.
-  Count all amino acids in S that participate in any HSP.
-  Divide that count by the length of S and multiply by 100.
3.  Apply thresholds to the BLAST result. Keep matches with E-value < 1e-5 and percent match length >= 50%.
4.  Find potential inparalog, ortholog and co-ortholog pairs using the OrthomclPairs program (these are the pairs that are counted to form the Average % Connectivity statistic per group).
5.  Use the MCL program to cluster the pairs into groups.
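
Steps 2 and 3 above can be sketched as follows. This is an illustration of the stated rule, not OrthoMCL's own code: the function names and the HSP tuple layout (1-based, inclusive query and subject coordinates) are assumptions.

```python
def percent_match_length(hsps, query_len, subject_len):
    """Percent of the shorter sequence S covered by at least one HSP.

    hsps: list of (q_start, q_end, s_start, s_end), 1-based inclusive.
    """
    use_query = query_len <= subject_len
    shorter_len = query_len if use_query else subject_len

    # Take the HSP coordinates on S and sort them by start position.
    intervals = []
    for q_start, q_end, s_start, s_end in hsps:
        a, b = (q_start, q_end) if use_query else (s_start, s_end)
        intervals.append((min(a, b), max(a, b)))
    intervals.sort()

    # Count each residue of S once, even where HSPs overlap.
    covered = 0
    last_end = 0
    for start, end in intervals:
        start = max(start, last_end + 1)
        if end >= start:
            covered += end - start + 1
            last_end = end
    return 100.0 * covered / shorter_len


def keep_match(evalue, pml, max_evalue=1e-5, min_pml=50.0):
    """Apply the thresholds from step 3."""
    return evalue < max_evalue and pml >= min_pml


# Two overlapping HSPs on a 100-aa query against a 200-aa subject.
pml = percent_match_length([(1, 50, 10, 59), (40, 80, 100, 140)], 100, 200)
print(pml, keep_match(1e-20, pml))  # 80.0 True
```

Merging the overlapping intervals before counting is what makes the result "amino acids that participate in any HSP" rather than a sum of HSP lengths.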

orthomclResults/

1.  orthologGroups
a map between your proteins and OrthoMCL groups.
2.  paralogPairs
reciprocal best hits among those proteins in your genome that were not mapped to OrthoMCL groups
3.  paralogGroups
the proteins in paralogPairs clustered into groups by the mcl program

OrthoMCL analysis

orthologGroups

your_protein, orthomcl_group, seq_id_of_best_hit, evalue_mantissa, evalue_exponent, percent_identity, percent_match

•  Downloaded the "category", "species name", and "abbreviation" info from the website
•  Used Perl scripts to add the corresponding species name and category to the orthologGroups file
•  Calculated the number of orthologous groups in each category
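
The talk used Perl scripts for this tally; the sketch below shows the same idea in Python and is not the original code. It assumes that the best-hit sequence ID starts with a species abbreviation followed by `|`, and the group IDs, abbreviations, and lookup table in the example are made up for illustration.

```python
from collections import defaultdict


def groups_per_category(rows, species_to_category):
    """Count distinct OrthoMCL groups per taxonomic category.

    rows: iterable of (your_protein, orthomcl_group, seq_id_of_best_hit).
    species_to_category: maps a species abbreviation to its category.
    """
    groups = defaultdict(set)
    for _protein, group, best_hit in rows:
        abbrev = best_hit.split("|", 1)[0]  # assumed "abbrev|id" format
        category = species_to_category.get(abbrev, "unknown")
        groups[category].add(group)
    return {category: len(ids) for category, ids in groups.items()}


# Hypothetical input rows and lookup table.
rows = [
    ("prot1", "OG_1", "atha|gene1"),
    ("prot2", "OG_1", "atha|gene2"),
    ("prot3", "OG_2", "ecol|gene3"),
]
lookup = {"atha": "VIRI", "ecol": "PROT"}
print(groups_per_category(rows, lookup))  # {'VIRI': 1, 'PROT': 1}
```

Counting distinct groups rather than proteins keeps a large gene family from inflating one category.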

OrthoMCL analysis

category           abbreviation
Archaea            ARCH
Bacteria           FIRM
Bacteria           OBAC
Bacteria           PROT
Fungi              FUNG
Metazoa            META
other Eukaryota    OEUK
Protist            ALVE
Protist            AMOE
Protist            EUGL
Viridiplantae      VIRI

Orthologous groups
[Figure: orthologous groups per category (ARCH, FIRM, OBAC, PROT, FUNG, META, OEUK, ALVE, AMOE, EUGL, VIRI) for Sample A, Sample B, and Sample A+B; Bacteria and Protist labeled]

FIRM: Firmicutes
OBAC: Other Bacteria
PROT: Proteobacteria

SEQUENCING LIBRARIES ON GENOME ASSEMBLY

Assembly Using ABySS

•  MP libraries improved the assembly

Libraries                                  k-mer  total # of contigs  #contigs >= 500 bp  #contigs > N50  N50     max      sum          #N
Paired-end reads only
New PE (4 libraries)                       75     500,224             75,165              9,009           11,748  106,374  367,800,000  281,121
New PE (6 libraries)                       75     504,588             74,671              8,886           11,911  106,374  367,800,000  297,396
Paired-end and mate pair reads
New PE (4 libraries) + MP (2 libraries)    75     168,163             31,320              3,088           37,026  289,426  401,000,000  281,121
New PE (6 libraries) + MP (2 libraries)    75     171,733             29,974              3,000           38,350  274,863  401,200,000  297,396
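
For reference, N50 in the table is the contig length at which the contigs of that length or longer hold at least half of the assembled bases. A minimal sketch of the calculation follows; the function name is hypothetical, and assemblers may apply a minimum contig length before computing it.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

The jump in N50 from about 12 kb to about 37 kb when the mate pair libraries are added is what the bullet above refers to.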

SRA

•  Reads from DRR004446.sra and DRR004447.sra are exactly the same

#     Run          # of Spots    # of Bases    Size
1.    DRR004446    14,841,025    2.7G          1.5Gb
1.    DRR004447    14,841,025    2.7G          1.5Gb
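
One way to catch duplicated submissions like this is to compare checksums of the extracted read files. The sketch below is an assumption about how such a check could be done, not the method used in the talk. The function names are hypothetical, and the comparison is best run on the extracted FASTQ files, since container files may differ in metadata even when the reads match.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large read files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if the two files are byte-for-byte identical (by SHA-256)."""
    return file_digest(path_a) == file_digest(path_b)
```

Calling `same_content` on two extracted FASTQ files (paths supplied by the user) returns `True` only when they are identical; identical spot and base counts, as in the table above, are a cheaper first hint.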

Take home message

•  You can't be too cautious with NGS data!
•  Always do QC before further analysis!

http://en.wikipedia.org/wiki/DNA_sequencing
