Primary analysis tutorial (deprecated)

Bioinformatics Primary Analysis Tutorial

Phil Richmond, PRA
Dowell Lab
University of Colorado, BioFrontiers Institute

Outline

•  Intro
– Things that will be covered
– Things that won’t be covered
•  Workflow
•  Mapping with Bowtie
•  File Conversion with Samtools
•  Visualization with IGV
•  Extras

Sequencing

•  There are many different types of sequencing, including 454, Illumina, SOLiD, IonTorrent, and more.
•  If you are interested in each type of sequencing…

Things that will be covered

•  The primary analysis that I will walk through is a “bare bones” analysis, meant to take your reads from the Illumina sequencer to a visualizer, along with some organizational practices
– Mapping (Bowtie/BWA)
– File format conversion
– Visualization

Things that won’t be covered

•  Post/preprocessing steps that I’m leaving out include:
–  FastX analysis of raw reads and adapter clipping, etc.
–  PCR duplicate marking (Illumina) on raw reads
–  Base Quality Score Recalibration (GATK) on mapped reads
–  Local Realignment around indels on mapped reads
•  Any Secondary or Tertiary analysis or scripting techniques
–  Secondary analysis by personal appt.
–  Scripting techniques by joining Dave Knox’s Python class

Login to Tuxedo

•  Log in with the -X option to open the X11 viewer.
•  On a PC… see me for separate instructions to pipe visualization
•  ssh -X richmonp@tuxedo.colorado.edu

Working Directory

•  We will be working in /data/Tutorial/<Student>
–  cd /data/Tutorial/Phil/
•  The necessary files for the tutorial are in /data/Tutorial/Files/
–  Parent113010.fa is the reference (E. coli) genome
–  Parent120710.gff is the annotation file
–  Sample1_single.fastq is the reads file we are working with

Organization

•  In your own directory (/data/Tutorial/<Student>/) create the following sub-directories:
– Genome/
•  Keep the fasta and gff files here
– Bowtie/
•  Keep the Bowtie alignments, and post-processing of Bowtie alignments, here
– Fastq/
•  Keep the raw fastq files here
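The layout above can be created in one step. This is a minimal sketch: `TUTORIAL_DIR` is a hypothetical variable name, and it defaults to a scratch directory so the snippet is safe to run anywhere. On Tuxedo you would set it to your own /data/Tutorial/<Student>/ directory first.

```shell
#!/bin/sh
# TUTORIAL_DIR is an assumed name; it falls back to a temporary directory.
TUTORIAL_DIR="${TUTORIAL_DIR:-$(mktemp -d)}"

# -p creates parent directories as needed and does not fail if they already exist.
mkdir -p "$TUTORIAL_DIR/Genome" "$TUTORIAL_DIR/Bowtie" "$TUTORIAL_DIR/Fastq"

# Show the resulting sub-directories.
ls "$TUTORIAL_DIR"
```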

Workflow

•  Raw Reads (Fastq)
•  Mapping (Bowtie) → Mapped Reads (SAM)
•  File Conversion (SAMTOOLS) → Binary Mapped Reads (SORTED.BAM)
•  Visualization (IGV)

Fastq file

•  File extension .fastq or .fq
•  Example:
@Read_identifier_and_flowcell_info    (Read ID)
ACGTCCGGTTNNN…    (Read Sequence)
+    (Read QV ID)
B$!?NP[%&C…    (Read QV Sequence)
•  For more info on ASCII encoding of QV scores… go to Wikipedia
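Because every read occupies exactly four lines, a quick sanity check on a fastq file is to count its lines and divide by four. The sketch below runs on a toy two-read file made up for illustration (it is not the tutorial data); the variable names are assumptions.

```shell
#!/bin/sh
# Toy two-read FASTQ for illustration only.
fastq_file="$(mktemp)"
cat > "$fastq_file" <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
TTGCATGCAN
+
IIIIIIIII#
EOF

# Each record is 4 lines (ID, sequence, '+', qualities), so reads = lines / 4.
total_lines=$(wc -l < "$fastq_file")
read_count=$((total_lines / 4))
echo "reads: $read_count"

# Line 2 of every record is the sequence; list the distinct read lengths.
read_lengths=$(awk 'NR % 4 == 2 { print length($0) }' "$fastq_file" | sort -n | uniq)
echo "read lengths: $read_lengths"
```

On a real run you would point `fastq_file` at something like Sample1_single.fastq instead of the toy file.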

Mapping the Short Reads

•  Taking each read and mapping it to a reference genome
– Bowtie

TGCATGCATGCATGCATGCATGCATGCATGCATGCAAAAAGCATGCATGCA

TGCATGAATGCAAAAAGCATGCA

Bowtie-Build Command

•  In order to map the reads to a genome, you must acquire the genome in the .fasta (.fa) format, and then index it.
•  bowtie-build -f <in.fasta> <out_prefix>
– $ bowtie-build SGDv4.fasta SGDv4_bowtie
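Before indexing, it can help to confirm that the fasta file looks the way you expect. The sketch below counts sequences and bases in a toy fasta file invented for illustration; the file contents and variable names are assumptions, not part of the tutorial data.

```shell
#!/bin/sh
# Toy FASTA with a single short sequence split over two lines (hypothetical contents).
fasta_file="$(mktemp)"
cat > "$fasta_file" <<'EOF'
>chr_demo
TGCATGCATGCATGCA
AAAAGCATGCATGCA
EOF

# Header lines start with '>'; there is one per sequence.
seq_count=$(grep -c '^>' "$fasta_file")

# Total bases: sum the lengths of all non-header lines.
base_count=$(awk '!/^>/ { n += length($0) } END { print n + 0 }' "$fasta_file")

echo "sequences: $seq_count, bases: $base_count"
```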

Bowtie command

•  Now we map back to the reference we just indexed.
•  bowtie <reference_in.prefix> -q <in.fastq> -S <out.SAM> 2> <out.stderr>
– $ bowtie /data/Tutorial/Phil/Genome/Bowtie_index/SGDv3_bowtie -q Sample1.fastq -S Sample1_bowtie.sam 2> Sample1_bowtie.stderr

SAM File

•  Tab Delimited
•  http://genome.sph.umich.edu/wiki/SAM
•  Open Example SAM
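Since SAM is tab delimited, standard tools such as awk can pull fields out of it directly. The sketch below counts mapped and unmapped reads by testing the 0x4 bit of the FLAG column (field 2). The three alignment records are toy data written for illustration, and the variable names are assumptions.

```shell
#!/bin/sh
# Toy SAM: one header line and three records (columns: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL).
sam_file="$(mktemp)"
{
  printf '@SQ\tSN:chr_demo\tLN:31\n'
  printf 'read1\t0\tchr_demo\t1\t255\t10M\t*\t0\t0\tTGCATGCATG\tIIIIIIIIII\n'
  printf 'read2\t16\tchr_demo\t5\t255\t10M\t*\t0\t0\tTGCATGCATG\tIIIIIIIIII\n'
  printf 'read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n'
} > "$sam_file"

# Skip header lines (they start with '@'). FLAG bit 0x4 set means "unmapped";
# int(flag / 4) % 2 extracts that bit without needing a bitwise operator.
mapped=$(awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 0 { n++ } END { print n + 0 }' "$sam_file")
unmapped=$(awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 1 { n++ } END { print n + 0 }' "$sam_file")

echo "mapped: $mapped, unmapped: $unmapped"
```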

Samtools Commands

•  samtools view -bS <in.sam> -o <out.bam>
– $ samtools view -bS Sample1_bowtie.sam -o Sample1_bowtie.bam
•  samtools sort <in.bam> <out.sorted>
– $ samtools sort Sample1_bowtie.bam Sample1_bowtie.sorted
•  samtools index <in.sorted.bam>
– $ samtools index Sample1_bowtie.sorted.bam
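The mapping and conversion steps so far can be chained into one script. This is only a sketch under stated assumptions: `run`, `DRY_RUN`, `sample`, and `index_prefix` are hypothetical names, and by default the script just prints each command rather than executing it, so it can be tried without bowtie or samtools installed. The commands use the same legacy syntax as the slides; samtools 1.x expects `samtools sort -o <out.bam> <in.bam>` instead of an output prefix.

```shell
#!/bin/sh
set -e

# DRY_RUN=1 (the default here) only prints commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"
sample="Sample1"              # sample prefix used in the slides
index_prefix="SGDv4_bowtie"   # assumed to match the prefix given to bowtie-build
step_count=0

# Print a command, then run it unless this is a dry run.
run() {
  step_count=$((step_count + 1))
  echo "+ $*"
  if [ "$DRY_RUN" -eq 0 ]; then
    "$@"
  fi
}

# 1. Map reads to the indexed reference, writing SAM output.
#    (The slides also append 2> Sample1_bowtie.stderr to keep the mapping summary.)
run bowtie "$index_prefix" -q "${sample}.fastq" -S "${sample}_bowtie.sam"

# 2. Convert SAM to BAM.
run samtools view -bS "${sample}_bowtie.sam" -o "${sample}_bowtie.bam"

# 3. Sort the BAM (legacy syntax: the second argument is an output prefix).
run samtools sort "${sample}_bowtie.bam" "${sample}_bowtie.sorted"

# 4. Index the sorted BAM so IGV can load it.
run samtools index "${sample}_bowtie.sorted.bam"

echo "steps planned: $step_count"
```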

IGV

•  Located at /data2/IGV/
•  Several different versions available, recommend either:
•  /data2/IGV/IGV_2.1.19/igv.jar
•  /data2/IGV/IGV_1.5.64/igv.jar
•  To run IGV:
– java -Xmx5g -jar <igv.jar>
•  $ java -Xmx5g -jar /data2/IGV/IGV_1.5.64/igv.jar &

IGV: Creating a genome

•  Reference Instructions on sheet.

Bowtie and Bfast IGV

[IGV screenshot; tracks: Bowtie, Bfast, Gene]

Advantages to Bfast Gapped Mapping

[IGV screenshot; tracks: Bowtie, Bfast, Gene]

Bfast Mapping Loosely

[IGV screenshot; tracks: Bowtie, Bfast, Gene]

If you are getting the hang of it quickly…

•  Try going through the next few commands

BWA Paired end

•  /usr/local/src/bwa-0.6.2/bwa index -a is -f <in.fasta>
•  Map each read in the pair independently
bwa aln <reference.prefix> <in_1.fq> > <out.sai>
•  Finalize the mapping by converting (for both reads) both the .SAI and the .FQ into a final SAM alignment:
bwa sampe <reference.prefix> <in_1.sai> <in_2.sai> <in_1.fq> <in_2.fq> > <out_paired.sam>

Bowtie Unique Mapping

•  Investigate the different Bowtie options:
– Look at -m (maximum number of mappings allowed per read), -v (number of mismatches allowed per read)

TopHat Spliced Mapping

•  /usr/local/src/tophat-2.0.4.Linux_x86_64/tophat -G <in.gff> -o <output_directory> <bowtie_index> <in.fastq>
