2. Outline
• Intro
– Things that will be covered
– Things that won’t be covered
• Workflow
• Mapping with Bowtie
• File Conversion with Samtools
• Visualization with IGV
• Extras
3. Sequencing
• There are many different types of sequencing, including 454, Illumina, SOLiD, IonTorrent, and more.
• If you are interested in each type of sequencing…
4. Things that will be covered
• The primary analysis that I will walk through is a “bare bones” analysis, meant to take your reads from the Illumina sequencer to the visualizer, as well as some organizational practices
– Mapping (Bowtie/BWA)
– File format conversion
– Visualization
5. Things that won’t be covered
• Post/preprocessing steps that I’m leaving out include:
– FastX analysis of raw reads and adapter clipping, etc.
– PCR duplicate marking (Illumina) on raw reads
– Base Quality Score Recalibration (GATK) on mapped reads
– Local Realignment around indels on mapped reads
• Any Secondary or Tertiary analysis or scripting techniques
– Secondary analysis by personal appt.
– Scripting techniques by joining Dave Knox’s Python class
6. Login to Tuxedo
• Log in with the -X option to open an X11 viewer.
• On a PC… see me for separate instructions to pipe visualization
• ssh -X richmonp@tuxedo.colorado.edu
7. Working Directory
• We will be working in /data/Tutorial/<Student>
– cd /data/Tutorial/Phil/
• The necessary files for the tutorial are in /data/Tutorial/Files/
– Parent113010.fa is the reference (E. coli) genome
– Parent120710.gff is the annotation file
– Sample1_single.fastq is the reads file we are working with
8. Organization
• In your own directory (/data/Tutorial/<Student>/) create the following sub-directories:
– Genome/
• Keep the fasta and gff files here
– Bowtie/
• Keep the Bowtie alignments, and post-processing of Bowtie alignments, here
– Fastq/
• Keep the raw fastq files here
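The layout above could be created in one step. This is a minimal sketch: the STUDENT_DIR variable is a hypothetical name, and it defaults to a temporary directory so it can be tried anywhere. On Tuxedo you would set it to your own /data/Tutorial/<Student> directory.

```shell
# Hypothetical variable; on Tuxedo set it to /data/Tutorial/<Student>.
# Falls back to a temporary directory so the sketch can be tried anywhere.
STUDENT_DIR="${STUDENT_DIR:-$(mktemp -d)}"

# -p creates missing parents and does not complain if a directory exists.
mkdir -p "$STUDENT_DIR/Genome" "$STUDENT_DIR/Bowtie" "$STUDENT_DIR/Fastq"

# On Tuxedo, the tutorial files could then be copied into place, e.g.:
#   cp /data/Tutorial/Files/Parent113010.fa      "$STUDENT_DIR/Genome/"
#   cp /data/Tutorial/Files/Parent120710.gff     "$STUDENT_DIR/Genome/"
#   cp /data/Tutorial/Files/Sample1_single.fastq "$STUDENT_DIR/Fastq/"

ls "$STUDENT_DIR"
```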
11. Fastq file
• File extension .fastq or .fq
• Example:
@Read_identifier_and_flowcell_info
ACGTCCGGTTNNN…
+
B$!?NP[%&C…
(Line labels, top to bottom: Read ID, Read Sequence, Read QV ID, Read QV Sequence)
• For more info on ASCII encoding of QV scores… go to Wikipedia
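Because every record is exactly four lines, a FASTQ file can be summarized with awk alone. The sketch below writes a tiny file with two made-up reads (not tutorial data) and reports the read count and each read's length. It assumes records are not wrapped across extra lines.

```shell
# Write a tiny two-read FASTQ file (made-up reads, for illustration only).
FASTQ_FILE="$(mktemp)"
cat > "$FASTQ_FILE" <<'EOF'
@read1
ACGTCCGGTT
+
IIIIIIIIII
@read2
ACGTNN
+
II##!!
EOF

# Each record is four lines, so the read count is the line count divided by 4.
READ_COUNT=$(awk 'END { print NR / 4 }' "$FASTQ_FILE")
echo "reads: $READ_COUNT"

# Line 1 of each record is the ID (minus the leading @), line 2 the sequence.
READ_LENGTHS=$(awk 'NR % 4 == 1 { id = substr($0, 2) }
                    NR % 4 == 2 { print id, length($0) }' "$FASTQ_FILE")
echo "$READ_LENGTHS"
```

Counting by line position rather than searching for "@" matters because a quality line may itself begin with "@".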
13. Mapping the Short Reads
• Taking each read and mapping it to a reference genome
– Bowtie
TGCATGCATGCATGCATGCATGCATGCATGCATGCAAAAAGCATGCATGCA
TGCATGAATGCAAAAAGCATGCA
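To make the idea of mapping concrete, the two sequences above can be compared by brute force. This sketch assumes the longer sequence is the reference and the shorter one is the read. It slides the read along the reference and keeps the offset with the fewest mismatches. It is only an illustration of the concept, not how Bowtie works internally, since Bowtie searches a prebuilt index instead of scanning.

```shell
REFERENCE="TGCATGCATGCATGCATGCATGCATGCATGCATGCAAAAAGCATGCATGCA"
QUERY="TGCATGAATGCAAAAAGCATGCA"

# Try every offset, count mismatching bases, and keep the best offset
# (the first one wins on ties). Prints "<1-based position> <mismatches>".
BEST=$(awk -v ref="$REFERENCE" -v query="$QUERY" 'BEGIN {
    n = length(query)
    best_mm = n + 1
    for (pos = 1; pos + n - 1 <= length(ref); pos++) {
        mm = 0
        for (i = 1; i <= n; i++)
            if (substr(ref, pos + i - 1, 1) != substr(query, i, 1)) mm++
        if (mm < best_mm) { best_mm = mm; best_pos = pos }
    }
    print best_pos, best_mm
}')

BEST_MISMATCHES=${BEST#* }
echo "best position and mismatch count: $BEST"
```

The read does not occur exactly in the reference, so the best placement carries one mismatch. This is the kind of near-match a short-read aligner is designed to tolerate.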
14. Bowtie-Build Command
• In order to map the reads to a genome, you must acquire the genome in the .fasta (.fa) format, and then index it.
• bowtie-build -f <in.fasta> <out_prefix>
– $ bowtie-build SGDv4.fasta SGDv4_bowtie
15. Bowtie command
• Now we map back to the reference we just indexed.
• bowtie <reference_in.prefix> -q <in.fastq> -S <out.SAM> 2> <out.stderr>
– $ bowtie /data/Tutorial/Phil/Genome/Bowtie_index/SGDv3_bowtie -q Sample1.fastq -S Sample1_bowtie.sam 2> Sample1_bowtie.stderr
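The index and mapping steps can be chained in a small script run from your student directory. In this sketch the file names come from the tutorial files, but their placement under Genome/ and Fastq/ and the index prefix Parent113010_bowtie are assumptions. The commands are printed first and run only if Bowtie and the input files are present. Paths are assumed to contain no spaces.

```shell
# Assumed locations, following the Genome/ Bowtie/ Fastq/ layout.
GENOME_FASTA="Genome/Parent113010.fa"
INDEX_PREFIX="Genome/Parent113010_bowtie"   # hypothetical index prefix
READS_FASTQ="Fastq/Sample1_single.fastq"
OUT_PREFIX="Bowtie/Sample1_bowtie"          # hypothetical output prefix

# Build the command lines as strings so they can be reviewed before running.
BUILD_CMD="bowtie-build -f $GENOME_FASTA $INDEX_PREFIX"
MAP_CMD="bowtie $INDEX_PREFIX -q $READS_FASTQ -S $OUT_PREFIX.sam"

echo "$BUILD_CMD"
echo "$MAP_CMD 2> $OUT_PREFIX.stderr"

# Run only when Bowtie is installed and both inputs exist.
if command -v bowtie-build >/dev/null 2>&1 \
   && [ -f "$GENOME_FASTA" ] && [ -f "$READS_FASTQ" ]; then
    mkdir -p Bowtie
    $BUILD_CMD && $MAP_CMD 2> "$OUT_PREFIX.stderr"
fi
```

Redirecting stderr to a file keeps Bowtie's alignment summary next to the SAM output for later reference.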
16. SAM File
• Tab-delimited
• http://genome.sph.umich.edu/wiki/SAM
• Open Example SAM
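Since SAM is tab-delimited, its columns can be pulled out with awk. The sketch below builds a three-line SAM file from made-up records (not tutorial output) and separates mapped from unmapped reads using the 0x4 bit of the FLAG column. The first four columns are read name, flag, reference name, and 1-based position.

```shell
SAM_FILE="$(mktemp)"
# Made-up records: one header line, one mapped read, one unmapped read.
printf '@SQ\tSN:chr1\tLN:51\n' > "$SAM_FILE"
printf 'read1\t0\tchr1\t13\t255\t10M\t*\t0\t0\tACGTCCGGTT\tIIIIIIIIII\n' >> "$SAM_FILE"
printf 'read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTNN\tII##!!\n' >> "$SAM_FILE"

# Header lines start with @. The 0x4 flag bit marks an unmapped read;
# int(flag / 4) % 2 extracts that bit without needing bitwise operators.
MAPPED=$(awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 0 { print $1, $3, $4 }' "$SAM_FILE")
UNMAPPED_COUNT=$(awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 1 { n++ }
                              END { print n + 0 }' "$SAM_FILE")

echo "mapped: $MAPPED"
echo "unmapped reads: $UNMAPPED_COUNT"
```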
25. If you are getting the hang of it quickly…
• Try going through the next few commands
26. BWA Paired end
• /usr/local/src/bwa-0.6.2/bwa index -a is <in.fasta>
• Map each read in the pair independently (run once per read file):
• /usr/local/src/bwa-0.6.2/bwa aln <reference.prefix> <in_1.fq> > <out.sai>
• Finalize the mapping by converting (for both reads) both the .SAI and the .FQ into a final SAM alignment:
• /usr/local/src/bwa-0.6.2/bwa sampe <reference.prefix> <in_1.sai> <in_2.sai> <in_1.fq> <in_2.fq> > <out_paired.sam>
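The paired-end steps can be written out as a short sequence. In this sketch the reference and sample names are hypothetical placeholders. The steps are printed for review and run only if the bwa binary and both read files exist. The reference path is used as the index prefix, which is what bwa index produces when no prefix is given.

```shell
BWA="${BWA:-/usr/local/src/bwa-0.6.2/bwa}"   # path used in the tutorial
REF="Genome/reference.fa"                    # hypothetical reference FASTA
SAMPLE="Fastq/sample"                        # hypothetical; expects sample_1.fq and sample_2.fq

# One line per step: index, align each mate separately, then pair them up.
STEPS="$BWA index -a is $REF
$BWA aln $REF ${SAMPLE}_1.fq > ${SAMPLE}_1.sai
$BWA aln $REF ${SAMPLE}_2.fq > ${SAMPLE}_2.sai
$BWA sampe $REF ${SAMPLE}_1.sai ${SAMPLE}_2.sai ${SAMPLE}_1.fq ${SAMPLE}_2.fq > ${SAMPLE}_paired.sam"

echo "$STEPS"

# Execute only when bwa and both read files are available; -e stops at the first failure.
if [ -x "$BWA" ] && [ -f "${SAMPLE}_1.fq" ] && [ -f "${SAMPLE}_2.fq" ]; then
    echo "$STEPS" | sh -e
fi
```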
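27. Bowtie Unique Mapping
• Investigate the different Bowtie options:
– Look at -m (maximum number of mappings per read; reads with more are suppressed), -v (maximum number of mismatches per read), and -n (maximum number of mismatches in the seed)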
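As a starting point for experimenting, a unique-mapping run might combine the two options as follows. The values 2 and 1, the index prefix, and the output names are assumptions chosen for illustration. The command is printed and run only if Bowtie and its inputs are present.

```shell
INDEX_PREFIX="Genome/Parent113010_bowtie"   # hypothetical index prefix
READS_FASTQ="Fastq/Sample1_single.fastq"    # assumed location of the reads
OUT_PREFIX="Bowtie/Sample1_unique"          # hypothetical output prefix

# -v 2: allow at most 2 mismatches across the read.
# -m 1: suppress reads that align to more than one location.
UNIQUE_CMD="bowtie -v 2 -m 1 $INDEX_PREFIX -q $READS_FASTQ -S $OUT_PREFIX.sam"
echo "$UNIQUE_CMD"

# Run only when Bowtie, the index, and the reads are all available.
if command -v bowtie >/dev/null 2>&1 \
   && [ -f "$INDEX_PREFIX.1.ebwt" ] && [ -f "$READS_FASTQ" ]; then
    $UNIQUE_CMD 2> "$OUT_PREFIX.stderr"
fi
```

Comparing the stderr summary of this run with the default run shows how many reads were dropped for mapping to multiple locations.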