20110524zurichngs 1st pub

Next Generation Sequencing for
Model and Non-Model Organisms
1st day
Jun Sese and Kentaro Shimizu
sesejun@cs.titech.ac.jp

Ph.D course @ Univ. of Zurich
25/05/2011

Today’s Menu
• Lecture
• Overview of next generation sequencer’s analysis
• Mapping: Sequence alignment
• Introduction to UNIX to handle NGS data
• Exercise
• UNIX commands
• Mapping real short reads against genomes
• Compute statistics of the mapped reads


Various Types of Sequencers
• Roche 454, Ion Torrent
• Roche: about 400 bp, Ion Torrent: about 200 bp
• Suitable for de novo sequencing
• Illumina HiSeq
• Widely used next-generation sequencer
• 100 bp x 2, up to 600 Gb/run (HiSeq 2000)
• MiSeq uses almost the same technology but produces fewer reads
• ABI SOLiD
• 75 bp, 75 bp + 35 bp or 60 bp x 2, up to 300 Gb/run (5500xl SOLiD)
• Color space
• Pacific Biosciences PacBio RS
• Average > 500 bp
• Sequence quality is not high.

Sequencing cost has dropped
dramatically

Lincoln Stein, Genome Biology, vol. 11(5), 2010

How large is it?

• The generated files are larger than 300 GB per run
• A hard disk can be read at about 100 MB/sec
• 300 GB / 100 MB/sec
= 300,000 MB / 100 MB/sec
= 3000 sec
= 50 min
• Just reading the data from the HDD takes the computer 50 min!
• Efficient computation is required
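
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name read_time_seconds is ours, and it assumes 1 GB = 1000 MB, as the slide does.

```python
def read_time_seconds(size_gb, speed_mb_per_sec, mb_per_gb=1000):
    """Time in seconds needed to read size_gb gigabytes from disk."""
    return size_gb * mb_per_gb / speed_mb_per_sec


seconds = read_time_seconds(300, 100)
print(seconds, "sec =", seconds / 60, "min")
```

Changing the two arguments shows how the time scales with run size and disk speed.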


Applications of DNA Sequencing
• NGS just reads an enormous number of short sequences, but it has
many biological applications.
• Genetic variation
• Gene regulations
• RNA-seq
• ChIP-seq
• Epigenetics
• Population genetics

Science 2007

Sequencerʼs Output

Genome Sequence

Mapping Program

Mapping Result

Visualization    Further Analysis (SNPs, RNA-Seq, ...)

Major Pipelines of NGS
• Most applications follow a similar procedure.

                                    | Genetic variation | RNA-Seq             | ChIP-Seq
Find originated region (Alignment)  | Map               | Map                 | Map
Filter                              | SNP call          | Measure expressions | Check regulatory regions
Analysis                            | Find difference   | Same as microarray  | Same as ChIP-Chip analysis

Most of them require a whole genome sequence to map the reads against.

Mapping (Pairwise Alignment)
• Find the place from which each read comes
• BLAST is one of the best-known alignment programs.
• Few NGS analyses use BLAST/BLAT because they align too slowly.
• BWA and Bowtie are commonly used to map short reads.
Reads     ATATGCGA

                    ATATGCGA
Reference GATGCTAAGCATATGCGAGGCATGCCATATGGATG

We may find multiple places where a read maps.
A score matrix (distance) defines which mapping is better.

Reads     ATATGCGA

                    ATATGCGA        ATATG-CGA
                     x
Reference GATGCTAAGCAAATGCGAGGCATGCCATATGGCGA
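
The idea of mapping can be sketched as a brute-force scan that slides the read along the reference and counts mismatches. This is only an illustration with a hypothetical helper find_hits; it ignores insertions and deletions (so it cannot find the gapped placement shown above), and real mappers such as BWA and Bowtie use an index instead of scanning.

```python
def find_hits(read, reference, max_mismatches=0):
    """Return (position, mismatches) for every ungapped placement of read
    on reference with at most max_mismatches mismatches (0-based position)."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


READ = "ATATGCGA"
REF_EXACT = "GATGCTAAGCATATGCGAGGCATGCCATATGGATG"
REF_VARIANT = "GATGCTAAGCAAATGCGAGGCATGCCATATGGCGA"

print(find_hits(READ, REF_EXACT))        # exact matches only
print(find_hits(READ, REF_VARIANT, 1))   # allow one mismatch
```

Raising max_mismatches makes more placements acceptable, which is why a scoring scheme is needed to choose among them.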

For Non-Model Organisms

                     | Genetic Variation            | ChIP-Seq                     | RNA-Seq
Genome/Gene Sequence | Read genome, genome assembly | Read genome, genome assembly | Read normalized library, RNA assembly
Map                  | Map new reads                | Map ChIP-Seq reads           | Map onto related species genome; count assembled reads; map new RNA-Seq reads
Filter               | SNP call                     | Check regulatory regions     | Measure expressions
Analysis             | Find difference              | Similar to ChIP-Chip         | Same as microarray

Most cases require genome assembly,
which is costly both experimentally and computationally.

Very Short History of
Pairwise Alignment Programs
• More than 100 alignment programs are listed in Wikipedia!!!
• http://en.wikipedia.org/wiki/Sequence_alignment_software
• 1 sequence vs 1 sequence
• Ssearch, FASTA [Lipman and Pearson. 1985]
• 1 sequence vs Whole genes
• BLAST [Altschul et al. 1990]
• Thousands of sequences vs Whole genes or Whole genomes
• BLAT [Kent. 2002]
• Billions of short sequences vs Whole genome
• BWA, Bowtie, SHRiMP, etc...
• Most modern mappers use the FM-index [Ferragina and
Manzini. 2000] with the Burrows-Wheeler transform [Burrows
and Wheeler. 1994].
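
To give a feeling for the Burrows-Wheeler transform and the FM-index style search, here is a toy sketch. It is not how BWA or Bowtie are implemented: the function names bwt and count_occurrences are ours, the transform is built by sorting all rotations (far too slow for a genome), and occurrence counts are recomputed on the fly instead of being stored in a compact index.

```python
def bwt(text):
    """Burrows-Wheeler transform of text, using '$' as the end marker."""
    text = text + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(pattern, bwt_string):
    """Count occurrences of pattern in the original text by backward search."""
    first_column = sorted(bwt_string)
    # first_row[c]: first row of the sorted rotations that starts with c
    first_row = {}
    for row, char in enumerate(first_column):
        first_row.setdefault(char, row)
    low, high = 0, len(bwt_string)
    for char in reversed(pattern):
        if char not in first_row:
            return 0
        low = first_row[char] + bwt_string[:low].count(char)
        high = first_row[char] + bwt_string[:high].count(char)
        if low >= high:
            return 0
    return high - low


transformed = bwt("banana")
print(transformed)                            # annb$aa
print(count_occurrences("ana", transformed))  # 2
```

The search touches the pattern one character at a time, from its last character to its first, and never scans the text itself.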

Why have so many alignment
programs been developed?
• To computer scientists, alignment looks like an easy task.
• Indexing and dynamic programming, both used in
sequence alignment, are basic algorithms.
• A good problem for homework
• A little performance tuning can speed up execution
dramatically
• In reality, the alignment problem is very hard to solve.
• Mutations, insertions, deletions...
• Each sequencer has its own bias.
• Sequence length, homopolymers in Roche 454...
• Biologists use many heuristics!
• GT-AG rule at splice sites, but not always...
• That is, the problem definition is ambiguous!

Alignment performance varies
• 12 million single-end reads were aligned against the human genome
sequence (hg18)
• Differences in algorithm and implementation show up in the total
processing time
• In most programs, memory usage depends on genome size.
• Parameter settings affect the number of mapped reads.
• The authors did not mention them.
• In real experiments, we have to tune the parameters of the
alignment program.

Bao et al. J Hum Genet, 2011


Sequencerʼs Output
Sequence Format

Genome Sequence

Mapping Program BWA, Bowtie, etc.

Mapping Result

Visualization

Sequence File Format (1)
• FASTA + Quality File
• Used by Roche 454
>1ST_SEQ length=67 xy=1264_0441 region=1 run=R_2010_07_07_16_23_16_
GCGTTGTGTATGTCTCCTTTGGTATGTCAGGTTTCGTCAGAAGCTTCTATCAAACGGCGC
ACAGTGA
>2ND_SEQ length=88 xy=1264_0564 region=1 run=R_2010_07_07_16_23_16_
TCGGCCCTATCCGAGAAGGCGTGGTGTATCTCTCTTCTGGTATGCCACGTTACGCAGCAG
CTTCTTCCCAAGACACAGAGCGAGTAAG

>1ST_SEQ length=67 xy=1264_0441 region=1 run=R_2010_07_07_16_23_16_
37 35 35 35 35 35 37 37 37 37 37 39 39 37 36 35 35 36 37 37 37 37 35 35 32 28 27 27 27 27
29 23 21 21 14 14 12 18 19 19 19 19 19 19 16 16 17 20 22 20 12 12 12 12 11 17 17 17 16 19
22 23 24 21 21 21 18
>2ND_SEQ length=88 xy=1264_0564 region=1 run=R_2010_07_07_16_23_16_
29 30 19 19 19 20 19 24 28 27 27 27 27 27 30 19 19 20 20 20 24 33 33 33 33 33 33 33 35 35
37 37 30 30 30 30 32 32 32 32 35 32 32 32 32 33 33 33 33 20 20 20 23 27 30 30 31 31 27 27
27 27 28 23 24 24 23 23 23 24 24 21 17 19 19 18 27 18 17 16 16 16 17 13 18 17 16 12


Sequence File Format (2)
• FASTQ
• Used by Illumina sequencers
• Sequence database sites (SRA (Short Read Archive), ENA
(European Nucleotide Archive), DRA (DDBJ Sequence Read
Archive)) provide sequences in this format.
• De facto standard
• CSFasta + Quality file
• Used only by SOLiD sequencers
• Similar to a FASTA file, except that sequences are described in color
space.
>SRR038985.100 VAB_AT1deg1_51_269_F3
T10303011231130321000333001323122221
>SRR038985.200 VAB_AT1deg1_78_430_F3
T03102101012320213012132121333132011

>SRR038985.100 VAB_AT1deg1_51_269_F3
0 20 23 21 26 20 21 23 21 20 24 25 26 20 23 19 17 27 26 10 16 16 19 23
19 26 28 9 22 18 21 25 25 23 2 20
>SRR038985.200 VAB_AT1deg1_78_430_F3
0 7 19 26 26 24 8 27 29 23 23 21 21 24 26 19 11 21 25 14 10 19 21 21
25 20 28 20 20 15 23 8 25 23 11 25 17

Color Space
• Format unique to ABI SOLiD.
• Each number represents two adjacent bases
• Each nucleotide is read twice
• A spot detection miss may change the downstream sequence.
• Some software does not support this format.

T10303011 → TGGCCGGTG
T10203011 → TGGAATTGT

[Figure 1: SOLiD Color Space Code (1st base x 2nd base). Double interrogation: each base is defined twice.]
ABI White Paper: Color Space Analysis in the SOLiD System: the Theory, Advantages and Solutions
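
The two reads on this slide (T10303011 and T10203011) can be decoded with a small sketch. It uses the standard SOLiD two-base code (0 keeps the base; 1 swaps A and C, G and T; 2 swaps A and G, C and T; 3 swaps A and T, C and G). The names TRANSITIONS and decode_colorspace are ours, and the leading base is kept in the output as on the slide.

```python
# TRANSITIONS[base][color] is the next base after reading `color`.
TRANSITIONS = {
    "A": "ACGT",
    "C": "CATG",
    "G": "GTAC",
    "T": "TGCA",
}


def decode_colorspace(read):
    """Translate a color-space read such as 'T1030' into bases."""
    base = read[0]          # known first base
    decoded = [base]
    for color in read[1:]:
        base = TRANSITIONS[base][int(color)]
        decoded.append(base)
    return "".join(decoded)


print(decode_colorspace("T10303011"))  # TGGCCGGTG
print(decode_colorspace("T10203011"))  # TGGAATTGT
```

The two reads differ in a single color, yet every base after that position changes. This is why one miscalled spot affects the downstream sequence.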

FASTQ Format
One read
@SRR013343.216 :3:1:837:436 Name
GCGTGGTATAGGAGGCGGAACGGGCGGTTGGCGGTT Sequence
+
I6IIII*II*II+I:+&I)I'&%&%,+0>+'I''$G Quality Score
@SRR013343.217 :3:1:974:526
GCGCATGAGTGGCTTGACTCGTATGCGGATTCCTTC
+
I@II6I<I/III;II+)I*II*DI*I?')+*+8/%8
@SRR013343.218 :3:1:755:341
GTGGAGTAGGTTAGTTGCGGATCGTATGCCGTCTTC
+
IIIIIIIIIIAIIIIII<II6?II3/AD26=:-9I'
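
A FASTQ file can be read four lines at a time. The sketch below is a minimal reader under the assumption that every record occupies exactly four lines, as in the example above; the function name read_fastq is ours.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) for each four-line FASTQ record."""
    while True:
        name = handle.readline().rstrip("\n")
        if not name:
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield name[1:], sequence, quality  # drop the leading '@'


EXAMPLE = """@SRR013343.216 :3:1:837:436
GCGTGGTATAGGAGGCGGAACGGGCGGTTGGCGGTT
+
I6IIII*II*II+I:+&I)I'&%&%,+0>+'I''$G
@SRR013343.217 :3:1:974:526
GCGCATGAGTGGCTTGACTCGTATGCGGATTCCTTC
+
I@II6I<I/III;II+)I*II*DI*I?')+*+8/%8
"""

for name, sequence, quality in read_fastq(io.StringIO(EXAMPLE)):
    print(name, len(sequence), len(quality))
```

For a real file, pass an open file object instead of the io.StringIO example.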


PHRED quality encoding
$Q = -10 \log_{10} P \iff P = 10^{-Q/10}$

• Q=20: 99% accuracy, Q=30: 99.9% accuracy
• The quality value scale differs slightly between PHRED
and Illumina/SOLiD results
• FASTQ and SAM encode each quality value as one character of the
quality string: Q = ASCII value - 33
• For Illumina 1.3+, the offset was changed to 64:
Q = ASCII value - 64

! 33   ' 39   - 45   3 51   9 57   ? 63   ...
" 34   ( 40   . 46   4 52   : 58   @ 64   ...
# 35   ) 41   / 47   5 53   ; 59   A 65   ...
$ 36   * 42   0 48   6 54   < 60   B 66   ...
% 37   + 43   1 49   7 55   = 61   C 67   ...
& 38   , 44   2 50   8 56   > 62   D 68   ...
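
The encoding can be reversed in a few lines. This is a sketch with our own function names (phred_scores, error_probability); the offset argument is 33 by default and 64 for the Illumina 1.3+ style mentioned above.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ/SAM quality string into a list of Q values."""
    return [ord(character) - offset for character in quality_string]


def error_probability(q):
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


scores = phred_scores("I6*")
print(scores)                                   # [40, 21, 9]
print([error_probability(q) for q in scores])
```

Using the wrong offset shifts every Q value by 31, so it is worth checking which encoding a file uses before filtering by quality.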

Sequencerʼs Output
Sequence Format

Genome Sequence

Mapping Program BWA, Bowtie, etc.

Mapping Result Output Format

Visualization

SAM Format

• Sequence Alignment/Map format
• Simple tab-delimited text file
• Standardized alignment output format
• Modern alignment tools support this format
• BAM is the binary version of the SAM format.

@HD VN:1.0
@SQ SN:chr20 LN:62435964
@RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
@RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195
AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<
NM:i:1 RG:Z:L1
read_28701_28881_323b 147 chr20 28834 30 35M = 28701 -168
ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA <<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<<
MF:i:18 RG:Z:L2

Overview
<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS>
<ISIZE> <SEQ> <QUAL> [<TAG>:<VTYPE>:<VALUE> [...]]

read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195
AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<
NM:i:1 RG:Z:L1


Flag

• Bitwise notation: computer friendly (but not human friendly :)
• 16 = 0x0010: mapped to the reverse strand
• 4 = 0x0004: unmapped
• 0 = 0x0000: mapped to the forward strand
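
A flag is easier to read after splitting it into its bits. The sketch below covers the eight lowest bits defined in the SAM specification; the short labels and the function name decode_flag are our own wording.

```python
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper pair"),
    (0x4, "unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
]


def decode_flag(flag):
    """List the meaning of every bit that is set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


for flag in (0, 4, 16, 99, 147):
    print(flag, decode_flag(flag))
```

Flags 99 and 147 are the values of the two paired reads in the SAM example earlier.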


CIGAR

• Shows the alignment result compactly
• 8M9D7M
• 8 bp match, 9 bp deletion (reference bases missing from the read), and then 7 bp match

         8M      9D       7M
         CATATGCG---------ATATGGA
         ||||||||         |||||||
GATGCTAAGCATATGCGAGGCATGCCATATGGATG

The 4th column, “POS”, gives the position on the reference where the alignment starts.
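
A CIGAR string can be split into (length, operation) pairs with a regular expression. The sketch below uses our own function names and the CIGAR strings from the SAM example earlier (10M1D25M and 35M). Which operations consume the reference and which consume the read follows the SAM specification.

```python
import re


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    return [(int(length), op)
            for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS=X")


print(parse_cigar("10M1D25M"))
print(reference_span("10M1D25M"), read_length("10M1D25M"))
```

Adding the reference span to POS gives the end of the alignment on the reference, which is useful when counting reads per region.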

Summary
• No standard tools for analyzing NGS data
• Q&A sites are good resources
• SeqAnswers.com
• biostar.stackexchange.com
• Many algorithms and software packages have been
developed.
• See http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html
• Most of them work from the UNIX command line
• Few analysis tools have a GUI
• Galaxy (Free, requires server setup)
• BioScope (Only available with SOLiD sequencer)


Unix Commands
Sequencerʼs Output
Sequence Format

Genome Sequence

Mapping Program BWA, Bowtie, etc. (performed with UNIX commands)
Mapping Result Output Format

Visualization

Preparation
• An NGS analysis generates many files.
• Even in this lecture, we will generate 50 files.
• We use the directory created by extracting “ngslec.zip.”
• Extract the zip file in your home directory.
• To move to the directory, type the following commands
in Terminal

$ cd ngslec
$ pwd
/Users/YOUR_DIRECTORY/ngslec/


Use “Terminal”
• The Operating System (OS) handles every operation on a computer.
• Reading files, mouse clicks, displaying characters, ...
• On a UNIX OS, we can use the OS functions through the application
“Terminal”
• Applications > Utilities > Terminal
• UNIX: Linux, IBM AIX, Sun OS, Mac OS X
• Not UNIX: Windows and Mac OS 9 or earlier
• In the terminal, we can use shell commands.
• An application consists of a sequence of shell commands.
• A complicated program is made of a set of tiny programs.
• We first learn how to use the tiny programs, and then how to
combine them.

Kernel Shell Terminal

Command and Arguments
$ rm -r arg1 arg2
(A) rm   (B) -r   (C) arg1   (D) arg2

(A) Command (order): runs the command called “rm”
(B), (C) and (D) Arguments: separated by a space character, both
between the command and the arguments and between arguments
(B) Arguments that switch sub-functions of the command are
called “options.” Options start with “-” or “--”
(C) First argument. Options are not counted when numbering arguments.
(D) Second argument.

Example: date command
• Type “date” + [Return] to show the current time

• With the option “-u”, the “date” command shows
Coordinated Universal Time.
• If you misspell a command, the terminal says “command
not found.”
• Commands (and file names) are case sensitive on
UNIX, except on Mac OS X.


File System
• You usually use the file system through “Finder.” In this lecture,
we will use it from “Terminal.”
• Tree structure rooted at “/”
• USB memories and DVDs are also managed through the file system.

/

usr Volume

bin lib pics

USB zurich

Directories and Files
• Current directory
• The directory in which you are working
• You can check it with the “pwd” command.
• Home directory
• Root (top) of your personal directory
• Denoted by “~” or “$HOME”
• When your current directory is “/Users/sesejun”
• The pwd command shows /Users/sesejun
• /usr/lib indicates *
• usr/lib indicates **
• “.” is equal to “/Users/sesejun”
• “..” is equal to “/Users”
• ../../usr/lib is equal to “/usr/lib”

/
  usr
    bin
    lib  (*)
  Users
    sesejun
      usr
        lib  (**)

cd: Change Directory
• cd destination-dir
• Change your current directory to destination-dir
• When you omit the argument, you move to your home
directory.

jsmbp:~ sesejun$ pwd
/Users/sesejun
jsmbp:~ sesejun$ cd /usr/
jsmbp:/usr sesejun$ pwd
/usr
jsmbp:/usr sesejun$ cd lib
jsmbp:/usr/lib sesejun$ pwd
/usr/lib
jsmbp:/usr/lib sesejun$ cd /usr/bin/
jsmbp:/usr/bin sesejun$ pwd
/usr/bin
jsmbp:/usr/bin sesejun$ cd
jsmbp:~ sesejun$ pwd
/Users/sesejun
jsmbp:~ sesejun$

ls (LiSt): Show List of Files
• Shows the files in the current directory when no arguments are given
• Important options
• -a: Show all files (files whose names start with “.” are hidden
without this option)
• -l: Show detailed information about files
• -h: Show file sizes in a human-friendly format (usually used
with option “-l”)
$ ls
Desktop Music largefile
$ ls -l
drwx------+ 8 sesejun staff 272 5 16 00:09 Desktop
drwx------+ 3 sesejun staff 102 10 27 2010 Movies
-rw-r--r-- 1 sesejun staff 4181139 5 16 08:20 largefile
$ ls -lh
drwx------+ 8 sesejun staff 272B 5 16 00:09 Desktop
drwx------+ 3 sesejun staff 102B 10 27 2010 Movies
-rw-r--r-- 1 sesejun staff 4.0M 5 16 08:20 largefile

cp: Copy Files
• cp [options] source-file ... directory
• cp [options] source-file new-file
• Options:
• Copy text1.txt to text2.txt
$ cp text1.txt text2.txt

• Copy text1.txt and text2.txt into the “tmp” directory

$ cp text1.txt text2.txt tmp/
$ ls tmp
text1.txt text2.txt


mv: Move files
• Also used to change file names
• mv [options] source-file ... directory
• mv [options] old-path new-path
• Change filename text1.txt to text2.txt
$ mv text1.txt text2.txt

• Move text1.txt and text2.txt into tmp directory

$ mv text1.txt text2.txt tmp/
$ ls
tmp
$ ls tmp/
text1.txt text2.txt


rm (ReMove): Delete files
• Options:
• -r: Remove a directory together with all the files in it
• -i: Confirm before removing each file.
• Delete text1.txt and text2.txt
jsmbp:~ sesejun$ rm text1.txt text2.txt

• Delete the tmp directory and all the files within it
• Note: These files are “really” removed. They never
go to “Trash.” We cannot undo this.
jsmbp:~/test sesejun$ ls
tmp
jsmbp:~/test sesejun$ ls tmp/
text1.txt text2.txt
jsmbp:~/test sesejun$ rm -r tmp/
jsmbp:~/test sesejun$ ls
jsmbp:~/test sesejun$

Exercise (1)
• Run commands
• Run date and date -u, and check the results.
• Run the command “cal.” What is the result?
• Change directory
• Run the examples on the “cd” page
• Check, make and remove a directory
• Open the directory named after your login name in Finder.
• Move to your home directory in Terminal.
• Just open Terminal.
• Run ls and compare the result with what Finder shows.


Note
• Commands and messages in Terminal are written in
“Courier Font”
• Lines starting with “#” are comment lines. You do not
need to type them in Terminal.
• Lines whose last character is “\” continue on the next line.
Enter such lines as one line.
• You can run commands with “cut and paste.”
• When you do, double quotation marks (“) can cause trouble
because of differences between character types. Retyping the
double quotation marks solves the problem.
• Bar (|) can be input by Alt + 7.
• In Terminal, you can recall previous commands by
pressing the up cursor key.
• The “Tab” key may complete your command or filename.

cat (conCATenate)
• cat [options] file ...
• Original usage is file concatenation.
• Details are shown later
• Sometimes this command is used to show the contents of a file.
• Options:
• -n: show line numbers

$ cat text1.txt
How are you ?
$ cat text2.txt
Hello!
Thank you!
Good Bye!
$ cat text1.txt text2.txt
How are you ?
Hello!
Thank you!
Good Bye!
$ cat -n text2.txt
1 Hello!
2 Thank you!
3 Good Bye!

head, tail (Show first or last
part of file)
• head [-n num] file ...
• Show the first 10 lines
• -n num: show the first num lines
• tail [-n num] file ...
• Show the last 10 lines
• -n num: show the last num lines
• With +num, you can see the file from the num-th line to the
last line.
• Because NGS files are large, these commands are used
frequently.
• Most editors cannot open NGS files.

$ cat text2.txt
Hello!
Thank you!
Good Bye!
$ head -n2 text2.txt
Hello!
Thank you!
$ tail -n2 text2.txt
Thank you!
Good Bye!
$ tail -n+3 text2.txt
Good Bye!

less
• less <filename>

• Show files interactively
• Space: Next page
• ‘b’: Previous page
• ‘q’: Quit
• ‘/’ + [word]: search for [word] and jump to the first match.
The word is highlighted.
• To move to the next match, press ‘n.’
• Frequently used to check the contents of a (large) file such as
a FASTA file


cut -Show columns-
• cut [options] file ...

• Show selected columns
• Options:
• -f <list of nums>: Show the <list of nums>-th columns. The
-d option sets the separator between columns. The default
separator is “\t” (Tab).
• -c <list of nums>: Show the <list of nums>-th characters.
• Examples of “list of nums”
• 1,3,5: 1st, 3rd and 5th columns
• 1-5: From 1st to 5th columns
• 1,3,5-: 1st, 3rd and from 5th to last columns.
• This command is also frequently used to handle NGS files.

Sort
• sort [options] file ...

• Arrange file contents in alphabetical order
• Options:
• -r: reverse order
• -n: order by numerical value
• -k POS: order according to the POS-th column. By default, columns
are separated by blanks (spaces or tabs). We can change the
separator with the “-t” option.

$ cat text2.txt
Hello!
Thank you!
Good bye!
$ sort text2.txt
Good bye!
Hello!
Thank you!
$ sort -r text2.txt
Thank you!
Hello!
Good bye!

$ cat nums.tab
11.2 13.2
10.9 7.7
15.2 7.0
9.4 10.9
8.8 9.1
$ cut -f1 nums.tab
11.2
10.9
15.2
9.4
8.8
$ cut -f1 -d . nums.tab
11
10
15
9
8
$ cut -c1-3 nums.tab
11.
10.
15.
9.4
8.8

$ sort -n nums.tab
8.8 9.1
9.4 10.9
10.9 7.7
11.2 13.2
15.2 7.0
$ sort -n -k2 nums.tab
15.2 7.0
10.9 7.7
8.8 9.1
9.4 10.9
11.2 13.2
$ sort nums.tab
10.9 7.7
11.2 13.2
15.2 7.0
8.8 9.1
9.4 10.9

Exercise (2)
• Create two files “text1.txt” and “text2.txt”
• Run the cat, head and tail commands as in the
examples.
• Create the file “nums.tab”
• The character between numbers (columns) is “tab.”
• Try the cut and sort commands as in the examples.


Redirect (>)
• command > file
• Save the command's output into “file.”
• Overwrites the contents of file.
• The first command below saves the result of “sort -n nums.tab”
into “nums_sort.tab”
• command >> file
• Append the command's output to “file.”

$ sort -n nums.tab > nums_sort.tab
$ sort -n nums.tab >> nums_sort.tab


Pipe (|)
• command1 | command2
• Run command2 with command1’s output as its input
$ sort -n nums.tab
8.8 9.1
9.4 10.9
10.9 7.7
11.2 13.2
15.2 7.0
$ sort -n nums.tab | cat -n
1 8.8 9.1
2 9.4 10.9
3 10.9 7.7
4 11.2 13.2
5 15.2 7.0
$ sort -n nums.tab | cat -n | head -n2
1 8.8 9.1
2 9.4 10.9

$ sort -n nums.tab | cat -n
produces the same result as
$ sort -n nums.tab > nums_sort.tab
$ cat -n nums_sort.tab

Commands used with pipe
• sort, cut
• less
• wc [options] file ...
• Word Count
• Show number of lines, words and characters.

$ sort nums.tab | less
$ wc nums.tab
5 10 45 nums.tab
# columns: number of lines, words and characters
$ wc -l nums.tab
5 nums.tab
# shows only the number of lines


gzip and bzip2
• Source code and sample datasets are distributed as tar and
gzip/bzip2 files.
• For a single file, gzip/bzip2 alone is used.
• “tar” can generate a single file containing files and folders.
• gzip/bzip2 compress a file
• gzip is the most frequently used. bzip2 files are usually smaller
than gzip files.

$ ls -lh chr21.fa.gz
-rw-r--r-- 1 sesejun sesejun 12M May 20 15:09 chr21.fa.gz
$ gzip -d chr21.fa.gz
# Decompress chr21.fa.gz and generate chr21.fa.
$ ls -lh chr21.fa
-rw-r--r-- 1 sesejun sesejun 47M May 20 15:09 chr21.fa
$ gzip chr21.fa
# Compress

$ ls -lh chr21.fa.bz2
-rw-r--r-- 1 sesejun sesejun 9.7M May 20 15:09 chr21.fa.bz2

tar (Tape ARchive)
• Generates a single file containing files and folders.
• Frequently used with gzip/bzip2
• Remember the following idioms!
• We will use this to install programs to analyze NGS data.

with gzip
1. $ gzip -dc file.tar.gz | tar xvf -

2. $ tar zxvf file.tar.gz

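with bzip2
1. $ bzip2 -dc file.tar.bz2 | tar xvf -

2. $ tar jxvf file.tar.bz2

Some older tar implementations have no option to decompress bzip2; GNU tar and the tar shipped with Mac OS X accept “j”.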


grep (g/re/p)
• grep [options] pattern file ...
• Print lines matching a pattern
• Options:
• -v: print non-matching lines
• -e <regular expression>: select lines matching the regular expression
• Regular expression
• A specific pattern language for expressing character sequences
• ^: The beginning of a line
• $: The end of a line
• Supported by most programming languages. Very useful for handling
various formats, including DNA/protein sequences.

$ cat nums.tab
11.2 13.2
10.9 7.7
15.2 7.0
9.4 10.9
8.8 9.1
$ grep "7" nums.tab
10.9 7.7
15.2 7.0
$ grep -v "7" nums.tab
11.2 13.2
9.4 10.9
8.8 9.1
$ grep -e "^1" nums.tab
11.2 13.2
10.9 7.7
15.2 7.0

Exercise (3)
• Use “TAIR10_chr1.fas”
• A.thaliana chromosome 1 sequence
• Select the annotation lines from the FASTA file.
• FASTA format
• A line starting with “>” is the annotation of a sequence.
• The lines following the annotation contain the
nucleotide or amino acid sequence.
• To select the annotations, select lines starting with “>”
• Count the number of nucleotides in (Multi) FASTA format
• Lines containing nucleotides do not start with “>”
• Number of nucleotides = number of characters
• Use the “wc” command
• Note that the end of each line contains a “Return” character
>gi|29028877|gb|BT005883|U23535
ATGGAAAGCAAAGGAAGAATCCATCCATCTCATCATCATATGAGGCGTCCTCTTCCAGGTCCCGGTGGCTGTATAGCGCA
TCCGGAGACTTTCGGTAATCACGGTGCTATACCACCTTCTGCTGCTCAAGGTGTGTATCCTTCCTTCAACATGTTACCTC
CACCTGAAGTTATGGAGCAAAAGTTTGTGGCACAACACGGGGAATTACAGAGACTTGCTATAGAGAATCAGAGACTTGGT
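
After trying the exercise with grep and wc, the result can be cross-checked with a short script. This is a sketch, not part of the lecture material: the function name count_fasta is ours, and it simply treats every line that does not start with “>” as sequence, without counting the “Return” character at the end of each line.

```python
def count_fasta(lines):
    """Return (number of annotation lines, number of sequence characters)."""
    n_annotations = 0
    n_characters = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            n_annotations += 1
        else:
            n_characters += len(line)
    return n_annotations, n_characters


example = [">seq1\n", "ACGT\n", "AC\n", ">seq2\n", "GG\n"]
print(count_fasta(example))  # (2, 8)
```

For a real file, pass an open file object, for example count_fasta(open("TAIR10_chr1.fas")).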

Let’s start NGS analysis!
• Dataset
• TAIR 10 genome (A.thaliana)
• 1/100 scale SOLiD RNA-Seq read sets
• Filenames: tha_reads.csfasta & tha_reads_QV.qual
• SRR038985: 41,117,124 reads, 1,439,099,340 bp
• http://trace.ddbj.nig.ac.jp/DRASearch/experiment?acc=SRX018529
• Filenames: lyr_reads.csfasta & lyr_reads_QV.qual
• SRR038987: 41,340,154 reads, 1,446,905,390 bp
• http://trace.ddbj.nig.ac.jp/DRASearch/experiment?acc=SRX018531
• 1/10 scale Roche 454 Read Set (SRR020799)

$ grep -e "^>" tha_reads.csfasta | wc -l
411171

Installing BWA
• In this lecture we skip this procedure, because our computers do not
have the “gcc” command needed to compile C programs.
• Download BWA
• http://bio-bwa.sourceforge.net/
• bwa-0.5.8c.tar.bz2 is on the USB memory. Copy the file.
• Extract the file
• Move into the BWA directory
• Compile the source programs
• Make the alias name “bwa” for the bwa-0.5.8c directory
# $ curl -O http://switch.dl.sourceforge.net/project/bio-bwa/bwa-0.5.8c.tar.bz2
# $ bzip2 -dc bwa-0.5.8c.tar.bz2 | tar xvf -
# ...filenames...
# $ ln -s bwa-0.5.8c bwa # Simplify the directory name
# $ cd bwa
# $ make
# ...compile messages...
# $ cd .. # back to working directory

Prepare A.thaliana Genome
• Download chromosomes from the TAIR site
• http://www.arabidopsis.org/
• Find the URLs by selecting the “Download” tab > Sequences >
whole_chromosomes
• In the current version, each file contains one chromosome.
• TAIR10_chr1.fas, TAIR10_chr2.fas, TAIR10_chr3.fas,
TAIR10_chr4.fas, TAIR10_chr5.fas, TAIR10_chrC.fas,
TAIR10_chrM.fas
• Because of limited server and network capacity, these files are
distributed by USB or web site for this lecture.
• Concatenate these chromosomes, except chloroplast and
mitochondria, into a single file


# We skip this process
#$ curl -O "ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/TAIR10_chr[1-5].fas"
## [1-5] means consecutive numbers from 1 to 5.
## We do not use the chloroplast and mitochondria genomes.
# Instead of downloading, we use the files from the USB memory.
# The files are in your working directory.
# Check them with the command below.
$ ls TAIR10*
TAIR10_chr1.fas TAIR10_chr3.fas TAIR10_chr5.fas
TAIR10_chr2.fas TAIR10_chr4.fas
# Concatenate all chromosomes into a single file
$ cat TAIR10_chr1.fas TAIR10_chr2.fas TAIR10_chr3.fas \
TAIR10_chr4.fas TAIR10_chr5.fas > TAIR10_chr_all.fas
# Check the result
$ grep -e "^>" TAIR10_chr_all.fas
>Chr1 CHROMOSOME dumped from ADB: Jun/20/09 14:53; last updated: 2009-02-02
>Chr2...
# You can find the annotations of 5 chromosomes

Run BWA
• Make an index of the genome sequence
• For SOLiD reads, the “-c” option is required.
• This step is needed only once as long as you use the same
genome (it does not depend on the read sequences).
• Convert the color-space reads into the BWA-specific format
• You don’t need this step for Illumina reads.
• Illumina sequencers produce FASTQ files, and most
alignment software can handle them directly.
• Map the reads against the genome sequence
• If you use Illumina reads, the -I option may be required. Check your
Illumina version.
• The two steps above may take a long time. This lecture’s toy data
is 1/100 scale; real data will require more than two hours.
$ ./bwa/bwa index -c TAIR10_chr_all.fas
# running messages. Takes more than 3 mins.
$ python csfasta2fastq.py --bwa tha_reads > tha_reads.bwa
$ ./bwa/bwa aln -c TAIR10_chr_all.fas tha_reads.bwa > tha_reads.sai
# messages...about 1min. Alignment phase.

Run BWA (continued)
• Convert the mapping result into SAM format.
• For paired-end experiments, you have to use “sampe” instead of
“samse” to put mate pair information into the SAM file.
• That’s all! Check the contents of the SAM file with the less command.
• How many reads can be mapped against the genome?

$ ./bwa/bwa samse TAIR10_chr_all.fas tha_reads.sai tha_reads.bwa > tha_reads.sam
# messages. Generate summary of alignment.
# If you have paired-end reads, use sampe instead of samse.

$ less tha_reads.sam
# Press "q" to quit the less command.
# Press "space" for the next page.

Inside a SAM file
@SQ SN:Chr1 LN:30427671
@SQ SN:Chr2 LN:19698289
@SQ SN:Chr3 LN:23459830
@SQ SN:Chr4 LN:18585056
@SQ SN:Chr5 LN:26975502
@PG ID:bwa PN:bwa VN:0.5.9-r16
SRR038985.100 0 Chr5 22828962 37 33M * 0 0 GCCGGTGATGTAATCAAAATATTTGCTACTCTT WZYTWWTW]YVUOW]OEKNUUX]PJSRY][63 XT:A:U CM:i:0 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:33
SRR038985.200 0 Chr3 14197678 0 33M * 0 0 ACCTGGTTGATCCTGCCAGTAGTCATATGCTTG X]]KN]]YWUX]XIKYRCHSUYX[[SNQJL[MO XT:A:R CM:i:0 X0:i:2 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:33 XA:Z:Chr2,+3707,33M,0;
SRR038985.300 4 * 0 0 * * 0 0 AAACTGCGGGGTCTCACTTTTTTGGGTTTGGGGT 124,/08/5&6-&,(;/4+%7,+5.:1',*;8:&

• @SQ lines: chromosome (mapped database) information
• @PG line: the program used and its variables
• SRR038985.100: mapped read in forward direction on Chr5
• SRR038985.300: unmapped read

Exercise (4)
• Run BWA
• Compare the file size of the csfasta + qual files with that of the generated SAM file.
• Which is larger? How much disk space do we need for the analysis?
• Check the details of the SAM file
• Format details are described in http://samtools.sourceforge.net/SAM1.pdf
• How many reads are mapped onto chromosomes?
• Select lines containing “Chr” # use grep
• Then, count the number of lines # use wc
• Calculate the ratio of mapped reads to total reads.
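
The grep and wc answer can be cross-checked with a small script that uses the FLAG column instead of searching for “Chr”. This is a sketch with our own function name mapped_ratio; it assumes tab-separated SAM lines, skips header lines starting with “@”, and treats a read as mapped when bit 0x4 (unmapped) is not set. The three example records are shortened versions of the ones on the previous slide.

```python
def mapped_ratio(sam_lines):
    """Return (mapped reads, total reads) for an iterable of SAM lines."""
    mapped = 0
    total = 0
    for line in sam_lines:
        if line.startswith("@"):      # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])         # 2nd column is FLAG
        total += 1
        if not flag & 0x4:            # bit 0x4 means "unmapped"
            mapped += 1
    return mapped, total


example = [
    "@SQ\tSN:Chr1\tLN:30427671",
    "\t".join(["SRR038985.100", "0", "Chr5", "22828962", "37", "33M"]),
    "\t".join(["SRR038985.200", "0", "Chr3", "14197678", "0", "33M"]),
    "\t".join(["SRR038985.300", "4", "*", "0", "0", "*"]),
]
mapped, total = mapped_ratio(example)
print(mapped, total, mapped / total)
```

For the real result, pass an open file object such as open("tha_reads.sam").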


Problems
• The mapped read ratio may be much lower than expected.
• Genome quality is (probably) high.
• Various problems
• Wet problems
• Protocols and reagents
• Mitochondria and chloroplast.
• Dry problems
• We used all sequences. We may need to remove low-quality
reads.
• Sequence quality at the 3’-end is low. We might trim these
sequences.
• We did not take care of reads on splice junctions.
• We did not change any parameters in BWA. The
parameters might not be suitable for our reads.
• No one has a one-size-fits-all solution.
• Note!!! The mapped ratio of current RNA-Seq reads is (extremely)
higher than this result.
