1. Next Generation Sequencing for
Model and Non-Model Organism.
1st day
Jun Sese and Kentaro Shimizu
sesejun@cs.titech.ac.jp
Ph.D course @ Univ. of Zurich
25/05/2011
2. Today’s Menu
• Lecture
• Overview of next generation sequencer’s analysis
• Mapping: Sequence alignment
• Introduction to UNIX to handle NGS data
• Exercise
• UNIX commands
• Mapping real short reads against genomes
• Compute statistics of the mapped reads
3. Various Types of Sequencers
• Roche 454, IonTorrent
• Roche: about 400bp, Ion Torrent: about 200bp
• Suitable for de novo sequencing
• Illumina HiSeq
• Widely used next generation sequencer
• 100bp x 2, up to 600 Gb/run (HiSeq 2000)
• MiSeq uses almost the same technology but produces fewer reads
• ABI SOLiD
• 75bp, 75bp+35bp or 60bpx2 up to 300 Gb/run
(5500xl SOLiD)
• Color Space
• Pacific Biosciences PacBio RS
• Average > 500 bp
• Sequence quality is not high.
5. How large is it?
• Generated file size is more than 300GB/run
• We can read data from hard disks at about 100 MB/sec
• 300GB / 100MB/sec
= 300,000MB / 100MB/sec
= 3000 sec
= 50min
• Just reading the data from the HDD takes 50min!
• Efficient computation is required
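The estimate above can be reproduced with shell arithmetic. This is only a sketch: the variable names are made up for illustration, and it uses the same rounding as the slide (1 GB = 1000 MB).

```shell
# Values taken from the slide; variable names are illustrative only.
size_gb=300            # data generated per run
rate_mb_per_sec=100    # assumed sequential read speed of the hard disk

# Convert GB to MB, then divide by the read rate to get seconds.
seconds=$(( size_gb * 1000 / rate_mb_per_sec ))
minutes=$(( seconds / 60 ))

echo "${seconds} sec = ${minutes} min"
```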
6. Applications of DNA Sequencing
• NGS just reads an enormous number of short sequences, but it has
many biological applications.
• Genetic variation
• Gene regulations
• RNA-seq
• ChIP-seq
• Epigenetics
• Population genetics
Science 2007
7. Sequencerʼs Output
Genome Sequence
Mapping Program
Mapping Result
Visualization, Further Analysis (SNPs, RNA-Seq, ...)
8. Major Pipelines of NGS
• Most of the applications use a similar procedure.
                                     Genetic variation   RNA-Seq               ChIP-Seq
Find originated region (Alignment)   Map                 Map                   Map
Filter                               SNP call            Measure expressions   Check regulatory regions
Analysis                             Find difference     Same as microarray    Same as ChIP-Chip analysis
Most of them require a whole genome sequence to map reads.
9. Mapping (Pairwise Alignment)
• Find the place from which each read comes
• BLAST is one of the best-known alignment programs.
• Few NGS analyses use BLAST/BLAT because their alignment speed is
too slow.
• BWA and Bowtie have been used to map short reads.
Reads      ATATGCGA
                     ATATGCGA
Reference  GATGCTAAGCATATGCGAGGCATGCCATATGGATG
We may find multiple mapped places.
Score matrix (distance) defines which map is better.
Reads      ATATGCGA
                     ATATGCGA        ATATG-CGA
                      x
Reference  GATGCTAAGCAAATGCGAGGCATGCCATATGGCGA
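As a rough illustration of "finding the place from which a read comes", the sketch below looks for an exact occurrence of the read in the first reference sequence of this slide. It is not how BWA or Bowtie work: they use an index and allow mismatches and gaps. The variable names are made up for this example.

```shell
# Read and reference copied from the first example on this slide.
reference=GATGCTAAGCATATGCGAGGCATGCCATATGGATG
read_seq=ATATGCGA

# awk's index() returns the 1-based position of the first exact
# occurrence of the read in the reference (0 if there is none).
position=$(awk -v ref="$reference" -v rd="$read_seq" \
  'BEGIN { print index(ref, rd) }')

echo "read maps at 1-based position ${position}"
```

An exact search like this would miss both alignments in the second example, where the read differs from the reference by one mismatch or one gap.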
11. For non-model organism
                       Genetic Variation   ChIP-Seq                   RNA-Seq
Genome/Gene Sequence   Read genome         Read genome                Read normalized library
Assembly               Genome assembly     Genome assembly            RNA assembly
Map                    Map new reads       Map ChIP-Seq reads         Map onto related species genome / Count assembled reads / Map new RNA-Seq reads
Filter                 SNP call            Check regulatory regions   Measure expressions
Analysis               Find Difference     Similar to ChIP-Chip       Same as microarray
Most cases require genome assembly,
which is costly both experimentally and computationally
12. Very Short History of
Pairwise Alignment Programs
• More than 100 alignment programs are listed in Wikipedia!!!
• http://en.wikipedia.org/wiki/Sequence_alignment_software
• 1 sequence vs 1 sequence
• Ssearch, FASTA [Lipman and Pearson. 1985]
• 1 sequence vs Whole genes
• BLAST [Altschul et al. 1990]
• Thousands of sequences vs Whole genes or Whole genomes
• BLAT [Kent. 2002]
• Billions of short sequences vs Whole genome
• BWA, Bowtie, SHRiMP, etc...
• Most modern mappers use the FM-index [Ferragina and
Manzini. 2000] with the Burrows-Wheeler transform [Burrows
and Wheeler. 1994].
13. Why so many alignment
programs have been developed?
• To computer scientists, alignment looks like an easy task.
• Both indexing and dynamic programming, which are used in
sequence alignment, are basic algorithms.
• Good problem for homework
• A little performance tuning can accelerate execution
speed dramatically
• In reality, the alignment problem is very hard to solve.
• Mutations, insertions, deletions...
• Each sequencer has a unique bias.
• Sequence length. Homopolymers in Roche 454...
• Biologists rely on many heuristics!
• GT-AG rule on splice sites, but not always...
• That is, the problem definition is ambiguous!
14. Alignment performance varies
• Aligned 12 million single-end reads against the human genome
sequence (hg18)
• Differences in algorithm and implementation appear in the total
processing time
• In most programs, memory usage depends on genome size.
• Parameter settings affect the number of mapped reads.
• The authors did not mention them.
• In real experiments, we have to tune the parameters of the
alignment program.
Bao et al. J Hum Genet, 2011
15. Sequencerʼs Output
Sequence Format
Genome Sequence
Mapping Program BWA, Bowtie, etc.
Mapping Result
Visualization
17. Sequence File Format (2)
• FASTQ
• Used by Illumina sequencers
• Sequence database sites (SRA (Short Read Archive), ENA
(European Nucleotide Archive), DRA (DDBJ Sequence Read
Archive)) provide sequences in this format.
• De facto standard
• CSFasta + Quality file
• Only used by SOLiD sequencers
• Similar to a FASTA file except that sequences are described in color
space.
>SRR038985.100 VAB_AT1deg1_51_269_F3
T10303011231130321000333001323122221
>SRR038985.200 VAB_AT1deg1_78_430_F3
T03102101012320213012132121333132011
>SRR038985.100 VAB_AT1deg1_51_269_F3
0 20 23 21 26 20 21 23 21 20 24 25 26 20 23 19 17 27 26 10 16 16 19 23
19 26 28 9 22 18 21 25 25 23 2 20
>SRR038985.200 VAB_AT1deg1_78_430_F3
0 7 19 26 26 24 8 27 29 23 23 21 21 24 26 19 11 21 25 14 10 19 21 21
25 20 28 20 20 15 23 8 25 23 11 25 17
18. Color Space
• ABI SOLiD unique format.
• Each number represents two adjacent bases
• Each nucleotide is read twice
• A spot detection miss may change the downstream sequence.
• Some software does not support this format.
T10303011
TGGCCGGTG
T10203011
TGGAATTGT
[Figure: SOLiD Color Space Code (1st Base x 2nd Base). Double Interrogation: each base is defined twice. Source: ABI White Paper, "Color Space Analysis in the SOLiD System: the Theory, Advantages and Solutions", Figure 1]
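The two examples on this slide can be decoded with a small lookup table. The sketch below assumes the usual SOLiD two-base code (0 keeps the base, 1 swaps A/C and G/T, 2 swaps A/G and C/T, 3 swaps A/T and C/G); the function name decode_colorspace is made up for illustration and is not part of any SOLiD tool.

```shell
# Decode a color-space read (leading base followed by colors 0-3).
decode_colorspace() {
  awk -v cs="$1" 'BEGIN {
    # row[b] lists the next base for colors 0,1,2,3 when the current base is b
    row["A"] = "ACGT"; row["C"] = "CATG"; row["G"] = "GTAC"; row["T"] = "TGCA"
    base = substr(cs, 1, 1)   # first character is a known base
    out = base
    for (i = 2; i <= length(cs); i++) {
      color = substr(cs, i, 1)
      base = substr(row[base], color + 1, 1)
      out = out base
    }
    print out
  }'
}

decoded_ok=$(decode_colorspace T10303011)
decoded_miss=$(decode_colorspace T10203011)
echo "$decoded_ok"
echo "$decoded_miss"
```

Changing a single color (3 to 2 at the fourth position) changes every base after it, which is why one spot detection miss can corrupt the downstream sequence.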
19. FASTQ Format
One read
@SRR013343.216 :3:1:837:436 Name
GCGTGGTATAGGAGGCGGAACGGGCGGTTGGCGGTT Sequence
+
I6IIII*II*II+I:+&I)I'&%&%,+0>+'I''$G Quality Score
@SRR013343.217 :3:1:974:526
GCGCATGAGTGGCTTGACTCGTATGCGGATTCCTTC
+
I@II6I<I/III;II+)I*II*DI*I?')+*+8/%8
@SRR013343.218 :3:1:755:341
GTGGAGTAGGTTAGTTGCGGATCGTATGCCGTCTTC
+
IIIIIIIIIIAIIIIII<II6?II3/AD26=:-9I'
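Because every read occupies exactly four lines in this layout, simple line arithmetic is enough to count reads or to pull out sequences. The sketch below writes the three reads from this slide into a temporary file (the file itself is hypothetical) and assumes no read is wrapped over several lines.

```shell
# Temporary FASTQ file holding the three reads shown on this slide.
fastq=$(mktemp)
cat > "$fastq" <<'EOF'
@SRR013343.216 :3:1:837:436
GCGTGGTATAGGAGGCGGAACGGGCGGTTGGCGGTT
+
I6IIII*II*II+I:+&I)I'&%&%,+0>+'I''$G
@SRR013343.217 :3:1:974:526
GCGCATGAGTGGCTTGACTCGTATGCGGATTCCTTC
+
I@II6I<I/III;II+)I*II*DI*I?')+*+8/%8
@SRR013343.218 :3:1:755:341
GTGGAGTAGGTTAGTTGCGGATCGTATGCCGTCTTC
+
IIIIIIIIIIAIIIIII<II6?II3/AD26=:-9I'
EOF

# Number of reads = number of lines / 4
read_count=$(( $(wc -l < "$fastq") / 4 ))
echo "reads: ${read_count}"

# The sequence is the 2nd line of every 4-line record.
first_sequence=$(awk 'NR == 2' "$fastq")
awk 'NR % 4 == 2' "$fastq"

rm -f "$fastq"
```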
20. PHRED quality encoding
$Q = -10 \log_{10} P \iff P = 10^{-Q/10}$
• Q=20: 99% accuracy, Q=30: 99.9% accuracy
• Quality value scale is slightly different between PHRED
and Illumina/SOLiD results
• Encoded in FASTQ and SAM as a quality string, where Q = “ASCII
value - 33”
• For Illumina 1.3+, the offset was changed from 33 to 64
(Q = “ASCII value - 64”).
! 33   ' 39   - 45   3 51   9 57   ? 63   ...
" 34   ( 40   . 46   4 52   : 58   @ 64   ...
# 35   ) 41   / 47   5 53   ; 59   A 65   ...
$ 36   * 42   0 48   6 54   < 60   B 66   ...
% 37   + 43   1 49   7 55   = 61   C 67   ...
& 38   , 44   2 50   8 56   > 62   D 68   ...
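The "ASCII value - 33" rule can be tried directly in the shell. This is a minimal sketch: phred33_scores is a made-up helper name, and it assumes the offset 33 encoding (for offset 64 data, subtract 64 instead).

```shell
# Print the quality value of each character in a quality string (offset 33).
phred33_scores() {
  # od prints each character as an unsigned decimal byte value;
  # awk subtracts the offset and joins the results with spaces.
  printf '%s' "$1" | od -A n -v -t u1 | awk '
    { for (i = 1; i <= NF; i++) printf "%s%d", (count++ ? " " : ""), $i - 33 }
    END { print "" }'
}

# "I" is ASCII 73, "6" is 54, "*" is 42
scores=$(phred33_scores 'I6I*')
echo "$scores"

# Error probability for Q=20, from P = 10^(-Q/10)
p_q20=$(awk 'BEGIN { printf "%.3f", 10 ^ (-20 / 10) }')
echo "$p_q20"
```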
21. Sequencerʼs Output
Sequence Format
Genome Sequence
Mapping Program BWA, Bowtie, etc.
Mapping Result Output Format
Visualization
22. SAM Format
• Sequence Alignment / Map format
• Simple tab-delimited text file
• Standardized alignment output format
• Modern alignment tools support this format
• BAM format is binary version of SAM format.
@HD VN:1.0
@SQ SN:chr20 LN:62435964
@RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
@RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195
AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<
NM:i:1 RG:Z:L1
read_28701_28881_323b 147 chr20 28834 30 35M = 28701 -168
ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA <<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<<
MF:i:18 RG:Z:L2
25. CIGAR
• Shows the alignment result compactly
• 8M9D7M
• 8bp match, 9bp deletion from the reference (the gap is in the read), and then 7bp match
         8M      9D       7M
         CATATGCG---------ATATGGA
         ||||||||         |||||||
GATGCTAAGCATATGCGAGGCATGCCATATGGATG
The 4th column “POS” indicates the position where the alignment starts.
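A CIGAR string together with POS tells where an alignment ends on the reference. The sketch below sums the operations that consume reference bases (M, D, N, = and X in the SAM specification); cigar_ref_length is a made-up helper name, and the values come from the first read of the SAM example (POS 28833, CIGAR 10M1D25M).

```shell
# Sum the lengths of the CIGAR operations that consume reference bases.
cigar_ref_length() {
  awk -v cigar="$1" 'BEGIN {
    len = 0
    # Peel off one "<number><operation>" pair at a time.
    while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
      n = substr(cigar, 1, RLENGTH - 1) + 0
      op = substr(cigar, RLENGTH, 1)
      if (op ~ /[MDN=X]/) len += n   # I, S, H and P do not advance on the reference
      cigar = substr(cigar, RLENGTH + 1)
    }
    print len
  }'
}

ref_len=$(cigar_ref_length 10M1D25M)
end_pos=$(( 28833 + ref_len - 1 ))
echo "covers ${ref_len} reference bases, ends at ${end_pos}"
```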
26. Summary
• No standard tools for analyzing NGS data
• Q&A sites are good resources
• SeqAnswers.com
• biostar.stackexchange.com
• Many algorithms and software tools have been
developed.
• See http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html
• Most of them work on the UNIX command line
• Few analysis tools have a GUI
• Galaxy (Free, require server setup)
• BioScope (Only available with SOLiD sequencer)
27. Unix Commands
Sequencerʼs Output
Sequence Format
Genome Sequence
Mapping Program BWA, Bowtie, etc. (performed with UNIX commands)
Mapping Result Output Format
Visualization
28. Preparation
• An NGS analysis generates many files.
• Even in this lecture, we will generate 50 files.
• We use the directory generated by extracting “ngslec.zip”.
• Extract the zip file in your home directory.
• To move to the directory, type the following commands
in Terminal
$ cd ngslec
$ pwd
/Users/YOUR_DIRECTORY/ngslec/
29. Use “Terminal”
• The Operating System (OS) handles everything that happens on the computer.
• Reading files, mouse clicks, displaying characters, ...
• We can use the OS functions through the application “Terminal” on
a UNIX OS
• Applications > Utilities > Terminal
• UNIX: Linux, IBM AIX, Sun OS, Mac OS X
• Not UNIX: Windows and Mac OS 9 or earlier
• In the terminal, we can use shell commands.
• An application consists of a sequence of shell commands.
• A complicated program is made of a set of tiny programs.
• We start by learning how to use tiny programs, and then how to
combine them.
Kernel Shell Terminal
30. Command and Arguments
$ rm -r arg1 arg2
(A) = rm, (B) = -r, (C) = arg1, (D) = arg2
(A) Command (Order): run a command called “rm”
(B), (C) and (D) Arguments: the command and each argument are
separated by space characters
(B) Arguments that change the behavior of the command are
called “options.” Options start with “-” or “--”
(C) First argument. Options are not counted when we number arguments.
(D) Second argument.
31. Example: date command
• Input “date” + [Return] to show current time
• With option “-u”, the “date” command shows
Coordinated Universal Time (UTC).
• If you misspell a command, the terminal says “command
not found.”
• Commands (and file names) are case sensitive on
UNIX, except on Mac OS X.
32. File System
• You usually use the file system through “Finder.” In this lecture,
we will use it from “Terminal.”
• Tree structure rooted at “/”
• USB memories and DVDs are also managed through the file system.
/
usr Volume
bin lib pics
USB zurich
33. Directories and Files
• Current directory
• Directory in which you are working
• You can check it with the “pwd” command.
• Home directory
• Root (top) of your personal directory
• Denoted by “~” or “$HOME”
• When your current directory is “/Users/sesejun”
• pwd command shows /Users/sesejun
• /usr/lib indicates *
• usr/lib indicates **
• “.” is equal to “/Users/sesejun”
• “..” is equal to “/Users”
• “../../usr/lib” is equal to “/usr/lib”
/
  usr
    bin
    lib        (*)
  Users
    sesejun
      usr
        lib    (**)
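The difference between absolute and relative paths can be tried safely on a throwaway copy of this tree. In the sketch below, the temporary directory stored in root plays the role of “/”; all variable names are made up for illustration.

```shell
# Build a small copy of the tree under a temporary root.
root=$(mktemp -d)
mkdir -p "$root/usr/lib" "$root/Users/sesejun/usr/lib"

# Pretend that this is the home directory and move into it.
cd "$root/Users/sesejun"

# Each lookup runs in a subshell, so the current directory is not changed.
star=$(cd ../../usr/lib && pwd)     # up two levels, then down: directory *
double_star=$(cd usr/lib && pwd)    # relative to the current directory: directory **
parent=$(cd .. && pwd)              # ".." is the parent directory

echo "$star"
echo "$double_star"
echo "$parent"

# Clean up the temporary tree.
cd / && rm -rf "$root"
```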
34. cd: Change Directory
• cd destination-dir
• Change your current directory to destination-dir
• When you omit the argument, you move to your home
directory.
jsmbp:~ sesejun$ pwd
/Users/sesejun
jsmbp:~ sesejun$ cd /usr/
jsmbp:/usr sesejun$ pwd
/usr
jsmbp:/usr sesejun$ cd lib
jsmbp:/usr/lib sesejun$ pwd
/usr/lib
jsmbp:/usr/lib sesejun$ cd /usr/bin/
jsmbp:/usr/bin sesejun$ pwd
/usr/bin
jsmbp:/usr/bin sesejun$ cd
jsmbp:~ sesejun$ pwd
/Users/sesejun
jsmbp:~ sesejun$
35. ls (LiSt): Show List of Files
• Shows the files in the current directory when no arguments are given
• Important options
• -a: Show all files (files starting with “.” do not appear
when we do not set this option)
• -l: Show detailed information about files
• -h: Show file size in human friendly format (usually used
with option “-l”)
$ ls
Desktop Music largefile
$ ls -l
drwx------+ 8 sesejun staff 272 5 16 00:09 Desktop
drwx------+ 3 sesejun staff 102 10 27 2010 Movies
-rw-r--r-- 1 sesejun staff 4181139 5 16 08:20 largefile
$ ls -lh
drwx------+ 8 sesejun staff 272B 5 16 00:09 Desktop
drwx------+ 3 sesejun staff 102B 10 27 2010 Movies
-rw-r--r-- 1 sesejun staff 4.0M 5 16 08:20 largefile
37. mv: Move files
• Also used to change file names
• mv [options] source-file ... directory
• mv [options] old-path new-path
• Change filename text1.txt to text2.txt
$ mv text1.txt text2.txt
• Move text1.txt and text2.txt into tmp directory
$ mv text1.txt text2.txt tmp/
$ ls
tmp
$ ls tmp/
text1.txt text2.txt
38. rm (ReMove): Delete files
• Options:
• -r: Remove a directory and all the files in it
• -i: Confirm before removing each file.
• Delete text1.txt and text2.txt
jsmbp:~ sesejun$ rm text1.txt text2.txt
• Delete all the files within tmp directory
• Note: These files are “really” removed. They never
go to “Trash.” We cannot use undo.
jsmbp:~/test sesejun$ ls
tmp
jsmbp:~/test sesejun$ ls tmp/
text1.txt text2.txt
jsmbp:~/test sesejun$ rm -r tmp/
jsmbp:~/test sesejun$ ls
jsmbp:~/test sesejun$
39. Exercise (1)
• Run commands
• Run date and date -u, and check the results.
• Run command “cal” What is the result?
• Change directory
• Run examples in page “cd”
• Make and remove a directory, and check the result
• Open your login name directory in Finder.
• Move to your home directory in Terminal.
• Just open Terminal.
• Run ls and compare the result with what Finder shows.
40. Note
• Commands and messages in Terminal are written in
“Courier Font”
• Lines starting with “#” are comment lines. You do not
need to type them in Terminal.
• Lines whose last character is “\” continue on the next line.
Enter the multiple lines as one line.
• You can run commands with “cut and paste.”
• When you do that, double quotation (") characters may cause trouble
because of a difference in character types. Retyping the
double quotation marks will solve the problem.
• Bar (|) can be input by Alt + 7.
• In Terminal, you can show the history of your commands by
pressing the up cursor key.
• The “Tab” key may complete your command or filename.
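The comment and line-continuation conventions can be checked with a harmless command. This is only a sketch; the variable name message is made up for illustration.

```shell
# This whole line is a comment, so the shell ignores it.
# The backslash at the end of the first line below joins it with the next
# line, so the shell runs the single command: echo Hello NGS
message=$(echo Hello \
  NGS)
echo "$message"
```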
41. cat (conCATenate)
• cat [options] file ...
• Original usage is file concatenation.
• Show detail later
• Sometimes this command is used to show the inside of a file.
• Options:
• -n: show line number
$ cat text1.txt
How are you ?
$ cat text2.txt
Hello!
Thank you!
Good Bye!
$ cat text1.txt text2.txt
How are you ?
Hello!
Thank you!
Good Bye!
$ cat -n text2.txt
1 Hello!
2 Thank you!
3 Good Bye!
42. head, tail (Show first or last
part of file)
• head [-n num] file ...
• Show first 10 lines
• -n num: show first num lines
• tail [-n num] file ...
• Show last 10 lines
• -n num: show last num lines
• By setting +num, you can see the file from the num-th line to the
last line.
• Because NGS files are large, these commands are frequently
used.
• Most editors cannot open NGS files.
$ cat text2.txt
Hello!
Thank you!
Good Bye!
$ head -n2 text2.txt
Hello!
Thank you!
$ tail -n2 text2.txt
Thank you!
Good Bye!
$ tail -n+3 text2.txt
Good Bye!
43. less
• less <filename>
• Show files interactively
• Space: Next page
• ‘b’: Previous page
• ‘q’: Quit
• ‘/’ + [word]: search for [word] and go to the first matched
place. The word is highlighted.
• To move to the next match, press ‘n’.
• Frequently used to check the contents of a (large) file such as
a FASTA file
44. cut -Show columns-
• cut [options] file ...
• Show selected columns
• Options:
• -f <list of nums>: Show the <list of nums>-th columns. We
can use the -d option to set the separator between columns. The default
separator is “\t” (Tab).
• -c <list of nums>: Show the <list of nums>-th characters.
• Examples of “list of nums”
• 1,3,5: 1st, 3rd and 5th columns
• 1-5: From 1st to 5th columns
• 1,3,5-: 1st, 3rd and from 5th to last columns.
• This command is also frequently used to handle NGS files.
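The options listed above can be tried on a small tab-separated table. The file contents below are made up for illustration and do not come from the lecture data.

```shell
# Hypothetical sample: three columns separated by tabs.
table=$(mktemp)
printf 'chr1\t100\tA\nchr2\t250\tG\nchr3\t75\tT\n' > "$table"

second_column=$(cut -f 2 "$table")         # 2nd column only
first_and_third=$(cut -f 1,3 "$table")     # 1st and 3rd columns
first_three_chars=$(cut -c 1-3 "$table")   # characters 1 to 3 of each line

# With -d, another separator (here a comma) can be used instead of tab.
comma_field=$(printf 'a,b,c\n' | cut -d , -f 2)

echo "$second_column"
echo "$first_and_third"
echo "$first_three_chars"
echo "$comma_field"

rm -f "$table"
```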
45. Sort
• sort [options] file ...
• Arrange file contents in alphabetical order
• Options:
• -r: reverse order
• -n: order by numerical value
• -k POS: order according to the POS-th column. By default, columns
are separated by blank characters (spaces or tabs). We can change
the separator with the “-t” option.
$ cat text2.txt
Hello!
Thank you!
Good bye!
$ sort text2.txt
Good bye!
Hello!
Thank you!
$ sort -r text2.txt
Thank you!
Hello!
Good bye!
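The -n and -k options matter as soon as a file holds numbers. The sketch below uses a made-up three-line, tab-separated file to compare alphabetical order, numerical order, and ordering by the second column.

```shell
# Hypothetical sample: two numeric columns separated by a tab.
nums=$(mktemp)
printf '11.2\t10.9\n13.2\t7.7\n9.4\t8.8\n' > "$nums"

by_text=$(sort "$nums")            # alphabetical: "11.2" and "13.2" come before "9.4"
by_number=$(sort -n "$nums")       # numerical: 9.4 comes first
by_second=$(sort -n -k 2 "$nums")  # numerical order of the 2nd column

echo "$by_text"
echo "$by_number"
echo "$by_second"

rm -f "$nums"
```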
47. Exercise (2)
• Generate two files “test1.txt” and “test2.txt”
• Run cat, head and tail command according to
examples.
• Generate file “nums.txt”
• Character between numbers (columns) is “tab.”
• Test cut and sort commands according to examples.
48. Redirect (>)
• command > file
• Saves the command result into “file.”
• Overwrites the contents of file.
• The first command below saves the result of “sort -n nums.tab”
into “nums_sort.tab”
• command >> file
• Appends the command result to “file.”
$ sort -n nums.tab > nums_sort.tab
$ sort -n nums.tab >> nums_sort.tab
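The difference between overwriting and appending is easy to see by counting lines after each step. This is a sketch on a temporary file; the names are made up for illustration.

```shell
out=$(mktemp)

echo first > "$out"     # ">" creates or overwrites: the file holds 1 line
echo second > "$out"    # overwritten again: still 1 line
after_overwrite=$(( $(wc -l < "$out") ))

echo third >> "$out"    # ">>" appends: the file now holds 2 lines
after_append=$(( $(wc -l < "$out") ))

echo "after >: ${after_overwrite} line, after >>: ${after_append} lines"
last_line=$(tail -n 1 "$out")

rm -f "$out"
```

Running the two sort commands of this slide one after the other therefore leaves nums_sort.tab with the sorted contents twice.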
50. Commands used with pipe
• sort, cut
• less
• wc [options] file...
• Word Count
• Show number of lines, words and characters.
$ sort nums.tab | less
$ wc nums.tab
5 10 45 nums.tab
# The three numbers are the number of lines, words and characters.
$ wc -l nums.tab
5 nums.tab
# Show only the number of lines
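Pipes let these small commands be chained without intermediate files. The sketch below uses a made-up three-line file to find the smallest value in the second column and to count lines.

```shell
# Hypothetical sample: two numeric columns separated by a tab.
nums=$(mktemp)
printf '11.2\t10.9\n13.2\t7.7\n15.2\t7.0\n' > "$nums"

# Take the 2nd column, sort it numerically, keep the first (smallest) value.
smallest=$(cut -f 2 "$nums" | sort -n | head -n 1)

# Count how many lines the cut command produced.
line_count=$(( $(cut -f 2 "$nums" | wc -l) ))

echo "smallest value: ${smallest} (out of ${line_count} lines)"

rm -f "$nums"
```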
51. gzip and bzip2
• Source code and sample datasets are provided as tar and
gzip/bzip2 files.
• gzip/bzip2 alone is used for a single file.
• “tar” can generate a single file containing files and folders.
• gzip/bzip2 can compress a file
• gzip is the most frequently used. A bzip2 file is usually smaller
than the gzip file.
$ ls -lh chr21.fa.gz
-rw-r--r-- 1 sesejun sesejun 12M May 20 15:09 chr21.fa.gz
# Decompress chr21.fa.gz and generate chr21.fa
$ gzip -d chr21.fa.gz
$ ls -lh chr21.fa
-rw-r--r-- 1 sesejun sesejun 47M May 20 15:09 chr21.fa
# Compress with bzip2
$ bzip2 chr21.fa
$ ls -lh chr21.fa.bz2
-rw-r--r-- 1 sesejun sesejun 9.7M May 20 15:09 chr21.fa.bz2
52. tar (Tape ARchive)
• Generate single file containing files and folders.
• Frequently used with gzip/bzip2
• Remember the following idioms!
• We will use this to install programs to analyze NGS data.
with gzip
1. $ gzip -dc file.tar.gz | tar xvf -
2. $ tar zxvf file.tar.gz
with bzip2
1. $ bzip2 -dc file.tar.bz2 | tar xvf -
Older versions of tar have no option to decompress bzip2 (GNU tar accepts “j”, as in tar jxvf file.tar.bz2).
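The idiom can be practiced on a small archive that you build yourself. The sketch below creates a temporary folder, packs it with tar and gzip, and extracts it again with the "gzip -dc ... | tar xf -" form; all file and directory names are made up for illustration.

```shell
# Build a tiny folder to archive.
work=$(mktemp -d)
mkdir -p "$work/sample/data"
echo "ACGT" > "$work/sample/data/seq.txt"

# c: create, z: compress with gzip, f: archive file name.
# -C changes into the given directory before adding "sample".
tar czf "$work/sample.tar.gz" -C "$work" sample

# Extract into a separate directory with the idiom from the slide.
mkdir "$work/extracted"
gzip -dc "$work/sample.tar.gz" | tar xf - -C "$work/extracted"

restored=$(cat "$work/extracted/sample/data/seq.txt")
echo "$restored"

rm -rf "$work"
```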
53. grep (g/re/p)
grep [options] pattern file ...
• Print lines matching pattern
• Options:
• -v: print non-matching lines
• -e <regular expression>: select lines with a regular expression
• Regular expression
• Specific pattern to express a character sequence
• ^: The beginning of line
• $: The end of line
• Supported by most programming languages. Very useful for handling
various formats including DNA/protein sequences.
$ cat nums.tab
11.2
10.9
13.2
7.7
15.2 7.0
9.4
8.8
10.9
9.1
$ grep "7" nums.tab
10.9 7.7
15.2 7.0
$ grep -v "7" nums.tab
11.2
9.4
13.2
10.9
8.8 9.1
$ grep -e "^1" nums.tab
11.2
10.9
13.2
7.7
15.2 7.0
54. Exercise (3)
• Use “TAIR10_chr1.fas”
• A. thaliana chromosome 1 sequence
• Select the annotation lines from the FASTA format.
• FASTA format
• A line starting with “>” is the annotation of a sequence.
• The lines following the annotation contain the
nucleotide or amino acid sequence.
• To select the annotations, select lines starting with “>”
• Count the number of nucleotides in (Multi) FASTA format
• Lines including nucleotides do not start with “>”
• Number of nucleotides = number of characters
• Use the “wc” command
• Note that the end of each line contains a “Return” (newline) character
>gi|29028877|gb|BT005883|U23535
ATGGAAAGCAAAGGAAGAATCCATCCATCTCATCATCATATGAGGCGTCCTCTTCCAGGTCCCGGTGGCTGTATAGCGCA
TCCGGAGACTTTCGGTAATCACGGTGCTATACCACCTTCTGCTGCTCAAGGTGTGTATCCTTCCTTCAACATGTTACCTC
CACCTGAAGTTATGGAGCAAAAGTTTGTGGCACAACACGGGGAATTACAGAGACTTGCTATAGAGAATCAGAGACTTGGT
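One possible way to approach this exercise is sketched below on a tiny made-up FASTA file rather than on TAIR10_chr1.fas. It removes the newline characters with tr before counting, so that wc counts nucleotides only; treat it as a hint, not as the only solution.

```shell
# Hypothetical multi-FASTA file with two short records.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1 hypothetical example record
ATGGAAAGCA
AAGG
>seq2 another hypothetical record
TTGACC
EOF

# Annotation lines start with ">"; grep -c counts the matching lines.
annotation_count=$(( $(grep -c '^>' "$fasta") ))

# Drop annotation lines, delete newlines, then count the characters left.
nucleotide_count=$(( $(grep -v '^>' "$fasta" | tr -d '\n' | wc -c) ))

echo "annotations: ${annotation_count}, nucleotides: ${nucleotide_count}"

rm -f "$fasta"
```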
56. Installing BWA
• In this lecture, we skip this procedure because our computers do not
have the “gcc” command to compile C programs.
• Download BWA
• http://bio-bwa.sourceforge.net/
• bwa-0.5.8c.tar.bz2 is on the USB memory. Copy the file.
• Extract the file
• Move into the BWA directory
• Compile the source programs
• Make the alias name “bwa” for the bwa-0.5.8c directory
# $ curl -O http://switch.dl.sourceforge.net/project/bio-bwa/bwa-0.5.8c.tar.bz2
# $ bzip2 -dc bwa-0.5.8c.tar.bz2 | tar xvf -
# ...filenames...
# $ ln -s bwa-0.5.8c bwa # Simplify the directory name
# $ cd bwa
# $ make
# ...compile messages...
# $ cd .. # back to working directory
57. Prepare A.thaliana Genome
• Download chromosomes from TAIR site
• http://www.arabidopsis.org/
• Find URLs by selecting “Download” tab > Sequences >
whole_chromosomes
• Each file includes one chromosome on current version.
• TAIR10_chr1.fas, TAIR10_chr2.fas, TAIR10_chr3.fas,
TAIR10_chr4.fas, TAIR10_chr5.fas, TAIR10_chrC.fas,
TAIR10_chrM.fas
• Because of limited server and network capacity, we distribute
these files via USB or the web site for this lecture.
• Concatenate these chromosomes, except the chloroplast and
mitochondria, into a single file
58. # We skip this process
#$ curl -O "ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/TAIR10_chr[1-5].fas"
## [1-5] means consecutive numbers from 1 to 5.
## We do not use the chloroplast and mitochondria genomes.
# Instead of the download, we use the files on the USB memory.
# The files are in your working directory.
# Check it with the command below.
$ ls TAIR10*
TAIR10_chr1.fas TAIR10_chr3.fas TAIR10_chr5.fas
TAIR10_chr2.fas TAIR10_chr4.fas
# Concatenate all chromosomes into a single file
$ cat TAIR10_chr1.fas TAIR10_chr2.fas TAIR10_chr3.fas \
TAIR10_chr4.fas TAIR10_chr5.fas > TAIR10_chr_all.fas
# Check the result
$ grep -e "^>" TAIR10_chr_all.fas
>Chr1 CHROMOSOME dumped from ADB: Jun/20/09 14:53; last updated: 2009-02-02
>Chr2...
# You can find 5 chromosomes’ annotations
59. Run BWA
• Make an index of the genome sequence
• For SOLiD reads, the “-c” option is required.
• This process is needed just once as long as you use the same
genome (it does not depend on read sequences).
• Convert the reads’ color space into the BWA-specific format
• You don’t need this process for Illumina reads.
• Illumina sequencers produce FASTQ format files, and most
alignment software can handle them directly.
• Map reads against the genome sequence
• If you use Illumina, the -I option may be required. Check your
Illumina version.
• The above two processes may take a long time. This lecture’s toy data
is 1/100 scale. Real data will require more than two hours.
$ ./bwa/bwa index -c TAIR10_chr_all.fas
# running messages. Takes more than 3 mins.
$ python csfasta2fastq.py --bwa tha_reads > tha_reads.bwa
$ ./bwa/bwa aln -c TAIR10_chr_all.fas tha_reads.bwa > tha_reads.sai
# messages...about 1min. Alignment phase.
60. Run BWA (continued)
• Convert the mapping result into SAM format.
• For a paired-end experiment, you have to use “sampe” instead of
“samse” to put mate pair information into the SAM format.
• That’s all! Check the contents of the SAM file with the less command.
• How many reads can be mapped against the genome?
$ ./bwa/bwa samse TAIR10_chr_all.fas tha_reads.sai tha_reads.bwa > \
tha_reads.sam
# messages. Generate summary of alignment.
# If you have paired ended reads, you can use sampe instead of samse.
$ less tha_reads.sam
# Press “q” to quit less command.
# Next page is “space”
61. Inside of SAM file
# Chromosome (mapped database) information
@SQ SN:Chr1 LN:30427671
@SQ SN:Chr2 LN:19698289
@SQ SN:Chr3 LN:23459830
@SQ SN:Chr4 LN:18585056
@SQ SN:Chr5 LN:26975502
# Used program and its variables
@PG ID:bwa PN:bwa VN:0.5.9-r16
# Mapped read in forward direction on Chr5
SRR038985.100 0 Chr5 22828962 37 33M * 0 0 GCCGGTGATGTAATCAAAATATTTGCTACTCTT WZYTWWTW]YVUOW]OEKNUUX]PJSRY][63 XT:A:U CM:i:0 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:33
SRR038985.200 0 Chr3 14197678 0 33M * 0 0 ACCTGGTTGATCCTGCCAGTAGTCATATGCTTG X]]KN]]YWUX]XIKYRCHSUYX[[SNQJL[MO XT:A:R CM:i:0 X0:i:2 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:33 XA:Z:Chr2,+3707,33M,0;
# Unmapped read
SRR038985.300 4 * 0 0 * * 0 0 AAACTGCGGGGTCTCACTTTTTTGGGTTTGGGGT 124,/08/5&6-&,(;/4+%7,+5.:1',*;8:&
62. Exercise (4)
• Run BWA
• Compare the file size of the csfasta + qual files with the generated SAM file.
• Which is larger? How much disk space do we need for the analysis?
• Check the details of the SAM file
• Format details are described in http://samtools.sourceforge.net/SAM1.pdf
• How many reads are mapped onto chromosomes?
• Select lines containing “Chr” # use grep
• Then, count the number of lines # use wc
• Calculate the ratio of mapped reads to total reads.
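One possible way to do the counting is sketched below on a tiny made-up SAM file that is trimmed to its first six columns, not on tha_reads.sam. It skips the header lines first, because “@SQ” lines also contain “Chr”, and it looks only at the 3rd column (the reference name), because optional tags such as XA can contain “Chr” too.

```shell
# Hypothetical, trimmed SAM file: one header line and three reads.
sam=$(mktemp)
printf '@SQ\tSN:Chr1\tLN:30427671\n' > "$sam"
printf 'read1\t0\tChr5\t22828962\t37\t33M\n' >> "$sam"
printf 'read2\t0\tChr3\t14197678\t0\t33M\n' >> "$sam"
printf 'read3\t4\t*\t0\t0\t*\n' >> "$sam"

# Total reads: all lines that are not header lines.
total_reads=$(( $(grep -v '^@' "$sam" | wc -l) ))

# Mapped reads: the 3rd column names a chromosome instead of "*".
mapped_reads=$(( $(grep -v '^@' "$sam" | cut -f 3 | grep -c 'Chr') ))

# Integer percentage of mapped reads.
percent=$(( 100 * mapped_reads / total_reads ))
echo "${mapped_reads} of ${total_reads} reads mapped (${percent}%)"

rm -f "$sam"
```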
63. Problems
• The mapped read ratio may be much lower than expected.
• Genome quality is (probably) high.
• Various problems
• Wet problems
• Protocols and reagents
• Mitochondria and chloroplast.
• Dry problems
• We used all sequences. We may need to remove low
quality reads.
• Sequence quality at the 3’-end is low. We might trim these
sequences.
• We did not take reads on splice junctions into account.
• We did not change any parameters in BWA. The
parameters might not be suitable for our reads.
• No one has a solution that works in every case.
• Note!!! The mapped ratio of current RNA-Seq reads is (extremely)
higher than this result.