20110524zurichngs 1st pub


  1. 1. Next Generation Sequencing for Model and Non-Model Organism. 1st day. Jun Sese and Kentaro Shimizu, sesejun@cs.titech.ac.jp. Ph.D. course @ Univ. of Zurich, 25/05/2011
  2. 2. Today’s Menu
     • Lecture
       • Overview of next-generation sequencing (NGS) analysis
       • Mapping: sequence alignment
       • Introduction to UNIX for handling NGS data
     • Exercise
       • UNIX commands
       • Mapping real short reads against genomes
       • Computing statistics of the mapped reads
  3. 3. Various Types of Sequencers
     • Roche 454, Ion Torrent
       • Roche: about 400 bp; Ion Torrent: about 200 bp
       • Suitable for de novo sequencing
     • Illumina HiSeq
       • Widely used next-generation sequencer
       • 100 bp x 2, up to 600 Gb/run (HiSeq 2000)
       • MiSeq uses almost the same technology; the main difference is the number of reads
     • ABI SOLiD
       • 75 bp, 75 bp + 35 bp, or 60 bp x 2, up to 300 Gb/run (5500xl SOLiD)
       • Color space
     • Pacific Biosciences PacBio RS
       • Average read length > 500 bp
       • Sequence quality is not high.
  4. 4. Sequencing cost has dropped dramatically. Lincoln Stein, Genome Biology, vol. 11(5), 2010
  5. 5. How large is it?
     • A single run generates more than 300 GB of files.
     • A hard disk can be read at about 100 MB/sec.
     • 300 GB / (100 MB/sec) = 300,000 MB / (100 MB/sec) = 3,000 sec = 50 min
     • Just reading the data from the HDD takes the computer 50 min!
       • Efficient computation is required.
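The estimate above is plain unit arithmetic. A minimal sketch, assuming decimal units (1 GB = 1000 MB) as the slide does; the function name `read_time_seconds` is made up for illustration:

```python
def read_time_seconds(size_gb, throughput_mb_per_s):
    """Time needed to stream a file once from disk (1 GB = 1000 MB)."""
    return size_gb * 1000 / throughput_mb_per_s


seconds = read_time_seconds(300, 100)
print(seconds, "sec =", seconds / 60, "min")
```

Changing the two arguments gives a quick feeling for how a faster disk or a larger run shifts the lower bound of any analysis that has to touch every byte.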
  6. 6. Applications of DNA Sequencing
     • NGS simply reads an enormous number of short sequences, but it has many biological applications.
       • Genetic variation
       • Gene regulation
       • RNA-seq
       • ChIP-seq
       • Epigenetics
       • Population genetics
     (Figure: Science 2007)
  7. 7. Sequencer's Output + Genome Sequence → Mapping Program → Mapping Result → Visualization, Further Analysis (SNPs, RNA-Seq, ...)
  8. 8. Major Pipelines of NGS
     • Most applications follow a similar procedure.

                                            Genetic variation   RNA-Seq               ChIP-Seq
       Find originated region (alignment)   Map                 Map                   Map
       Filter                               SNP call            Measure expressions   Check regulatory regions
       Analysis                             Find difference     Same as microarray    Same as ChIP-chip analysis

     • Most of them require a whole-genome sequence to map the reads against.
  9. 9. Mapping (Pairwise Alignment)
     • Find the place in the reference from which each read comes.
     • BLAST is one of the best-known alignment programs.
       • Few NGS analyses use BLAST/BLAT because they align too slowly.
       • BWA and Bowtie are commonly used to map short reads.
     • Example:
         Reads      ATATGCGA, ATATGCGA
         Reference  GATGCTAAGCATATGCGAGGCATGCCATATGGATG
     • We may find multiple mapped places. A score matrix (distance) defines which mapping is better.
         Reads      ATATGCGA, ATATGCGA (x marks a mismatch), ATATG-CGA (with a gap)
         Reference  GATGCTAAGCAAATGCGAGGCATGCCATATGGCGA
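To make the idea of mapping concrete, here is a naive sketch that slides a read along the reference and reports every position with at most a given number of mismatches. It is only an illustration: it ignores gaps and is far too slow for real NGS data, where BWA and Bowtie use indexed search instead. The function names are hypothetical; the sequences are the ones shown on the slide.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, max_mismatches=1):
    """Return (1-based position, mismatches) for every acceptable placement."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        distance = hamming(read, window)
        if distance <= max_mismatches:
            hits.append((start + 1, distance))
    return hits


read = "ATATGCGA"
reference = "GATGCTAAGCATATGCGAGGCATGCCATATGGATG"
variant_reference = "GATGCTAAGCAAATGCGAGGCATGCCATATGGCGA"

print(map_read(read, reference, max_mismatches=0))
print(map_read(read, variant_reference, max_mismatches=1))
```

With zero mismatches allowed, the read has a single placement in the first reference. Against the second reference it only maps when one mismatch is tolerated, which is why the scoring scheme decides what counts as the best hit.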
  10. 10. 10
  11. 11. For non-model organism
      [Diagram: pipelines for genetic variation, ChIP-Seq and RNA-Seq when no reference genome exists. Genome/gene sequence step: read the genome and assemble it (genetic variation, ChIP-Seq), or read a normalized library and assemble the RNA, or map onto the genome of a related species (RNA-Seq). Map step: map new reads, ChIP-Seq reads or RNA-Seq reads onto the assembly and count the reads. Filter step: SNP call, check regulatory regions, measure expressions. Analysis step: find difference, similar to ChIP-chip, same as microarray.]
      • Most cases require genome assembly, which is costly both experimentally and computationally.
  12. 12. Very Short History of Pairwise Alignment Programs
      • More than 100 alignment programs are listed in Wikipedia!
        • http://en.wikipedia.org/wiki/Sequence_alignment_software
      • 1 sequence vs. 1 sequence
        • Ssearch, FASTA [Lipman and Pearson, 1985]
      • 1 sequence vs. whole genes
        • BLAST [Altschul et al., 1990]
      • Thousands of sequences vs. whole genes or whole genomes
        • BLAT [Kent, 2002]
      • Billions of short sequences vs. whole genome
        • BWA, Bowtie, SHRiMP, etc.
        • Most modern mappers use the FM-index [Ferragina and Manzini, 2000] together with the Burrows-Wheeler transform [Burrows and Wheeler, 1994].
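As a rough illustration of the Burrows-Wheeler transform mentioned on this slide, the sketch below builds it by sorting all rotations of the text and inverts it again. This is the textbook construction, not how BWA or Bowtie implement it (they build the transform from a suffix array and add the FM-index tables on top), and it assumes the text does not contain the end marker "$".

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (naive, quadratic memory)."""
    text = text + "$"  # unique end marker, sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column):
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith("$"))
    return original[:-1]


transformed = bwt("GATTACA")
print(transformed)
print(inverse_bwt(transformed))
```

The transform is reversible and tends to group identical characters together, which is what makes a compact, searchable index of a whole genome possible.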
  13. 13. Why have so many alignment programs been developed?
      • To computer scientists, alignment looks like an easy task.
        • Both indexing and dynamic programming, the techniques used in sequence alignment, are basic algorithms.
        • It makes a good homework problem.
        • A little performance tuning can speed up execution dramatically.
      • In reality, the alignment problem is very hard to solve.
        • Mutations, insertions, deletions...
        • Each sequencer has its own bias.
          • Sequence length, homopolymers in Roche 454...
        • Biology is full of heuristics!
          • The GT-AG rule at splice sites holds, but not always...
        • That is, the problem definition is ambiguous!
  14. 14. Alignment performance varies
      • 12 million single-end reads were aligned against the human genome sequence (hg18).
      • Differences in algorithm and implementation show up in the total processing time.
        • In most programs, memory usage depends on the genome size.
      • Parameter settings affect the number of mapped reads.
        • The authors did not mention them.
        • In real experiments, we have to tune the parameters of the alignment program.
      Bao et al., J Hum Genet, 2011
  15. 15. Sequencer's Output (Sequence Format) + Genome Sequence → Mapping Program (BWA, Bowtie, etc.) → Mapping Result → Visualization
  16. 16. Sequence File Format (1)
      • FASTA + quality file
        • Used by Roche 454
      FASTA file:
        >1ST_SEQ length=67 xy=1264_0441 region=1 run=R_2010_07_07_16_23_16_
        GCGTTGTGTATGTCTCCTTTGGTATGTCAGGTTTCGTCAGAAGCTTCTATCAAACGGCGCACAGTGA
        >2ND_SEQ length=88 xy=1264_0564 region=1 run=R_2010_07_07_16_23_16_
        TCGGCCCTATCCGAGAAGGCGTGGTGTATCTCTCTTCTGGTATGCCACGTTACGCAGCAGCTTCTTCCCAAGACACAGAGCGAGTAAG
      Quality file:
        >1ST_SEQ length=67 xy=1264_0441 region=1 run=R_2010_07_07_16_23_16_
        37 35 35 35 35 35 37 37 37 37 37 39 39 37 36 35 35 36 37 37 37 37 35 35 32 28 27 27 27 27
        29 23 21 21 14 14 12 18 19 19 19 19 19 19 16 16 17 20 22 20 12 12 12 12 11 17 17 17 16 19
        22 23 24 21 21 21 18
        >2ND_SEQ length=88 xy=1264_0564 region=1 run=R_2010_07_07_16_23_16_
        29 30 19 19 19 20 19 24 28 27 27 27 27 27 30 19 19 20 20 20 24 33 33 33 33 33 33 33 35 35
        37 37 30 30 30 30 32 32 32 32 35 32 32 32 32 33 33 33 33 20 20 20 23 27 30 30 31 31 27 27
        27 27 28 23 24 24 23 23 23 24 24 21 17 19 19 18 27 18 17 16 16 16 17 13 18 17 16 12
  17. 17. Sequence File Format (2)
      • FASTQ
        • Used by Illumina sequencers
        • Sequence database sites (SRA (Sequence Read Archive), ENA (European Nucleotide Archive), DRA (DDBJ Sequence Read Archive)) provide sequences in this format.
        • De facto standard
      • CSFasta + quality file
        • Only used by SOLiD sequencers
        • Similar to a FASTA file, except that sequences are described in color space.
      CSFasta file:
        >SRR038985.100 VAB_AT1deg1_51_269_F3
        T10303011231130321000333001323122221
        >SRR038985.200 VAB_AT1deg1_78_430_F3
        T03102101012320213012132121333132011
      Quality file:
        >SRR038985.100 VAB_AT1deg1_51_269_F3
        0 20 23 21 26 20 21 23 21 20 24 25 26 20 23 19 17 27 26 10 16 16 19 23
        19 26 28 9 22 18 21 25 25 23 2 20
        >SRR038985.200 VAB_AT1deg1_78_430_F3
        0 7 19 26 26 24 8 27 29 23 23 21 21 24 26 19 11 21 25 14 10 19 21 21
        25 20 28 20 20 15 23 8 25 23 11 25
  18. 18. Color Space
      • A format unique to ABI SOLiD.
      • Each number (color) represents two adjacent bases.
      • Each nucleotide is read twice.
      • A single missed spot detection may change the whole downstream sequence.
      • Some software does not support this format.
      • Example: T10303011 corresponds to TGGCCGGTG, while T10203011 (one color changed) corresponds to TGGAATTGT.
      • Excerpt shown from the ABI white paper: the SOLiD System employs ligation-based chemistry with di-base labelled probes; each base is defined twice (double interrogation), which gives a built-in error check that distinguishes measurement errors from true polymorphisms and helps detect adjacent SNPs, insertions, deletions and structural variations; the 2-base color code scheme is based on the Klein four-group, the symmetry group of a rectangle.
      Source: ABI White Paper, "Color Space Analysis in the SOLiD System: the Theory, Advantages and Solutions" (Figure 1: SOLiD Color Space Code)
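The two decoded sequences on this slide can be reproduced with a few lines of code. The sketch below assumes the standard SOLiD two-base encoding, in which a color equals the XOR of the 2-bit codes of neighboring bases (A=0, C=1, G=2, T=3); the function name is illustrative. The leading letter is the known primer base that anchors the decoding, and it is kept in the output to match the slide.

```python
BASES = "ACGT"  # the index doubles as the 2-bit code: A=0, C=1, G=2, T=3


def decode_colorspace(read):
    """Translate a color-space read such as 'T10303011' into bases."""
    previous = BASES.index(read[0])
    decoded = [read[0]]
    for color in read[1:]:
        previous ^= int(color)  # color = code(previous base) XOR code(next base)
        decoded.append(BASES[previous])
    return "".join(decoded)


print(decode_colorspace("T10303011"))
print(decode_colorspace("T10203011"))
```

Because every base depends on all colors before it, changing one color (the 3 to a 2 in the second read) alters every base downstream, which is the effect the slide warns about.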
  19. 19. FASTQ Format
      • One read consists of four lines:
          Name:           @SRR013343.216 :3:1:837:436
          Sequence:       GCGTGGTATAGGAGGCGGAACGGGCGGTTGGCGGTT
                          +
          Quality score:  I6IIII*II*II+I:+&I)I&%&%,+0>+I$G
      • Further reads:
          @SRR013343.217 :3:1:974:526
          GCGCATGAGTGGCTTGACTCGTATGCGGATTCCTTC
          +
          I@II6I<I/III;II+)I*II*DI*I?)+*+8/%8
          @SRR013343.218 :3:1:755:341
          GTGGAGTAGGTTAGTTGCGGATCGTATGCCGTCTTC
          +
          IIIIIIIIIIAIIIIII<II6?II3/AD26=:-9I
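A FASTQ file laid out as on this slide can be read four lines at a time. This is a minimal sketch that assumes every record occupies exactly four lines (name, sequence, "+", quality), which holds for typical Illumina output but not for every FASTQ variant; the records in the example are invented toy data.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], sequence, quality


toy_file = io.StringIO("@read1 toy\nACGT\n+\nIIII\n@read2\nGG\n+\n!!\n")
for name, sequence, quality in read_fastq(toy_file):
    print(name, sequence, quality)
```

Reading lazily with a generator matters for NGS data: the file is never loaded into memory as a whole.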
  20. 20. PHRED quality encoding
      $Q = -10 \log_{10} P \Leftrightarrow P = 10^{-Q/10}$
      • Q=20: 99% accuracy; Q=30: 99.9% accuracy
        • The quality value scale differs slightly between PHRED and Illumina/SOLiD results.
      • Encoded in FASTQ and SAM as a quality string, where Q = ASCII value - 33.
      • For Illumina 1.3+, the offset was changed to 64 (Q = ASCII value - 64).
      ASCII codes:
        !  33    '  39    -  45    3  51    9  57    ?  63   ...
        "  34    (  40    .  46    4  52    :  58    @  64   ...
        #  35    )  41    /  47    5  53    ;  59    A  65   ...
        $  36    *  42    0  48    6  54    <  60    B  66   ...
        %  37    +  43    1  49    7  55    =  61    C  67   ...
        &  38    ,  44    2  50    8  56    >  62    D  68   ...
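The formula and the ASCII offsets above translate directly into code. A small sketch with illustrative function names; the offset is a parameter because it depends on the sequencer and pipeline version (33 or 64, as described on the slide).

```python
def phred_to_error(q):
    """Error probability P for a PHRED score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Convert a FASTQ/SAM quality string into a list of PHRED scores."""
    return [ord(character) - offset for character in quality_string]


print(decode_quality("I5+!"))            # offset 33
print(decode_quality("hT", offset=64))   # offset 64
print(phred_to_error(20), phred_to_error(30))
```

Decoding a file with the wrong offset shifts every score by 31, so checking which characters occur in the quality strings is a useful first sanity test.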
  21. 21. Sequencer's Output (Sequence Format) + Genome Sequence → Mapping Program (BWA, Bowtie, etc.) → Mapping Result (Output Format) → Visualization
  22. 22. SAM Format
      • Sequence Alignment/Map format
      • Simple tab-delimited text file
      • Standardized alignment output format
        • Modern alignment tools support this format.
      • BAM format is the binary version of SAM format.
      Example (fields are separated by tabs):
        @HD VN:1.0
        @SQ SN:chr20 LN:62435964
        @RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
        @RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
        read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195 AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< NM:i:1 RG:Z:L1
        read_28701_28881_323b 147 chr20 28834 30 35M = 28701 -168 ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA <<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<< MF:i:18 RG:Z:L2
  23. 23. Overview
      <QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> [<TAG>:<VTYPE>:<VALUE> [...]]
      read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195 AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< NM:i:1 RG:Z:L1
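Since SAM is tab-delimited, one alignment line can be split into the eleven mandatory fields listed above plus optional tags. The sketch below uses the field names from this slide and the example record; `parse_sam_line` is a made-up helper, not part of samtools or any library.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INTEGER_FIELDS = ("FLAG", "POS", "MAPQ", "MPOS", "ISIZE")


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of named fields."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for name in INTEGER_FIELDS:
        record[name] = int(record[name])
    record["TAGS"] = columns[11:]  # optional TAG:VTYPE:VALUE entries
    return record


example_line = "\t".join([
    "read_28833_29006_6945", "99", "chr20", "28833", "20", "10M1D25M",
    "=", "28993", "195", "AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG",
    "<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<", "NM:i:1", "RG:Z:L1",
])
record = parse_sam_line(example_line)
print(record["RNAME"], record["POS"], record["CIGAR"], record["TAGS"])
```

Header lines (those starting with "@") have a different layout and should be skipped before calling a parser like this.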
  24. 24. Flag
      • Bitwise notation: computer-friendly (and not human-friendly :)
      • 16 = 0x0010: mapped to the reverse strand
      • 4 = 0x0004: unmapped
      • 0 = 0x0000: mapped to the forward strand
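Because the flag is a sum of bits, it is easiest to read by testing each bit in turn. The sketch below covers the lower eight bits defined in the SAM specification; the short names in the table are informal labels chosen for this example.

```python
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def decode_flag(flag):
    """List the names of all bits that are set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


for flag in (0, 4, 16, 99, 147):
    print(flag, decode_flag(flag))
```

Flags 99 and 147 are the values of the two paired reads in the SAM example shown earlier. A flag of 0 sets no bit at all, which means a single-end read mapped to the forward strand.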
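  25. 25. CIGAR
      • A compact description of the alignment result
      • 8M9D7M
        • 8 bp match, 9 bp deletion (the read lacks 9 reference bases, shown as a gap in the read), and then 7 bp match
                             8M       9D       7M
            Read             CATATGCG---------ATATGGA
                             ||||||||         |||||||
            Reference  GATGCTAAGCATATGCGAGGCATGCCATATGGATG
      • The 4th column, "POS", gives the reference position where the alignment starts.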
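A CIGAR string is a sequence of length and operation pairs, so a regular expression is enough to take it apart. The sketch below follows the operation letters of the SAM specification (M, I, D, N, S, H, P, =, X); the helper names are hypothetical. It uses the CIGAR of the SAM example record, 10M1D25M.

```python
import re

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OPERATION.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered: M, D, N, = and X consume the reference."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar):
    """Number of read bases described: M, I, S, = and X consume the read."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS=X")


print(parse_cigar("10M1D25M"))
print(reference_span("10M1D25M"), read_length("10M1D25M"))
```

Adding the reference span to POS gives the end of the alignment on the reference, which is what tools need when they compute coverage.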
  26. 26. Summary
      • There are no standard tools for analyzing NGS data.
        • Q&A sites are good resources.
          • SeqAnswers.com
          • biostar.stackexchange.com
      • Many algorithms and software packages have been developed.
        • See http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html
      • Most of them work on the UNIX command line.
      • Few analysis tools have a GUI.
        • Galaxy (free, requires server setup)
        • BioScope (only available with a SOLiD sequencer)
  27. 27. Unix Commands
      Sequencer's Output (Sequence Format) + Genome Sequence → Mapping Program (BWA, Bowtie, etc.) → Mapping Result (Output Format) → Visualization
      • Performed with UNIX commands
  28. 28. Preparation
      • An NGS analysis generates many files.
        • Even in this lecture, we will generate 50 files.
      • We use the directory created by extracting "ngslec.zip".
        • Extract the zip file in your home directory.
      • To move to the directory, type the following commands in Terminal:
          $ cd ngslec
          $ pwd
          /Users/YOUR_DIRECTORY/ngslec/
  29. 29. Use "Terminal"
      • The operating system (OS) handles everything that happens on the computer.
        • Reading files, mouse clicks, displaying characters, ...
      • On a UNIX OS, we can use the OS functions through the application "Terminal".
        • Applications > Utilities > Terminal
        • UNIX: Linux, IBM AIX, Sun OS, Mac OS X
          • Not Windows, and not Mac OS 9 or earlier
      • In the terminal, we can use shell commands.
      • Applications consist of sequences of shell commands.
        • A complicated program is made of a set of tiny programs.
        • We first learn how to use the tiny programs, and then how to combine them.
      (Diagram: Kernel, Shell, Terminal)
  30. 30. Command and Arguments
          $ rm  -r  arg1 arg2
           (A)  (B) (C)  (D)
      • (A) Command (order): run a command called "rm".
      • (B), (C) and (D) Arguments: separated by space characters, both between the command and the arguments and between arguments.
      • (B) Arguments that change sub-functions of the command are called "options". Options start with "-" or "--".
      • (C) First argument. Options are not counted when numbering arguments.
      • (D) Second argument.
  31. 31. Example: date command
      • Type "date" and press [Return] to show the current time.
      • With the option "-u", the "date" command shows Coordinated Universal Time.
      • If you misspell a command, the terminal says "command not found."
      • Commands (and file names) are case sensitive on UNIX, except on Mac OS X.
  32. 32. File System
      • You normally use the file system through "Finder". In this lecture, we will use it from "Terminal".
      • It is a tree structure rooted at "/".
      • USB memories and DVDs are also managed through the file system.
      (Diagram: directory tree with the nodes /, usr, Volume, bin, lib, pics, USB, zurich)
  33. 33. Directories and Files
      • Current directory
        • The directory in which you are working
        • You can check it with the "pwd" command.
      • Home directory
        • Root (top) of your personal directory
        • Denoted by "~" or "$HOME"
      • When your current directory is "/Users/sesejun":
        • The pwd command shows /Users/sesejun
        • /usr/lib indicates the lib directory under the top-level usr (marked * in the diagram)
        • usr/lib indicates /Users/sesejun/usr/lib (marked ** in the diagram)
        • "." is equal to "/Users/sesejun"
        • ".." is equal to "/Users"
        • "../../usr/lib" is equal to "/usr/lib"
      (Diagram: / contains usr and Users; usr contains bin and lib (*); Users contains sesejun, which contains usr/lib (**))
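The rules for absolute and relative paths listed above can be checked without touching the real file system. This sketch uses Python's standard `posixpath` module to resolve each path against the current directory from the slide; `resolve` is an illustrative helper name.

```python
import posixpath


def resolve(current_directory, path):
    """Return the absolute path that `path` refers to from `current_directory`."""
    return posixpath.normpath(posixpath.join(current_directory, path))


current = "/Users/sesejun"
for path in ["/usr/lib", "usr/lib", ".", "..", "../../usr/lib"]:
    print(path, "->", resolve(current, path))
```

A path starting with "/" is absolute and ignores the current directory; everything else is interpreted relative to it, with ".." stepping one level up.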
  34. 34. cd: Change Directory
      • cd destination-dir
        • Move your current directory to destination-dir.
        • When you omit the argument, move to your home directory.
          jsmbp:~ sesejun$ pwd
          /Users/sesejun
          jsmbp:~ sesejun$ cd /usr/
          jsmbp:/usr sesejun$ pwd
          /usr
          jsmbp:/usr sesejun$ cd lib
          jsmbp:/usr/lib sesejun$ pwd
          /usr/lib
          jsmbp:/usr/lib sesejun$ cd /usr/bin/
          jsmbp:/usr/bin sesejun$ pwd
          /usr/bin
          jsmbp:/usr/bin sesejun$ cd
          jsmbp:~ sesejun$ pwd
          /Users/sesejun
          jsmbp:~ sesejun$
  35. 35. ls (LiSt): Show List of Files
      • Shows the files in the current directory when no arguments are given
      • Important options
        • -a: Show all files (files whose names start with "." do not appear without this option)
        • -l: Show detailed information about files
        • -h: Show file sizes in a human-friendly format (usually used with "-l")
          $ ls
          Desktop Music largefile
          $ ls -l
          drwx------+ 8 sesejun staff 272 5 16 00:09 Desktop
          drwx------+ 3 sesejun staff 102 10 27 2010 Movies
          -rw-r--r-- 1 sesejun staff 4181139 5 16 08:20 largefile
          $ ls -lh
          drwx------+ 8 sesejun staff 272B 5 16 00:09 Desktop
          drwx------+ 3 sesejun staff 102B 10 27 2010 Movies
          -rw-r--r-- 1 sesejun staff 4.0M 5 16 08:20 largefile
  36. 36. cp: Copy Files
      • cp [options] source-file ... directory
      • cp [options] source-file new-file
      • Copy text1.txt to text2.txt
          $ cp text1.txt text2.txt
      • Copy text1.txt and text2.txt into the "tmp" directory
          $ cp text1.txt text2.txt tmp/
          $ ls tmp
          text1.txt text2.txt
  37. 37. mv: Move files
      • Also used to change file names
      • mv [options] source-file ... directory
      • mv [options] old-path new-path
      • Change the file name text1.txt to text2.txt
          $ mv text1.txt text2.txt
      • Move text1.txt and text2.txt into the tmp directory
          $ mv text1.txt text2.txt tmp/
          $ ls
          tmp
          $ ls tmp/
          text1.txt text2.txt
  38. 38. rm (ReMove): Delete files
      • Options:
        • -r: Remove a directory together with all the files in it
        • -i: Ask for confirmation before removing each file
      • Delete text1.txt and text2.txt
          jsmbp:~ sesejun$ rm text1.txt text2.txt
      • Delete the tmp directory and all the files within it
        • Note: These files are really removed. They never go to the "Trash", and there is no undo.
          jsmbp:~/test sesejun$ ls
          tmp
          jsmbp:~/test sesejun$ ls tmp/
          text1.txt text2.txt
          jsmbp:~/test sesejun$ rm -r tmp/
          jsmbp:~/test sesejun$ ls
          jsmbp:~/test sesejun$
  39. 39. Exercise (1)
      • Run commands
        • Run date and date -u, and check the results.
        • Run the command "cal". What is the result?
      • Change directory
        • Run the examples on the "cd" page.
      • Check make and remove directory
        • Open the directory named after your login name in Finder.
        • Move to your home directory in Terminal.
          • Just open Terminal.
        • Run ls and compare the result with what Finder shows.
  40. 40. Note
      • Commands and messages in Terminal are written in Courier font.
        • Lines starting with "#" are comment lines. You do not need to type them in Terminal.
        • Lines whose last character is "\" continue on the next line. Enter the multiple lines as one line.
      • You can run commands by cut and paste.
        • When you do, the double quotation mark (") may cause trouble because the pasted character can be a different (typographic) quote. Retyping the double quotation marks solves the problem.
      • The bar (|) can be typed with Alt + 7.
      • In Terminal, you can recall earlier commands by pressing the up cursor key.
      • The Tab key may complete your command or file name.
  41. 41. cat (conCATenate)
      • cat [options] file ...
        • Its original use is file concatenation.
          • Details are shown later.
        • It is sometimes used to show the contents of a file.
      • Options:
        • -n: show line numbers
          $ cat text1.txt
          How are you ?
          $ cat text2.txt
          Hello!
          Thank you!
          Good Bye!
          $ cat text1.txt text2.txt
          How are you ?
          Hello!
          Thank you!
          Good Bye!
          $ cat -n text2.txt
               1 Hello!
               2 Thank you!
               3 Good Bye!
  42. 42. head, tail (Show first or last part of file)
      • head [-n num] file ...
        • Show the first 10 lines
        • -n num: show the first num lines
      • tail [-n num] file ...
        • Show the last 10 lines
        • -n num: show the last num lines
          • By setting +num, you can see the file from the num-th line to the last line.
      • Because NGS files are large, these commands are used frequently.
        • Most editors cannot open NGS files.
          $ cat text2.txt
          Hello!
          Thank you!
          Good Bye!
          $ head -n2 text2.txt
          Hello!
          Thank you!
          $ tail -n2 text2.txt
          Thank you!
          Good Bye!
          $ tail -n+3 text2.txt
          Good Bye!
  43. 43. less
      • less <filename>
      • Show files interactively
        • Space: next page
        • 'b': previous page
        • 'q': quit
        • '/' + [word]: search for [word] and jump to the first match. The word is highlighted.
          • To move to the next match, press 'n'.
      • Frequently used to check the contents of (large) files such as FASTA files
  44. 44. cut (Show columns)
      • cut [options] file ...
        • Show selected columns
        • Options:
          • -f <list of nums>: Show the columns listed in <list of nums>. Use the -d option to set the separator between columns. The default separator is "\t" (Tab).
          • -c <list of nums>: Show the characters at the positions listed in <list of nums>.
        • Examples of "list of nums"
          • 1,3,5: 1st, 3rd and 5th columns
          • 1-5: from the 1st to the 5th column
          • 1,3,5-: 1st, 3rd, and from the 5th to the last column
      • This command is also frequently used to handle NGS files.
  45. 45. Sort
      • sort [options] file ...
        • Arrange the lines of a file in alphabetical order
        • Options:
          • -r: reverse order
          • -n: order by numerical value
          • -k POS: order according to the POS-th column. By default, columns are separated by blanks (spaces or tabs); we can change the delimiter with the "-t" option.
          $ cat text2.txt
          Hello!
          Thank you!
          Good bye!
          $ sort text2.txt
          Good bye!
          Hello!
          Thank you!
          $ sort -r text2.txt
          Thank you!
          Hello!
          Good bye!
  46. 46. Examples of cut and sort
          $ cat nums.tab
          11.2 13.2
          10.9 7.7
          15.2 7.0
          9.4 10.9
          8.8 9.1
      • cut
          $ cut -f1 nums.tab
          11.2
          10.9
          15.2
          9.4
          8.8
          $ cut -f1 -d . nums.tab
          11
          10
          15
          9
          8
          $ cut -c1-3 nums.tab
          11.
          10.
          15.
          9.4
          8.8
      • sort
          $ sort -n nums.tab
          8.8 9.1
          9.4 10.9
          10.9 7.7
          11.2 13.2
          15.2 7.0
          $ sort -n -k2 nums.tab
          15.2 7.0
          10.9 7.7
          8.8 9.1
          9.4 10.9
          11.2 13.2
          $ sort nums.tab
          10.9 7.7
          11.2 13.2
          15.2 7.0
          8.8 9.1
          9.4 10.9
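The difference between `sort` and `sort -n` is the difference between comparing text and comparing numbers. The sketch below mimics the cut and sort examples in Python on the same nums.tab content (columns separated by a tab); it illustrates the behavior and is not a replacement for the commands.

```python
NUMS_TAB = "11.2\t13.2\n10.9\t7.7\n15.2\t7.0\n9.4\t10.9\n8.8\t9.1\n"

lines = NUMS_TAB.splitlines()
rows = [line.split("\t") for line in lines]

first_column = [row[0] for row in rows]                          # cut -f1
first_three_chars = [line[:3] for line in lines]                 # cut -c1-3
alphabetical = sorted(lines)                                     # sort
numeric = sorted(rows, key=lambda row: float(row[0]))            # sort -n
by_second_column = sorted(rows, key=lambda row: float(row[1]))   # sort -n -k2

print(first_column)
print(alphabetical)
print(numeric)
print(by_second_column)
```

In the alphabetical result "10.9" comes before "8.8" because the character "1" sorts before "8"; numerical sorting is needed whenever a column holds positions, counts or scores.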
  47. 47. Exercise (2)
      • Generate two files, "test1.txt" and "test2.txt".
      • Run the cat, head and tail commands following the examples.
      • Generate the file "nums.txt".
        • The character between the numbers (columns) is a tab.
      • Test the cut and sort commands following the examples.
  48. 48. Redirect (>)
      • command > file
        • Save the command's result into "file".
        • Overwrites the contents of the file.
        • The following command saves the result of "sort -n nums.tab" into "nums_sort.tab":
            $ sort -n nums.tab > nums_sort.tab
      • command >> file
        • Append the command's result to "file".
            $ sort -n nums.tab >> nums_sort.tab
  49. 49. Pipe (|)
      • command1 | command2
        • Run command2 with command1's result as its input
          $ sort -n nums.tab
          8.8 9.1
          9.4 10.9
          10.9 7.7
          11.2 13.2
          15.2 7.0
          $ sort -n nums.tab | cat -n
               1 8.8 9.1
               2 9.4 10.9
               3 10.9 7.7
               4 11.2 13.2
               5 15.2 7.0
          $ sort -n nums.tab | cat -n | head -n2
               1 8.8 9.1
               2 9.4 10.9
      • "$ sort -n nums.tab | cat -n" produces the same result as
          $ sort -n nums.tab > nums_sort.tab
          $ cat -n nums_sort.tab
  50. 50. Commands used with pipe
      • sort, cut
      • less
      • wc [options] file...
        • Word Count
        • Show the number of lines, words and characters.
          $ sort nums.tab | less
          $ wc nums.tab
          5 10 45 nums.tab        (#lines #words #chars)
          $ wc -l nums.tab
          5 nums.tab              (shows only the number of lines)
  51. 51. gzip and bzip2
      • Source code and sample datasets are often provided as tar + gzip/bzip2 files.
        • gzip/bzip2 alone is used for a single file.
      • "tar" can generate a single file containing files and folders.
      • gzip/bzip2 can compress a file.
        • gzip is the most frequently used. bzip2 files are smaller than gzip files.
          $ ls -lh chr21.fa.gz
          -rw-r--r-- 1 sesejun sesejun 12M May 20 15:09 chr21.fa.gz
          $ gzip -d chr21.fa.gz      (decompress chr21.fa.gz and generate chr21.fa)
          $ ls -lh chr21.fa
          -rw-r--r-- 1 sesejun sesejun 47M May 20 15:09 chr21.fa
          $ bzip2 chr21.fa           (compress)
          $ ls -lh chr21.fa.bz2
          -rw-r--r-- 1 sesejun sesejun 9.7M May 20 15:09 chr21.fa.bz2
  52. 52. tar (Tape ARchive)
      • Generates a single file containing files and folders.
      • Frequently used with gzip/bzip2
      • Remember the following idioms!
        • We will use them to install programs for analyzing NGS data.
      • With gzip:
          1. $ gzip -dc file.tar.gz | tar xvf -
          2. $ tar zxvf file.tar.gz
      • With bzip2:
          1. $ bzip2 -dc file.tar.bz2 | tar xvf -
        • Some older tar implementations have no option to decompress bzip2; GNU tar also accepts "tar jxvf file.tar.bz2".
  53. 53. grep (g/re/p)
      • grep [options] pattern file ...
      • Print lines matching a pattern
      • Options:
        • -v: print non-matching lines
        • -e <regular expression>: select lines with a regular expression
      • Regular expression
        • A notation for describing patterns of character sequences
        • ^: the beginning of a line
        • $: the end of a line
        • Supported by most programming languages. Very useful for handling various formats, including DNA/protein sequences.
          $ cat nums.tab
          11.2 13.2
          10.9 7.7
          15.2 7.0
          9.4 10.9
          8.8 9.1
          $ grep "7" nums.tab
          10.9 7.7
          15.2 7.0
          $ grep -v "7" nums.tab
          11.2 13.2
          9.4 10.9
          8.8 9.1
          $ grep -e "^1" nums.tab
          11.2 13.2
          10.9 7.7
          15.2 7.0
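The slide notes that regular expressions are supported by most programming languages. As a small illustration, the sketch below reproduces the three grep examples with Python's standard `re` module; the `grep` function here is a toy stand-in, not an interface to the real command.

```python
import re


def grep(pattern, lines, invert=False):
    """Return lines that match `pattern` (or that do not, if invert=True)."""
    regex = re.compile(pattern)
    return [line for line in lines if bool(regex.search(line)) != invert]


nums_tab = ["11.2\t13.2", "10.9\t7.7", "15.2\t7.0", "9.4\t10.9", "8.8\t9.1"]

print(grep("7", nums_tab))                # like: grep "7" nums.tab
print(grep("7", nums_tab, invert=True))   # like: grep -v "7" nums.tab
print(grep("^1", nums_tab))               # like: grep -e "^1" nums.tab
```

The same "^" anchor is what selects FASTA annotation lines with the pattern "^>".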
  54. 54. Exercise (3)
      • Use "TAIR10_chr1.fas"
        • A. thaliana chromosome 1 sequence
      • Select the annotation lines from the FASTA format.
        • FASTA format
          • A line starting with ">" is the annotation of a sequence.
          • The lines following the annotation contain the nucleotide or amino acid sequence.
        • To select the annotations, select lines starting with ">".
      • Count the number of nucleotides in (multi-)FASTA format.
        • Lines containing nucleotides do not start with ">".
        • Number of nucleotides = number of characters
          • Use the "wc" command.
          • Note that the end of each line contains a "Return" (newline) character.
      Example of FASTA format:
        >gi|29028877|gb|BT005883|U23535
        ATGGAAAGCAAAGGAAGAATCCATCCATCTCATCATCATATGAGGCGTCCTCTTCCAGGTCCCGGTGGCTGTATAGCGCA
        TCCGGAGACTTTCGGTAATCACGGTGCTATACCACCTTCTGCTGCTCAAGGTGTGTATCCTTCCTTCAACATGTTACCTC
        CACCTGAAGTTATGGAGCAAAAGTTTGTGGCACAACACGGGGAATTACAGAGACTTGCTATAGAGAATCAGAGACTTGGT
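For comparison with the grep and wc approach of this exercise, here is one possible sketch in Python that counts nucleotides per sequence in a (multi-)FASTA file. It strips the newline at the end of each line, which is the detail the exercise warns about. The function name and the toy input are made up for illustration.

```python
import io


def fasta_lengths(handle):
    """Return {sequence name: number of residues} for a (multi-)FASTA file."""
    lengths = {}
    name = None
    for line in handle:
        line = line.rstrip("\r\n")  # do not count the end-of-line characters
        if line.startswith(">"):
            name = line[1:].split()[0]  # first word of the annotation line
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


toy_fasta = io.StringIO(">Chr1 toy record\nACGTAC\nGT\n>Chr2\nTTTT\n")
print(fasta_lengths(toy_fasta))
```

To run it on the lecture data, one would pass an opened file instead, for example `open("TAIR10_chr1.fas")`.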
  55. 55. Let’s start NGS analysis!
      • Dataset
        • TAIR 10 genome (A. thaliana)
        • 1/100-scale SOLiD RNA-Seq read sets
          • File names: tha_reads.csfasta & tha_reads_QV.qual
            • SRR038985: 41,117,124 reads, 1,439,099,340 bp
            • http://trace.ddbj.nig.ac.jp/DRASearch/experiment?acc=SRX018529
          • File names: lyr_reads.csfasta & lyr_reads_QV.qual
            • SRR038987: 41,340,154 reads, 1,446,905,390 bp
            • http://trace.ddbj.nig.ac.jp/DRASearch/experiment?acc=SRX018531
        • 1/10-scale Roche 454 read set (SRR020799)
          $ grep -e "^>" tha_reads.csfasta | wc -l
          411171
  56. 56. Installing BWA
      • In this lecture, we skip this procedure because our computers do not have the "gcc" command needed to compile C programs.
      • Download BWA
        • http://bio-bwa.sourceforge.net/
        • bwa-0.5.8c.tar.bz2 is on the USB memory. Copy the file.
      • Extract the file
      • Move into the BWA directory
      • Compile the source programs
      • Make the alias name "bwa" for the bwa-0.5.8c directory
          # $ curl -O http://switch.dl.sourceforge.net/project/bio-bwa/bwa-0.5.8c.tar.bz2
          # $ bzip2 -dc bwa-0.5.8c.tar.bz2 | tar xvf -
          # ...filenames...
          # $ ln -s bwa-0.5.8c bwa # Simplify the directory name
          # $ cd bwa
          # $ make
          # ...compile messages...
          # $ cd .. # back to working directory
  57. 57. Prepare A. thaliana Genome
      • Download chromosomes from the TAIR site
        • http://www.arabidopsis.org/
        • Find the URLs by selecting the "Download" tab > Sequences > whole_chromosomes
        • In the current version, each file contains one chromosome.
          • TAIR10_chr1.fas, TAIR10_chr2.fas, TAIR10_chr3.fas, TAIR10_chr4.fas, TAIR10_chr5.fas, TAIR10_chrC.fas, TAIR10_chrM.fas
        • Because of limited server and network capacity, these files are distributed on USB memory or via the web site for this lecture.
      • Concatenate these chromosomes, except chloroplast and mitochondria, into a single file
  58. 58. Prepare A. thaliana Genome (commands)
          # We skip this process
          #$ curl -O "ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/TAIR10_chr[1-5].fas"
          ## 1-5 means consecutive numbers from 1 to 5.
          ## We do not use the chloroplast and mitochondria genomes.
          # Instead of downloading, we use the files on the USB memory.
          # The files are in your working directory.
          # Check them with the command below.
          $ ls TAIR10*
          TAIR10_chr1.fas TAIR10_chr3.fas TAIR10_chr5.fas
          TAIR10_chr2.fas TAIR10_chr4.fas
          # Concatenate all chromosomes into a single file
          $ cat TAIR10_chr1.fas TAIR10_chr2.fas TAIR10_chr3.fas \
            TAIR10_chr4.fas TAIR10_chr5.fas > TAIR10_chr_all.fas
          # Check the result
          $ grep -e "^>" TAIR10_chr_all.fas
          >Chr1 CHROMOSOME dumped from ADB: Jun/20/09 14:53; last updated: 2009-02-02
          >Chr2...
          # You can find the annotations of the 5 chromosomes
  59. 59. Run BWA
      • Make an index of the genome sequence
        • For SOLiD reads, the "-c" option is required.
        • This step is needed only once as long as you use the same genome (it does not depend on the read sequences).
      • Convert the color-space reads into the BWA-specific format
        • You don't need this step for Illumina reads.
          • Illumina sequencers produce FASTQ files, and most alignment software can handle them directly.
      • Map the reads against the genome sequence
        • If you use Illumina reads, the -I option may be required. Check your Illumina version.
      • The two steps above may take a long time. This lecture's toy data is 1/100 scale; real data will require more than two hours.
          $ ./bwa/bwa index -c TAIR10_chr_all.fas
          # running messages. Takes more than 3 mins.
          $ python csfasta2fastq.py --bwa tha_reads > tha_reads.bwa
          $ ./bwa/bwa aln -c TAIR10_chr_all.fas tha_reads.bwa > tha_reads.sai
          # messages...about 1min. Alignment phase.
  60. 60. Run BWA (continued)
      • Convert the mapping result into SAM format.
        • For paired-end experiments, use "sampe" instead of "samse" so that mate-pair information is written into the SAM file.
      • That's all! Check the contents of the SAM file with the less command.
        • How many reads were mapped to the genome?
          $ ./bwa/bwa samse TAIR10_chr_all.fas tha_reads.sai tha_reads.bwa > tha_reads.sam
          # messages. Generate summary of alignment.
          # If you have paired-end reads, use sampe instead of samse.
          $ less tha_reads.sam
          # Press "q" to quit the less command.
          # Press "space" for the next page.
  61. 61. Inside of SAM file
      • Chromosome (mapped database) information (@SQ lines), and the program used with its variables (@PG line):
          @SQ SN:Chr1 LN:30427671
          @SQ SN:Chr2 LN:19698289
          @SQ SN:Chr3 LN:23459830
          @SQ SN:Chr4 LN:18585056
          @SQ SN:Chr5 LN:26975502
          @PG ID:bwa PN:bwa VN:0.5.9-r16
      • Read mapped in the forward direction on Chr5:
          SRR038985.100 0 Chr5 22828962 37 33M * 0 0 GCCGGTGATGTAATCAAAATATTTGCTACTCTT WZYTWWTW]YVUOW]OEKNUUX]PJSRY][63 XT:A:U CM:i:0 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:33
          SRR038985.200 0 Chr3 14197678 0 33M * 0 0 ACCTGGTTGATCCTGCCAGTAGTCATATGCTTG X]]KN]]YWUX]XIKYRCHSUYX[[SNQJL[MO XT:A:R CM:i:0 X0:i:2 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:33 XA:Z:Chr2,+3707,33M,0;
      • Unmapped read:
          SRR038985.300 4 * 0 0 * * 0 0 AAACTGCGGGGTCTCACTTTTTTGGGTTTGGGGT 124,/08/5&6-&,(;/4+%7,+5.:1,*;8:&
  62. 62. Exercise (4)
      • Run BWA
      • Compare the file size of the csfasta + qual files with that of the generated SAM file.
        • Which is larger? How much disk space do we need for the analysis?
      • Check the details of the SAM file
        • The format details are described in http://samtools.sourceforge.net/SAM1.pdf
      • How many reads are mapped onto chromosomes?
        • Select lines containing "Chr" # use grep
        • Then, count the number of lines # use wc
      • Calculate the ratio of mapped reads to total reads.
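As a cross-check for the grep and wc approach, the mapped ratio can also be computed from the FLAG column. This is only one possible sketch: it skips header lines (which start with "@" and also contain "Chr") and treats a read as mapped when the unmapped bit 0x4 is not set. The toy SAM text below is invented for the example.

```python
import io


def count_mapped(handle):
    """Return (mapped reads, total reads) for a SAM file."""
    mapped = total = 0
    for line in handle:
        if line.startswith("@"):
            continue  # header line, not a read
        flag = int(line.split("\t")[1])
        total += 1
        if not flag & 0x4:  # bit 0x4 set means the read is unmapped
            mapped += 1
    return mapped, total


toy_sam = "\n".join([
    "@SQ\tSN:Chr1\tLN:30427671",
    "read1\t0\tChr5\t22828962\t37\t33M\t*\t0\t0\tACGT\tIIII",
    "read2\t16\tChr3\t14197678\t0\t33M\t*\t0\t0\tACGT\tIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]) + "\n"

mapped, total = count_mapped(io.StringIO(toy_sam))
print(mapped, total, mapped / total)
```

For the lecture data, the handle would be `open("tha_reads.sam")`; comparing the result with the grep count shows whether header lines were counted by mistake.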
  63. 63. Problems
      • The ratio of mapped reads may be much lower than expected.
        • Genome quality is (probably) high.
      • Various problems
        • Wet problems
          • Protocols and reagents
          • Mitochondria and chloroplast
        • Dry problems
          • We used all sequences. We may need to remove low-quality reads.
          • Sequence quality at the 3'-end is low. We might trim these sequences.
          • We did not take care of reads on splice junctions.
          • We did not change any parameters in BWA. The parameters might not be suitable for our reads.
          • No single setting gives good results in every case.
      • Note!!! The mapped ratio of current RNA-Seq reads is (much) higher than this result.
