Next Generation Sequencing for
Model and Non-Model Organisms
           1st day
       Jun Sese and Kentaro Shimizu
          sesejun@cs.titech.ac.jp

       Ph.D course @ Univ. of Zurich
               25/05/2011
Today’s Menu
•   Lecture
    •   Overview of next-generation sequencing analysis
    •   Mapping: Sequence alignment
    •   Introduction to UNIX to handle NGS data
•   Exercise
    •   UNIX commands
    •   Mapping real short reads against genomes
    •   Compute statistics of the mapped reads




Various Types of Sequencers
•   Roche 454, Ion Torrent
    •   Roche 454: about 400 bp; Ion Torrent: about 200 bp
    •   Suitable for de novo sequencing
•   Illumina HiSeq
    •   Widely used next-generation sequencer
    •   100 bp x 2, up to 600 Gb/run (HiSeq 2000)
    •   MiSeq uses almost the same technology but produces fewer reads
•   ABI SOLiD
    •   75 bp, 75 bp + 35 bp, or 60 bp x 2, up to 300 Gb/run
        (5500xl SOLiD)
    •   Color space
•   Pacific Biosciences PacBio RS
    •   Average read length > 500 bp
    •   Sequence quality is not high.
Sequencing cost has dropped dramatically

              Lincoln Stein, Genome Biology, vol. 11(5), 2010
How large is it?

•   A single run generates more than 300 GB of files.
•   A hard disk can be read at about 100 MB/sec.
•   300 GB / (100 MB/sec)
          = 300,000 MB / (100 MB/sec)
          = 3,000 sec
          = 50 min
•   Just reading the data from the HDD takes the computer 50 min!
      •   Efficient computation is required.
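
The estimate above can be reproduced in a few lines. This is only a sketch: both inputs are the round figures from the slide (with 1 GB taken as 1,000 MB), not measurements, and the function name is made up for illustration.

```python
# Back-of-the-envelope time needed just to read one run's output from disk.
RUN_SIZE_MB = 300_000          # 300 GB per run, assuming 1 GB = 1,000 MB
READ_SPEED_MB_PER_SEC = 100    # assumed sequential read speed of a hard disk


def read_time_seconds(size_mb, speed_mb_per_sec):
    """Return the seconds needed to stream size_mb at speed_mb_per_sec."""
    return size_mb / speed_mb_per_sec


seconds = read_time_seconds(RUN_SIZE_MB, READ_SPEED_MB_PER_SEC)
minutes = seconds / 60
print(f"{seconds:.0f} sec = {minutes:.0f} min")
```

Any analysis step that scans the data more than once multiplies this time, which is why streaming tools such as head, cut and grep are used later in the lecture.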



Applications of DNA Sequencing
 •   NGS just reads an enormous number of short sequences, but it has
     many biological applications.
 •   Genetic variation
 •   Gene regulation
     •   RNA-seq
     •   ChIP-seq
 •   Epigenetics
 •   Population genetics

                                                   Science 2007
[Workflow diagram]
Sequencer's Output + Genome Sequence
    -> Mapping Program
    -> Mapping Result
    -> Visualization / Further Analysis (SNPs, RNA-Seq, ...)
Major Pipelines of NGS
   •       Most applications use a similar procedure.

                      Genetic variation   RNA-Seq              ChIP-Seq
   Find originated    Map (alignment)     Map                  Map
   region
   Filter             SNP call            Measure expression   Check regulatory regions
   Analysis           Find differences    Same as microarray   Same as ChIP-chip analysis

   Most of them require a whole genome sequence to map reads.
Mapping (Pairwise Alignment)
  •   Find the place from which each read comes
      •  BLAST is one of the best-known alignment programs.
      •  Few NGS analyses use BLAST or BLAT because they align too slowly.
      •  BWA and Bowtie are used to map short reads.

  Read                ATATGCGA
  Reference GATGCTAAGCATATGCGAGGCATGCCATATGGATG

We may find multiple candidate positions.
A score matrix (distance) defines which mapping is better.

  Read                ATATGCGA        ATATG-CGA
                       x
  Reference GATGCTAAGCAAATGCGAGGCATGCCATATGGCGA
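
To make the idea of "finding the place" concrete, here is a brute-force sketch that slides the read along the reference and counts mismatches. It is not how BWA or Bowtie work (they use an index), it does not handle gaps such as the ATATG-CGA case, and the function name hamming_hits is invented for this example.

```python
def hamming_hits(read, reference, max_mismatches=1):
    """Return (position, mismatches) for every ungapped placement of read
    on reference that has at most max_mismatches mismatching bases.
    Positions are 0-based offsets into the reference."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


read = "ATATGCGA"
exact_reference = "GATGCTAAGCATATGCGAGGCATGCCATATGGATG"
variant_reference = "GATGCTAAGCAAATGCGAGGCATGCCATATGGCGA"

# Exact matching only.
print(hamming_hits(read, exact_reference, max_mismatches=0))
# Allowing one substitution recovers the placement marked with "x".
print(hamming_hits(read, variant_reference, max_mismatches=1))
```

The cost is proportional to (reference length) x (read length) per read, which is far too slow for billions of reads against a whole genome; that is the motivation for indexed mappers.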
For non-model organisms
                 Genetic Variation    ChIP-Seq                   RNA-Seq
 Genome/Gene     Read genome,         Read genome,               Read normalized library,
 Sequence        genome assembly      genome assembly            RNA assembly
                                                                 (or map onto a related species' genome)
 Map             Map new reads        Map ChIP-Seq reads         Count assembled reads,
                                                                 or map new RNA-Seq reads
 Filter          SNP call             Check regulatory regions   Measure expressions
 Analysis        Find difference      Similar to ChIP-chip       Same as microarray

 Most cases require genome assembly, which is experimentally and
 computationally costly.
Very Short History of
        Pairwise Alignment Programs
•   More than 100 alignment programs are listed in Wikipedia!!!
    •   http://en.wikipedia.org/wiki/Sequence_alignment_software
•   1 sequence vs 1 sequence
    •   Ssearch, FASTA [Lipman and Pearson. 1985]
•   1 sequence vs Whole genes
    •   BLAST [Altschul et al. 1990]
•   Thousands of sequences vs Whole genes or Whole genomes
    •   BLAT [Kent. 2002]
•   Billions of short sequences vs Whole genome
    •   BWA, Bowtie, SHRiMP, etc...
        •   Most modern mappers use the FM-index [Ferragina and
            Manzini. 2000] with the Burrows-Wheeler transform [Burrows
            and Wheeler. 1994].
Why have so many alignment
 programs been developed?
•   To computer scientists, alignment seems an easy task.
    •   Both indexing and dynamic programming, which are used in
        sequence alignment, are basic algorithms.
    •   Good problem for homework
    •   A little performance tuning can accelerate execution
        speed dramatically
•   In reality, the alignment problem is very hard to solve.
    •   Mutations, insertions, deletions...
    •   Each sequencer has its own bias.
        •   Sequence length, homopolymers in Roche 454...
    •   Biologists use many heuristics!
        •   GT-AG rule at splice sites, but not always...
    •   That is, the problem definition is ambiguous!
Alignment performance varies
•   Aligned 12 million single-end reads against the human genome
    sequence (hg18)
•   Differences in algorithm and implementation show up in the total
    processing time
  •      In most programs, memory use depends on genome size.
•   Parameter settings affect the number of mapped reads.
  •      The authors did not mention them.
  •      In real experiments, we have to tune the parameters of the
         alignment program.

Bao et al. J Hum Genet, 2011
[Workflow diagram]
Sequencer's Output (Sequence Format) + Genome Sequence
    -> Mapping Program (BWA, Bowtie, etc.)
    -> Mapping Result
    -> Visualization
Sequence File Format (1)
               •    FASTA + Quality File
                   •  Used by Roche 454
>1ST_SEQ length=67 xy=1264_0441 region=1 run=R_2010_07_07_16_23_16_
GCGTTGTGTATGTCTCCTTTGGTATGTCAGGTTTCGTCAGAAGCTTCTATCAAACGGCGC
ACAGTGA
>2ND_SEQ length=88 xy=1264_0564 region=1 run=R_2010_07_07_16_23_16_
TCGGCCCTATCCGAGAAGGCGTGGTGTATCTCTCTTCTGGTATGCCACGTTACGCAGCAG
CTTCTTCCCAAGACACAGAGCGAGTAAG




>1ST_SEQ   length=67 xy=1264_0441 region=1 run=R_2010_07_07_16_23_16_
37 35 35   35 35 35 37 37 37 37 37 39 39 37 36 35 35 36 37 37 37 37 35   35 32 28 27 27 27 27
29 23 21   21 14 14 12 18 19 19 19 19 19 19 16 16 17 20 22 20 12 12 12   12 11 17 17 17 16 19
22 23 24   21 21 21 18
>2ND_SEQ   length=88 xy=1264_0564 region=1 run=R_2010_07_07_16_23_16_
29 30 19   19 19 20 19 24 28 27 27 27 27 27 30 19 19 20 20 20 24 33 33   33 33 33 33 33 35 35
37 37 30   30 30 30 32 32 32 32 35 32 32 32 32 33 33 33 33 20 20 20 23   27 30 30 31 31 27 27
27 27 28   23 24 24 23 23 23 24 24 21 17 19 19 18 27 18 17 16 16 16 17   13 18 17 16 12
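
The two files are linked only by read name and by position: the n-th quality value belongs to the n-th base. A minimal sketch of pairing them follows. The function names are invented here, the records are toy data rather than real 454 output, and real tools do considerably more validation.

```python
def read_records(lines):
    """Yield (name, payload_lines) for every record introduced by a '>' line."""
    name, payload = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, payload
            # The read name is the first word after '>'.
            name, payload = line[1:].split()[0], []
        else:
            payload.append(line)
    if name is not None:
        yield name, payload


def pair_reads(fasta_lines, qual_lines):
    """Return {name: (sequence, [quality, ...])}; raise ValueError if a read
    has no quality record or the two lengths disagree."""
    sequences = {name: "".join(rows) for name, rows in read_records(fasta_lines)}
    qualities = {
        name: [int(value) for row in rows for value in row.split()]
        for name, rows in read_records(qual_lines)
    }
    paired = {}
    for name, sequence in sequences.items():
        scores = qualities.get(name)
        if scores is None or len(scores) != len(sequence):
            raise ValueError(f"sequence and quality disagree for {name}")
        paired[name] = (sequence, scores)
    return paired


# Toy data; both records of READ_A wrap over two lines, as in the slide.
fasta_lines = [">READ_A length=6", "ACGT", "AC", ">READ_B length=3", "GGA"]
qual_lines = [">READ_A length=6", "37 35 35 30", "21 14",
              ">READ_B length=3", "29 30 19"]

reads = pair_reads(fasta_lines, qual_lines)
print(reads["READ_A"])
```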




Sequence File Format (2)
 •    FASTQ
     •  Used by Illumina sequencers
     •  Sequence database sites (SRA (Short Read Archive), ENA
        (European Nucleotide Archive), and DRA (DDBJ Sequence Read
        Archive)) provide sequences in this format.
     •  De facto standard
 •    CSFasta + quality file
     •  Used only by SOLiD sequencers
     •  Similar to a FASTA file, except that sequences are described in
        color space.
>SRR038985.100 VAB_AT1deg1_51_269_F3
T10303011231130321000333001323122221
>SRR038985.200 VAB_AT1deg1_78_430_F3
T03102101012320213012132121333132011

>SRR038985.100 VAB_AT1deg1_51_269_F3
0 20 23 21 26 20 21 23 21 20 24 25 26 20 23 19 17 27 26 10 16 16 19 23
19 26 28 9 22 18 21 25 25 23 2 20
>SRR038985.200 VAB_AT1deg1_78_430_F3
0 7 19 26 26 24 8 27 29 23 23 21 21 24 26 19 11 21 25 14 10 19 21 21
25 20 28 20 20 15 23 8 25 23 11 25
Color Space
•     Format unique to ABI SOLiD.
•     Each number represents two adjacent bases.
•     Each nucleotide is read twice.
•     A single spot detection error may change the downstream sequence.
•     Some software does not support this format.

       T10303011  ->  TGGCCGGTG
       T10203011  ->  TGGAATTGT

[Figure 1: SOLiD Color Space Code (1st base x 2nd base). Double
Interrogation: each base is defined twice. Source: ABI White Paper,
"Color Space Analysis in the SOLiD System: the Theory, Advantages and
Solutions"]
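
The two decoded reads on this slide can be checked with a short sketch. It assumes the standard SOLiD two-base code, which can be written as an XOR of 2-bit base codes (A=0, C=1, G=2, T=3); the function names are invented for illustration. It also assumes every color is 0-3 (no "." for missing calls) and, like the slide, keeps the leading primer base in the output.

```python
# 2-bit codes chosen so that color == code(first base) XOR code(second base).
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def decode_color_space(read):
    """Decode 'T10303011' (known first base + colors) into bases."""
    bases = [read[0]]
    for color in read[1:]:
        previous = BASE_TO_CODE[bases[-1]]
        bases.append(CODE_TO_BASE[previous ^ int(color)])
    return "".join(bases)


def encode_color_space(bases):
    """Inverse of decode_color_space: first base followed by colors."""
    colors = [
        str(BASE_TO_CODE[a] ^ BASE_TO_CODE[b])
        for a, b in zip(bases, bases[1:])
    ]
    return bases[0] + "".join(colors)


print(decode_color_space("T10303011"))
print(decode_color_space("T10203011"))
```

The two inputs differ in a single color (3 versus 2 at the fourth position), yet every base from that point on differs in the decoded sequences. This is why one spot detection error changes the whole downstream sequence, and why mappers for SOLiD data usually align in color space instead of decoding first.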
FASTQ Format
One read
 @SRR013343.216 :3:1:837:436               Name
 GCGTGGTATAGGAGGCGGAACGGGCGGTTGGCGGTT      Sequence
 +
 I6IIII*II*II+I:+&I)I'&%&%,+0>+'I''$G   Quality Score
 @SRR013343.217 :3:1:974:526
 GCGCATGAGTGGCTTGACTCGTATGCGGATTCCTTC
 +
 I@II6I<I/III;II+)I*II*DI*I?')+*+8/%8
 @SRR013343.218 :3:1:755:341
 GTGGAGTAGGTTAGTTGCGGATCGTATGCCGTCTTC
 +
 IIIIIIIIIIAIIIIII<II6?II3/AD26=:-9I'
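
Because every read occupies exactly four lines, a FASTQ file can be walked in steps of four. The sketch below assumes the simple, unwrapped four-line layout shown above (sequence and quality each on one line); parse_fastq is a name made up for this example.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) from unwrapped 4-line FASTQ records."""
    lines = [line.strip() for line in lines if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"length mismatch in {header}")
        yield header[1:], sequence, quality


# The last two reads from the slide.
fastq_lines = [
    "@SRR013343.217 :3:1:974:526",
    "GCGCATGAGTGGCTTGACTCGTATGCGGATTCCTTC",
    "+",
    "I@II6I<I/III;II+)I*II*DI*I?')+*+8/%8",
    "@SRR013343.218 :3:1:755:341",
    "GTGGAGTAGGTTAGTTGCGGATCGTATGCCGTCTTC",
    "+",
    "IIIIIIIIIIAIIIIII<II6?II3/AD26=:-9I'",
]

records = list(parse_fastq(fastq_lines))
for name, sequence, quality in records:
    print(name, len(sequence))
```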

PHRED quality encoding
         $Q = -10 \log_{10} P \iff P = 10^{-Q/10}$


•   Q=20: 99% accuracy, Q=30: 99.9% accuracy
    •  The quality value scale differs slightly between PHRED
       and Illumina/SOLiD results
•   Encoded in FASTQ and SAM as a quality string: Q = ASCII
    value - 33
•   For Illumina 1.3 to 1.7, the offset is 64 instead of 33
    (Q = ASCII value - 64); Illumina 1.8 and later use 33 again.

    !   33   '   39   -   45   3   51   9   57   ?   63   ...
    "   34   (   40   .   46   4   52   :   58   @   64   ...
    #   35   )   41   /   47   5   53   ;   59   A   65   ...
    $   36   *   42   0   48   6   54   <   60   B   66   ...
    %   37   +   43   1   49   7   55   =   61   C   67   ...
    &   38   ,   44   2   50   8   56   >   62   D   68   ...
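
The formula and the ASCII offset can be combined in a few lines. This is a sketch with invented function names; the offset is a parameter because, as noted above, it depends on which pipeline produced the file.

```python
import math


def error_probability(q):
    """P = 10^(-Q/10): probability that a base call with quality Q is wrong."""
    return 10 ** (-q / 10)


def phred_score(p):
    """Q = -10 log10 P."""
    return -10 * math.log10(p)


def decode_quality(quality_string, offset=33):
    """Convert a FASTQ/SAM quality string into a list of Q values."""
    return [ord(ch) - offset for ch in quality_string]


print(error_probability(20))          # Q=20 means 1 error in 100 calls
print(decode_quality("I6*"))          # offset 33
print(decode_quality("h", offset=64))  # the same Q as 'I' under offset 64
```

For instance, "I" (ASCII 73) decodes to Q=40 with offset 33, so a read whose quality line is mostly "I" has very confident base calls.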
[Workflow diagram]
Sequencer's Output (Sequence Format) + Genome Sequence
    -> Mapping Program (BWA, Bowtie, etc.)
    -> Mapping Result (Output Format)
    -> Visualization
SAM Format

    •   Sequence Alignment / Map format
        •  Simple tab-delimited text file
    •   Standardized alignment output format
    •   Modern alignment tools support this format
    •   BAM format is binary version of SAM format.


@HD VN:1.0
@SQ SN:chr20 LN:62435964
@RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
@RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195 
    AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< 
    NM:i:1 RG:Z:L1
read_28701_28881_323b 147 chr20 28834 30 35M = 28701 -168
    ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA <<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<< 
    MF:i:18 RG:Z:L2
Overview
<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> 
<ISIZE> <SEQ> <QUAL> [<TAG>:<VTYPE>:<VALUE> [...]]

read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195 
AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< 
NM:i:1 RG:Z:L1
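
Since each alignment is one tab-delimited line, splitting on tabs is enough to get at the fields. The sketch below uses the field names from this slide (newer versions of the specification call the last three mate fields RNEXT, PNEXT and TLEN); parse_sam_line is a hypothetical helper, and only the "i" (integer) tag type is converted.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}


def parse_sam_line(line):
    """Split one alignment line into a dict of the 11 mandatory fields
    plus a 'TAGS' dict for the optional TAG:VTYPE:VALUE columns."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for field in INTEGER_FIELDS:
        record[field] = int(record[field])
    tags = {}
    for item in columns[11:]:
        tag, vtype, value = item.split(":", 2)
        tags[tag] = int(value) if vtype == "i" else value
    record["TAGS"] = tags
    return record


# The example record from the slide, rebuilt with real tab characters.
line = "\t".join([
    "read_28833_29006_6945", "99", "chr20", "28833", "20", "10M1D25M",
    "=", "28993", "195",
    "AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG",
    "<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<",
    "NM:i:1", "RG:Z:L1",
])

record = parse_sam_line(line)
print(record["RNAME"], record["POS"], record["CIGAR"], record["TAGS"])
```

Header lines (those starting with "@") must be skipped before calling such a function.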




Flag




•   Bitwise notation: computer friendly (human-unfriendly
    format :)
•   16 = 0x0010: mapped to the reverse strand
•   4 = 0x0004: unmapped
•   0 = 0x0000: mapped to the forward strand
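
A FLAG value is a sum of bits, so decoding it means testing each bit in turn. The slide lists only 0x4 and 0x10; the other bits in this sketch are taken from the SAM specification, and the short names are labels chosen here for illustration, not official identifiers.

```python
# (bit, label) pairs for the lower eight FLAG bits of the SAM specification.
FLAG_BITS = [
    (0x0001, "paired"),
    (0x0002, "proper_pair"),
    (0x0004, "unmapped"),
    (0x0008, "mate_unmapped"),
    (0x0010, "reverse_strand"),
    (0x0020, "mate_reverse_strand"),
    (0x0040, "first_in_pair"),
    (0x0080, "second_in_pair"),
]


def decode_flag(flag):
    """Return the labels of all bits that are set in flag."""
    return [label for bit, label in FLAG_BITS if flag & bit]


for value in (0, 4, 16, 99, 147):
    print(value, decode_flag(value))
```

The two reads in the SAM example have flags 99 and 147: both are properly paired, and the second one is mapped to the reverse strand.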

CIGAR




•   Show alignment result simply
•   8M9D7M
    • 8 bp match, 9 bp deletion (reference bases absent from the
      read), and then 7 bp match

                     8M     9D      7M
                  CATATGCG---------ATATGGA
                  ||||||||         |||||||
         GATGCTAAGCATATGCGAGGCATGCCATATGGATG


                  The 4th column, “POS”, indicates the leftmost aligned position.
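
A CIGAR string is a run of (length, operation) pairs, so it can be parsed with one regular expression. The sketch below uses the CIGAR from the SAM example (10M1D25M); which operations consume reference or read bases follows the SAM specification, and the function names are invented for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Return [(length, op), ...]; raise ValueError on unparsable input."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"cannot parse CIGAR: {cigar!r}")
    return ops


def reference_span(cigar):
    """Reference bases covered: M, D, N, = and X consume the reference."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar):
    """Read bases described: M, I, S, = and X consume the read."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS=X")


cigar, pos = "10M1D25M", 28833
print(parse_cigar(cigar))
print("read bases:", read_length(cigar))
print("last aligned reference position:", pos + reference_span(cigar) - 1)
```

The read length derived from the CIGAR (35) agrees with the 35-base SEQ field of that record, which is a useful sanity check when reading SAM files.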
Summary
•   No standard tools for analyzing NGS data
    •  Q&A sites are good resources
        • SeqAnswers.com
        • biostar.stackexchange.com
•   Many algorithms and software tools have been
    developed.
    •  See http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html
•   Most of them work on the UNIX command line
•   Few analysis tools have a GUI
    •  Galaxy (free, requires server setup)
    •  BioScope (only available with SOLiD sequencers)
Unix Commands
[Workflow diagram]
Sequencer's Output (Sequence Format) + Genome Sequence
    -> Mapping Program (BWA, Bowtie, etc.): performed with UNIX commands
    -> Mapping Result (Output Format)
    -> Visualization
Preparation
•   NGS procedures generate many files.
    •  Even in this lecture, we will generate 50 files.
•   We use the directory generated by extracting “ngslec.zip”.
    •  Extract the zip file in your home directory.
•   To move to the directory, type the following commands
    in Terminal


        $ cd ngslec
        $ pwd
        /Users/YOUR_DIRECTORY/ngslec/




Use “Terminal”
•   The operating system (OS) handles every operation on the computer.
    •  Reading files, mouse clicks, displaying characters, ...
•   We can use the OS functions through the application “Terminal” on
    a UNIX OS
    •  Applications > Utilities > Terminal
    •  UNIX: Linux, IBM AIX, Sun OS, Mac OS X
        • but not Windows or Mac OS 9 and earlier
•   In the terminal, we can use shell commands.
•   An application is built as a sequence of shell commands.
    •  A complicated program is made of a set of tiny programs.
    •  We first learn how to use the tiny programs, and then how to
       combine them.

        Kernel           Shell     Terminal
Command and Arguments
         $ rm -r arg1 arg2



(A) Command: run a command called “rm”
(B), (C) and (D) Arguments: separated by a space character from
the command and from each other
(B) Arguments that change the behavior of the command are
called “options.” Options start with “-” or “--”
(C) First argument. Options are not counted when numbering arguments.
(D) Second argument.
Example: date command
•   Input “date” + [Return] to show current time

•   With option “-u”, “date” command shows
    Coordinated Universal time.
•   If you misspell a command, the terminal says “command
    not found.”
•   Commands (and file names) are case sensitive on
    UNIX, except on Mac OS X.
File System
•   You normally use the file system through “Finder.” In this lecture,
    we will use it from “Terminal.”
•   Tree structure rooted by “/”
•   USB memories and DVDs are also managed through file system.


                                          /


                                   usr           Volume


                             bin         lib      pics


                                         USB     zurich
Directories and Files
•   Current directory
    • The directory in which you are working
    • You can check it with the “pwd” command.
•   Home directory
    • Root (top) of your personal directory
    • Denoted by “~” or “$HOME”
•   When your current directory is “/Users/sesejun”
    • pwd command shows /Users/sesejun
    • /usr/lib indicates *
    • usr/lib indicates **
    • “.” is equal to “/Users/sesejun”
    • “..” is equal to “/Users”
    • “../../usr/lib” is equal to “/usr/lib”

            /
      usr       Users
    bin  lib*     sesejun
                    usr
                      lib**
cd: Change Directory
• cd destination-dir
 • Change your current directory to destination-dir
 • With no argument, move to your home directory.

jsmbp:~ sesejun$ pwd
/Users/sesejun
jsmbp:~ sesejun$ cd /usr/
jsmbp:/usr sesejun$ pwd
/usr
jsmbp:/usr sesejun$ cd lib
jsmbp:/usr/lib sesejun$ pwd
/usr/lib
jsmbp:/usr/lib sesejun$ cd /usr/bin/
jsmbp:/usr/bin sesejun$ pwd
/usr/bin
jsmbp:/usr/bin sesejun$ cd
jsmbp:~ sesejun$ pwd
/Users/sesejun
jsmbp:~ sesejun$
ls (LiSt): Show List of Files
  •   With no arguments, show the files in the current directory
  •   Important options
      • -a: Show all files (files whose names start with “.” do not
        appear without this option)
      • -l: Show detailed information about files
      • -h: Show file sizes in a human-friendly format (usually used
        with option “-l”)
$ ls
Desktop                   Music                 largefile
$ ls -l
drwx------+   8 sesejun    staff        272 5 16 00:09 Desktop
drwx------+   3 sesejun    staff        102 10 27 2010 Movies
-rw-r--r--    1 sesejun    staff    4181139 5 16 08:20 largefile
$ ls -lh
drwx------+   8 sesejun    staff       272B 5 16 00:09 Desktop
drwx------+   3 sesejun    staff       102B 10 27 2010 Movies
-rw-r--r--    1 sesejun    staff       4.0M 5 16 08:20 largefile
cp: Copy Files
 •    cp [options] source-file ... directory
 •    cp [options] source-file new-file
 •   Options:
 •   Copy text1.txt to text2.txt
$ cp text1.txt text2.txt

 •   Copy text1.txt and text2.txt into the “tmp” directory

$ cp text1.txt text2.txt tmp/
$ ls tmp
text1.txt       text2.txt




mv: Move files
    •   Also used to change file names
    •    mv [options] source-file ... directory
    •    mv [options] old-path new-path
    •   Change filename text1.txt to text2.txt
$ mv text1.txt text2.txt



    •   Move text1.txt and text2.txt into tmp directory

$ mv text1.txt text2.txt tmp/
$ ls
tmp
$ ls tmp/
text1.txt       text2.txt



rm (ReMove): Delete files
•   Options:
    • -r: Remove a directory and all the files in it
    • -i: Confirm before removing each file.
•   Delete text1.txt and text2.txt
jsmbp:~ sesejun$ rm text1.txt text2.txt


•   Delete all the files within tmp directory
    • Note: These files are “really” removed. They never
      go to “Trash.” We cannot use undo.
jsmbp:~/test   sesejun$ ls
tmp
jsmbp:~/test   sesejun$ ls tmp/
text1.txt         text2.txt
jsmbp:~/test   sesejun$ rm -r tmp/
jsmbp:~/test   sesejun$ ls
jsmbp:~/test   sesejun$
Exercise (1)
•   Run commands
    • Run date and date -u, and check the results.
    • Run the command “cal”. What is the result?
•   Change directory
    • Run the examples on the “cd” slide
•   Check make and remove directory
    • Open your login name directory in Finder.
    • Move to your home directory in Terminal.
        • Just open Terminal.
    • Run ls and compare the result with what Finder shows.
Note
•   Commands and messages in Terminal are shown in
    “Courier Font”
    •  Lines starting with “#” are comment lines. You do not
       need to type them in Terminal.
    •  Lines whose last character is “\” continue on the next line.
       Enter the multiple lines as one line.
•   You can run commands with “cut and paste.”
•   When you do, the double quotation (") character may cause trouble
    because the pasted text can contain a different (typographic)
    quote character. Retyping the double quotation solves the problem.
•   Bar (|) can be input by Alt + 7.
•   In Terminal, you can show the history of your commands by
    pressing the up cursor key.
•   The “Tab” key may complete your command or file name.
cat (conCATenate)
•       cat [options] file ...
    •   Original usage is file concatenation.
        •  Show detail later
    •   Sometimes this command is used to show the contents of a file.
    •   Options:
        •  -n: show line number

$ cat text1.txt
How are you ?
$ cat text2.txt
Hello!
Thank you!
Good Bye!
$ cat text1.txt text2.txt
How are you ?
Hello!
Thank you!
Good Bye!
$ cat -n text2.txt
     1 Hello!
     2 Thank you!
     3 Good Bye!
head, tail (Show first or last
                  part of file)
•       head [-n num] file ...
    •    Show first 10 lines
    •    -n num: show first num lines
•       tail [-n num] file ...
    •    Show last 10 lines
    •    -n num: show last num lines
        •   With +num, show the file from the num-th line to the
            last line.
•   Because NGS files are large, these commands are frequently
    used.
    • Most editors cannot open NGS files.

$ cat text2.txt
Hello!
Thank you!
Good Bye!
$ head -n2 text2.txt
Hello!
Thank you!
$ tail -n2 text2.txt
Thank you!
Good Bye!
$ tail -n+3 text2.txt
Good Bye!
less
• less      <filename>


•   Show files interactively
    • Space: Next page
    • ‘b’: Previous page
    • ‘q’: Quit
    • ‘/’ + [word]: search for [word] and jump to the first match.
      The word is highlighted.
        • To move to the next match, press ‘n’.
•   Frequently used to check the contents of a (large) file such as
    a FASTA file
cut -Show columns-
•       cut [options] file ...

    •   Show selected columns
    •   Options:
        •   -f <list of nums>: Show the <list of nums>-th columns. We
            can use the -d option to set the separator between columns.
            The default separator is “\t” (Tab).
        •   -c <list of nums>: Show the <list of nums>-th characters.
        •   Examples of “list of nums”
            •   1,3,5: 1st, 3rd and 5th columns
            •   1-5: From 1st to 5th columns
            •   1,3,5-: 1st, 3rd and from 5th to last columns.
•   This command is also frequently used to handle NGS files.
Sort
•       sort [options] file ...
    •   Arrange file contents in alphabetical order
    •   Options:
        •   -r: reverse order
        •   -n: order by numerical value
        •   -k POS: order according to the POS-th column. By default,
            columns are separated by whitespace (including tabs).
            We can change the separator with the “-t” option.

$ cat text2.txt
Hello!
Thank you!
Good bye!
$ sort text2.txt
Good bye!
Hello!
Thank you!
$ sort -r text2.txt
Thank you!
Hello!
Good bye!
$ cat   nums.tab            $ cat nums.tab
11.2       13.2             11.2     13.2
10.9       7.7              10.9     7.7
15.2       7.0              15.2     7.0
9.4        10.9             9.4      10.9
8.8        9.1              8.8      9.1
$ cut   -f1 nums.tab        $ sort -n nums.tab
11.2                        8.8      9.1
10.9                        9.4      10.9
15.2                        10.9     7.7
9.4                         11.2     13.2
8.8                         15.2     7.0
$ cut   -f1 -d . nums.tab   $ sort -n -k2 nums.tab
11                          15.2     7.0
10                          10.9     7.7
15                          8.8      9.1
9                           9.4      10.9
8                           11.2     13.2
$ cut   -c1-3 nums.tab      $ sort nums.tab
11.                         10.9     7.7
10.                         11.2     13.2
15.                         15.2     7.0
9.4                         8.8      9.1
8.8                         9.4      10.9
Exercise (2)
•   Generate two files “text1.txt” and “text2.txt”
•   Run the cat, head and tail commands according to the
    examples.
•   Generate the file “nums.tab”
    •   The character between numbers (columns) is a tab.
•   Test the cut and sort commands according to the examples.
Redirect (>)
•   command > file
    • Save the command's result into “file.”
      •  Overwrites the contents of the file.
    • The following command saves the result of “sort -n nums.tab”
      into “nums_sort.tab”
•   command >> file
    • Append the command's result to “file.”




        $ sort -n nums.tab > nums_sort.tab
        $ sort -n nums.tab >> nums_sort.tab


Pipe (|)
 •   command1 | command2
     • Run command2 with command1’s output as its input
$ sort -n nums.tab
8.8      9.1
9.4      10.9
10.9     7.7
11.2     13.2
15.2     7.0
$ sort -n nums.tab | cat -n
     1   8.8     9.1
     2   9.4     10.9
     3   10.9    7.7
     4   11.2    13.2
     5   15.2    7.0
$ sort -n nums.tab | cat -n | head -n2
     1   8.8     9.1
     2   9.4     10.9

$ sort -n nums.tab | cat -n
produces the same result as
$ sort -n nums.tab > nums_sort.tab
$ cat -n nums_sort.tab
Commands used with pipe
    •   sort, cut
    •   less
    •   wc [options] file...
        •  Word Count
        •  Show number of lines, words and characters.


$ sort nums.tab | less
$ wc nums.tab
        5      10    45 nums.tab
      #lines  #words #chrs
$ wc -l nums.tab
        5 nums.tab                 Show only number of lines




gzip and bzip2
•   Source code and sample datasets are provided as tar and
    gzip/bzip2 files.
    •  For a single file, gzip/bzip2 alone is used.
•   “tar” can generate a single file containing files and folders.
•   gzip/bzip2 can compress a file
    •  gzip is the most frequently used. bzip2 files are usually
       smaller than gzip files.

$ ls -lh chr21.fa.gz
-rw-r--r-- 1 sesejun sesejun 12M May 20 15:09 chr21.fa.gz
$ gzip -d chr21.fa.gz                   Decompress chr21.fa.gz and
                                            generate chr21.fa.
$ ls -lh chr21.fa
-rw-r--r-- 1 sesejun sesejun 47M May 20 15:09 chr21.fa
$ gzip chr21.fa                         Compress


$ ls -lh chr21.fa.bz2
-rw-r--r-- 1 sesejun sesejun 9.7M May 20 15:09 chr21.fa.bz2
tar (Tape ARchive)
•   Generate single file containing files and folders.
•   Frequently used with gzip/bzip2
•   Remember the following idioms!
    • We will use this to install programs to analyze NGS data.


with gzip
1. $ gzip -dc file.tar.gz | tar xvf -

2. $ tar zxvf file.tar.gz


with bzip2
1. $ bzip2 -dc file.tar.bz2 | tar xvf -

2. $ tar jxvf file.tar.bz2

                Form 2 works only with tar versions that support the
                “j” option (GNU tar does); form 1 always works.
grep (g/re/p)
•   grep [options] pattern file ...
•   Print lines matching pattern
•   Options:
    •  -v: print non-matching lines
    •  -e <regular expression>: select lines with a regular expression
•   Regular expression
    •  Specific pattern to express a character sequence
        • ^: The beginning of line
        • $: The end of line
    •  Supported by most programming languages. Very useful to handle
       various formats including DNA/protein sequences.

$ cat nums.tab
11.2     13.2
10.9     7.7
15.2     7.0
9.4      10.9
8.8      9.1
$ grep "7" nums.tab
10.9     7.7
15.2     7.0
$ grep -v "7" nums.tab
11.2     13.2
9.4      10.9
8.8      9.1
$ grep -e "^1" nums.tab
11.2     13.2
10.9     7.7
15.2     7.0
Exercise (3)
       • Use “TAIR10_chr1.fas”
        • A.thaliana chromosome 1 sequence
       • Select the annotation lines from the FASTA file.
        • FASTA format
          • A line starting with “>” is the annotation of a sequence.
          • The lines following the annotation contain the
              nucleotide or amino acid sequence.
        • To select the annotations, select lines starting with “>”
       • Count the number of nucleotides in (Multi) FASTA format
        • Lines containing nucleotides do not start with “>”
        • Number of nucleotides = number of characters
          • Use “wc” command
        • Note that the end of each line contains a “Return” (newline) character
>gi|29028877|gb|BT005883|U23535
ATGGAAAGCAAAGGAAGAATCCATCCATCTCATCATCATATGAGGCGTCCTCTTCCAGGTCCCGGTGGCTGTATAGCGCA
TCCGGAGACTTTCGGTAATCACGGTGCTATACCACCTTCTGCTGCTCAAGGTGTGTATCCTTCCTTCAACATGTTACCTC
CACCTGAAGTTATGGAGCAAAAGTTTGTGGCACAACACGGGGAATTACAGAGACTTGCTATAGAGAATCAGAGACTTGGT
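One possible way to approach this exercise, shown on a toy file rather than on TAIR10_chr1.fas. The file name and its contents are invented for illustration. Removing the newline characters with tr before calling wc is one way to handle the "Return" character mentioned above; other solutions are equally valid.

```shell
# Toy multi-FASTA file (hypothetical content, two sequences).
workdir=$(mktemp -d)
fasta="$workdir/toy.fas"
printf '>seq1 toy annotation\nATGGAAAGCA\nTCCGG\n>seq2 another toy\nCACCTGAAGT\n' > "$fasta"

# Select the annotation lines (lines starting with ">").
grep -e "^>" "$fasta"

# Count the sequences.
n_seqs=$(grep -c -e "^>" "$fasta")

# Count nucleotides: drop annotation lines, remove newlines, count characters.
# The final tr strips the padding spaces that some wc versions print.
n_bases=$(grep -v -e "^>" "$fasta" | tr -d '\n' | wc -c | tr -d ' ')

echo "sequences: $n_seqs, nucleotides: $n_bases"
```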
Let’s start NGS analysis!
 •   Dataset
     • TAIR 10 genome (A.thaliana)
     • 1/100 scale SOLiD RNA-Seq read sets
         • Filenames: tha_reads.csfasta & tha_reads_QV.qual
             • SRR038985: 41,117,124 reads, 1,439,099,340 bp
                 • http://trace.ddbj.nig.ac.jp/DRASearch/experiment?acc=SRX018529
         • Filenames: lyr_reads.csfasta & lyr_reads_QV.qual
             • SRR038987: 41,340,154 reads, 1,446,905,390 bp
                 • http://trace.ddbj.nig.ac.jp/DRASearch/experiment?acc=SRX018531
     • 1/10 scale Roche 454 Read Set (SRR020799)


$ grep -e "^>" tha_reads.csfasta | wc -l
411171
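The count printed by grep can be checked against the numbers quoted for SRR038985. The sketch below only repeats that arithmetic with shell integer division; the variable names are made up.

```shell
# Numbers quoted on the slide for SRR038985.
full_reads=41117124
full_bp=1439099340
scale=100

# A 1/100 sample should contain about this many reads (integer division).
expected_reads=$((full_reads / scale))

# Average read length in the full data set.
bp_per_read=$((full_bp / full_reads))

echo "expected reads: $expected_reads, bp per read: $bp_per_read"
```

The expected read count agrees with the grep result, and the read length agrees with the 35 colors per read in the csfasta file.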
Installing BWA
    •   In this lecture, because our computers do not have the “gcc”
        command to compile C programs, we skip this procedure.
    •   Download BWA
        •  http://bio-bwa.sourceforge.net/
        •  bwa-0.5.8c.tar.bz2 is on the USB drive. Copy the file.
    •   Extract the file
    •   Make the alias name “bwa” for the bwa-0.5.8c directory
    •   Move into the BWA directory
    •   Compile the source programs
#   $ curl -O http://switch.dl.sourceforge.net/project/bio-bwa/bwa-0.5.8c.tar.bz2
#   $ bzip2 -dc bwa-0.5.8c.tar.bz2 | tar xvf -
#   ...filenames...
#   $ ln -s bwa-0.5.8c bwa # Simplify the directory name
#   $ cd bwa
#   $ make
#   ...compile messages...
#   $ cd .. # back to working directory
Prepare A.thaliana Genome
•   Download chromosomes from TAIR site
    •   http://www.arabidopsis.org/
    •   Find URLs by selecting “Download” tab > Sequences >
        whole_chromosomes
    •   Each file contains one chromosome of the current version.
        •   TAIR10_chr1.fas, TAIR10_chr2.fas, TAIR10_chr3.fas,
            TAIR10_chr4.fas, TAIR10_chr5.fas, TAIR10_chrC.fas,
            TAIR10_chrM.fas
    •   Because of limited server and network capacity, we distribute
        these files on USB or via the web site for this lecture.
•   Concatenate these chromosomes, except chloroplast and
    mitochondria, into a single file

# We skip this process
#$ curl -O "ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/TAIR10_chr[1-5].fas"
## [1-5] means consecutive numbers from 1 to 5.
## We do not use chloroplast and mitochondria genomes.
# Instead of downloading, we use the files on the USB drive.
# The files are in your working directory.
# Check them with the command below.
$ ls TAIR10*
TAIR10_chr1.fas TAIR10_chr3.fas TAIR10_chr5.fas
TAIR10_chr2.fas TAIR10_chr4.fas
# Concatenate all chromosomes into a single file
$ cat TAIR10_chr1.fas TAIR10_chr2.fas TAIR10_chr3.fas \
  TAIR10_chr4.fas TAIR10_chr5.fas > TAIR10_chr_all.fas
# Check the result
$ grep -e "^>" TAIR10_chr_all.fas
>Chr1 CHROMOSOME dumped from ADB: Jun/20/09 14:53; last updated: 2009-02-02
>Chr2...
# You can find the annotations of 5 chromosomes
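The concatenate-and-check step can be rehearsed on toy files before touching the real chromosomes. Everything in this sketch is made up (the toy_chr*.fas names and the 8-base sequences); it only illustrates that cat preserves the file order and that grep finds one annotation line per chromosome.

```shell
# Create five toy "chromosome" files in a temporary directory.
workdir=$(mktemp -d)
cd "$workdir"
for i in 1 2 3 4 5; do
    printf '>Chr%s toy chromosome\nACGTACGT\n' "$i" > "toy_chr$i.fas"
done

# Concatenate them in order into a single file.
cat toy_chr1.fas toy_chr2.fas toy_chr3.fas toy_chr4.fas toy_chr5.fas > toy_chr_all.fas

# Check the result: one annotation line per chromosome, Chr1 first.
n_headers=$(grep -c -e "^>" toy_chr_all.fas)
first_header=$(grep -e "^>" toy_chr_all.fas | head -n 1)
echo "$n_headers annotation lines, first: $first_header"
```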
Run BWA
    •   Make an index of the genome sequence
        •  For SOLiD reads, the “-c” option is required.
        •  This process is needed just once as long as you use the same
           genome (it does not depend on the read sequences).
    •   Convert the colorspace reads into a BWA-specific format
        •  You don’t need this process for Illumina reads.
            •  Illumina sequencers produce FASTQ format files, and most
               alignment software can handle them directly.
    •   Map reads against the genome sequence
        •  If you use Illumina, the -I option (Illumina 1.3+ quality
           encoding) may be required. Check your Illumina version.
    •   The above two processes may take a long time. This lecture’s toy data
        is 1/100 scale. Real data will require more than two hours.

$   ./bwa/bwa index -c TAIR10_chr_all.fas
#   running messages. Takes more than 3 mins.
$   python csfasta2fastq.py --bwa tha_reads > tha_reads.bwa
$   ./bwa/bwa aln -c TAIR10_chr_all.fas tha_reads.bwa > tha_reads.sai
#   messages...about 1min. Alignment phase.
Run BWA (continued)
  •   Convert the mapping result into SAM format.
      •   For paired-end experiments, use “sampe” instead of “samse”
          to put mate pair information into the SAM file.
  •   That’s all! Check the contents of the SAM file with the less command.
      •   How many reads can be mapped against the genome?



$ ./bwa/bwa samse TAIR10_chr_all.fas tha_reads.sai tha_reads.bwa > tha_reads.sam
# messages. Generate summary of alignment.
# If you have paired-end reads, you can use sampe instead of samse.

$ less tha_reads.sam
# Press "q" to quit the less command.
# Press "space" for the next page.
Inside of SAM file
(Each record is a single tab-separated line; long records are wrapped here.)

Chromosome (mapped database) information:
@SQ     SN:Chr1 LN:30427671
@SQ     SN:Chr2 LN:19698289
@SQ     SN:Chr3 LN:23459830
@SQ     SN:Chr4 LN:18585056
@SQ     SN:Chr5 LN:26975502

Used program and its variables:
@PG     ID:bwa PN:bwa VN:0.5.9-r16

Mapped read in forward direction on Chr5:
SRR038985.100   0       Chr5    22828962        37       33M     *
0       0       GCCGGTGATGTAATCAAAATATTTGCTACTCTT        WZYTWWTW]
YVUOW]OEKNUUX]PJSRY][63       XT:A:U CM:i:0 X0:i:1 X1:i:0 XM:i:
1 XO:i:0 XG:i:0 MD:Z:33

SRR038985.200   0       Chr3    14197678        0        33M     *
0       0       ACCTGGTTGATCCTGCCAGTAGTCATATGCTTG        X]]KN]]
YWUX]XIKYRCHSUYX[[SNQJL[MO        XT:A:R CM:i:0 X0:i:2 X1:i:0
XM:i:0 XO:i:0 XG:i:0 MD:Z:33 XA:Z:Chr2,+3707,33M,0;

Unmapped read:
SRR038985.300   4       *       0        0      *        *       0
0       AAACTGCGGGGTCTCACTTTTTTGGGTTTGGGGT      124,/08/5&6-&,(;/4+
%7,+5.:1',*;8:&
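The second column of each record is the FLAG field: in the SAM format, bit 4 marks an unmapped read and bit 16 a read mapped to the reverse strand. The sketch below decodes these two bits with awk on a toy SAM file. The first three records reuse the read names, flags and positions from the slide with SEQ and QUAL replaced by "*"; the fourth record (toy.400) is invented to show a reverse-strand flag.

```shell
# Toy SAM file: one header line and four shortened records.
workdir=$(mktemp -d)
sam="$workdir/toy.sam"
{
    printf '@SQ\tSN:Chr5\tLN:26975502\n'
    printf 'SRR038985.100\t0\tChr5\t22828962\t37\t33M\t*\t0\t0\t*\t*\n'
    printf 'SRR038985.200\t0\tChr3\t14197678\t0\t33M\t*\t0\t0\t*\t*\n'
    printf 'SRR038985.300\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*\n'
    printf 'toy.400\t16\tChr1\t100\t37\t33M\t*\t0\t0\t*\t*\n'   # hypothetical read
} > "$sam"

# Decode the FLAG column with integer arithmetic (portable awk has no bitwise and).
summary=$(awk -F '\t' '
    /^@/ { next }                                   # skip header lines
    {
        flag = $2
        if (int(flag / 4) % 2 == 1)       status = "unmapped"
        else if (int(flag / 16) % 2 == 1) status = "reverse"
        else                              status = "forward"
        print $1, status, $3, $4                    # name, status, chromosome, position
    }' "$sam")
echo "$summary"
```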
Exercise (4)
•   Run BWA
•   Compare the file size of the csfasta + qual files with that of the generated SAM file.
    •   Which is larger? How much disk space do we need for the analysis?
•   Check the details of the SAM file
    •   Format details are described in
        http://samtools.sourceforge.net/SAM1.pdf
•   How many reads are mapped onto chromosomes?
    •   Select lines containing “Chr” # use grep
        •   Header lines (starting with “@”) also contain “Chr”. Exclude them.
    •   Then, count the number of lines # use wc
•   Calculate the ratio of mapped reads to total reads.

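A sketch of the counting steps on a toy SAM file (all read names and positions are invented). It also shows why the "@" header lines have to be excluded: a plain grep for "Chr" counts the @SQ line as well. Using awk for the division is just one option, since the shell itself only does integer arithmetic.

```shell
# Toy SAM file: one header line, three mapped reads, one unmapped read.
workdir=$(mktemp -d)
sam="$workdir/toy.sam"
{
    printf '@SQ\tSN:Chr1\tLN:30427671\n'
    printf 'read1\t0\tChr1\t100\t37\t33M\t*\t0\t0\t*\t*\n'
    printf 'read2\t16\tChr1\t200\t37\t33M\t*\t0\t0\t*\t*\n'
    printf 'read3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*\n'
    printf 'read4\t0\tChr1\t300\t0\t33M\t*\t0\t0\t*\t*\n'
} > "$sam"

# Total reads = all lines that are not header lines.
total=$(grep -v -e "^@" "$sam" | wc -l | tr -d ' ')
# Mapped reads = non-header lines containing "Chr".
mapped=$(grep -v -e "^@" "$sam" | grep -c "Chr")
# Naive count without excluding headers, for comparison.
naive=$(grep -c "Chr" "$sam")
# Ratio of mapped reads to total reads, with two decimals.
ratio=$(awk -v m="$mapped" -v t="$total" 'BEGIN { printf "%.2f", m / t }')

echo "total: $total, mapped: $mapped (naive: $naive), ratio: $ratio"
```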
Problems
• The mapped read ratio may be much lower than expected.
 • Genome quality is (probably) high.
• Various problems
 • Wet problems
   • Protocols and reagents
   • Mitochondria and chloroplast.
 • Dry problems
   • We used all sequences. We may need to remove low
       quality reads.
   • Sequence quality of the 3’-end is low. We might trim these
       sequences.
   • We did not take reads on splice junctions into account.
   • We did not change any parameters in BWA. The
       parameters might not be suitable for our reads.
 • No single approach gives good results for every dataset.
• Note!!! The mapped ratio of current RNA-Seq reads is (extremely)
   higher than this result.
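As an illustration of the 3’-end trimming idea, the sketch below removes a fixed number of bases from the end of every read in a toy base-space FASTQ file. This is only a minimal sketch under several assumptions: the file, the read and the trim length are invented, each record is assumed to occupy exactly four lines, and the colorspace reads used in this lecture would need a colorspace-aware tool instead.

```shell
# Toy FASTQ file with one 10-base read whose last two qualities are low ("#").
workdir=$(mktemp -d)
fastq="$workdir/toy.fastq"
printf '@read1\nACGTACGTAC\n+\nIIIIIIII##\n' > "$fastq"
trim=2   # hypothetical number of 3'-end bases to remove

# Lines 2 and 4 of every 4-line record hold sequence and quality: cut both.
awk -v n="$trim" '
    NR % 4 == 2 || NR % 4 == 0 { print substr($0, 1, length($0) - n); next }
    { print }' "$fastq" > "$workdir/trimmed.fastq"

trimmed_seq=$(sed -n '2p' "$workdir/trimmed.fastq")
trimmed_qual=$(sed -n '4p' "$workdir/trimmed.fastq")
echo "$trimmed_seq $trimmed_qual"
```

Sequence and quality lines are trimmed together so that they keep the same length, which FASTQ parsers generally expect.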

20110524zurichngs 1st pub

  • 1. Next Generation Sequencing for Model and Non-Model Organism. 1st day Jun Sese and Kentaro Shimizu sesejun@cs.titech.ac.jp Ph.D course @ Univ. of Zurich 25/05/2011
  • 2. Today’s Menu • Lecture • Overview of next generation sequencer’s analysis • Mapping: Sequence alignment • Introduction to UNIX to handle NGS data • Exercise • UNIX commands • Mapping real short reads against genomes • Compute statistics of the mapped reads 2
  • 3. Various Types of Sequencers • Roche 454, IonTorrent • Roche: about 400bp, Ion Torrent: about 200bp • Suitable for denovo sequencing • Illumina HiSeq • Widely-used new generation sequencer • 100bpx2 up to 600 Gb/run (HiSeq 2000) • MiSeq uses almost same technology except number of reads • ABI SOLiD • 75bp, 75bp+35bp or 60bpx2 up to 300 Gb/run (5500xl SOLiD) • Color Space • Pacific Biosciences PacBio RS • Average > 500 bp • Sequence quality is not high. 3
  • 4. Sequence cost becomes low dramatically Lincoln Stein, Genome Biology, vol. 11(5), 2010 4
  • 5. How large is it? • Generated file size is more than 300GB/run • We can read data from hard disks with 100 MB/sec • 300GB / 100MB/sec = 300,000MB / 100MB/sec = 3000 sec = 50min • To just read the data from HDD, computer takes 50min! • Require efficient calculation 5
  • 6. Applications of DNA Sequencing • NGS just read enormous short sequences, but has many biological applications. • Genetic variation • Gene regulations • RNA-seq • ChIP-seq • Epigenetics • Population genetics Science 2007 6
  • 7. Sequencerʼs Output Genome Sequence Mapping Program Mapping Result Visualization Further Analysis SNPs, RNA-Seq,... 7
  • 8. Major Pipelines of NGS • Most of the applications use the similar procedure. Genetic variation RNA-Seq ChIP-Seq Find originated Map Map Map region (Alignment) Check regulatory Filter SNP call Measure expressions regions Analysis Find difference Same as microarray Same as ChIP- Chip analysis Most of them require whole genome sequence to map reads. 8
  • 9. Mapping (Pairwise Alignment) • Find the place from which each read comes • BLAST is one of the very famous alignment software. • Few NGS analysis use BLAST/BLAT because of slow alignment speed. • BWA and Bowtie have been used to map short reads. Reads ATATGCGA ATATGCGA Reference GATGCTAAGCATATGCGAGGCATGCCATATGGATG We may find multiple mapped places. Score matrix (distance) defines which map is better. Reads ATATGCGA ATATGCGA ATATG-CGA x Reference GATGCTAAGCAAATGCGAGGCATGCCATATGGCGA 9
  • 10. 10
  • 11. For non-model organism Genetic Variation Chip-Seq RNA-Seq Read normalized Read genome Read genome library Genome/Gene Sequence Genome Genome RNA assembly assembly Assembly Map onto Map new reads Map ChIP-Seq related species Map Count genome reads assembled reads Map new RNA-Seq reads Check regulatory Filter SNP call regions Measure expressions Similar to Analysis Find Difference Same as microarray ChIP-Chip Most cases require genome assembly, which is experimentally and computationally high cost 11
  • 12. Very Short History of Pairwise Alignment Programs • More than 100 alignment programs are listed in Wikipedia!!! • http://en.wikipedia.org/wiki/Sequence_alignment_software • 1 sequence vs 1 sequence • Ssearch, FASTA [Lipman and Pearson. 1985] • 1 sequence vs Whole genes • BLAST [Altschul et al. 1990] • Thousands of sequences vs Whole genes or Whole genomes • BLAT [Kent. 2002] • Billions of short sequences vs Whole genome • BWA, Bowtie, SHRiMP, etc... • Most modern mappers use FM-index [Ferragina and Manzini. 2000] with Burrows-Wheeler transform [Burrows and Wheeler. 1994]. 12
  • 13. Why so many alignment programs have been developed? • Computer scientist seems that alignment is easy task. • Both indexing and dynamic programming used in sequence alignment are basic algorithm. • Good problem for home work • A little performance tuning can accelerates execution speed dramatically • In reality, alignment problem is very hard to solve. • Mutations, insertions, deletions... • Each sequencer has unique bias. • Sequence length. Homo-polymer in Roche 454... • Many heuristics exist in biologist! • GT-AG rule on splice site, but not always... • That is, problem definition is ambiguous! 13
  • 14. Alignment performance varies • Aligned 12million single end reads against human genome sequences (hg18) • Algorithm and implementation difference appear in total processed time • In most program, used memory depends on genome size. • Parameter settings reflect numbers of mapped reads. • Authors did not mention about them. • In real experiments, we have to change parameters to use alignment program. Bao et al. J Hum Genet, 2011 14
  • 15. Sequencerʼs Output Sequence Format Genome Sequence Mapping Program BWA, Bowtie, etc. Mapping Result Visualization 15
  • 16. Sequence File Format (1) • FASTA + Quality File • Used by Roche 454
    Sequence file:
    >1ST_SEQ length=67 xy=1264_0441 region=1 run=R_2010_07_07_16_23_16_
    GCGTTGTGTATGTCTCCTTTGGTATGTCAGGTTTCGTCAGAAGCTTCTATCAAACGGCGC
    ACAGTGA
    >2ND_SEQ length=88 xy=1264_0564 region=1 run=R_2010_07_07_16_23_16_
    TCGGCCCTATCCGAGAAGGCGTGGTGTATCTCTCTTCTGGTATGCCACGTTACGCAGCAG
    CTTCTTCCCAAGACACAGAGCGAGTAAG
    Quality file:
    >1ST_SEQ length=67 xy=1264_0441 region=1 run=R_2010_07_07_16_23_16_
    37 35 35 35 35 35 37 37 37 37 37 39 39 37 36 35 35 36 37 37 37 37 35 35 32 28 27 27 27 27 29 23 21 21 14 14 12 18 19 19 19 19 19 19 16 16 17 20 22 20 12 12 12 12 11 17 17 17 16 19 22 23 24 21 21 21 18
    >2ND_SEQ length=88 xy=1264_0564 region=1 run=R_2010_07_07_16_23_16_
    29 30 19 19 19 20 19 24 28 27 27 27 27 27 30 19 19 20 20 20 24 33 33 33 33 33 33 33 35 35 37 37 30 30 30 30 32 32 32 32 35 32 32 32 32 33 33 33 33 20 20 20 23 27 30 30 31 31 27 27 27 27 28 23 24 24 23 23 23 24 24 21 17 19 19 18 27 18 17 16 16 16 17 13 18 17 16 12 16
  • 17. Sequence File Format (2) • FASTQ • Used by Illumina sequencers • Sequence database sites (SRA (Short Read Archive), ENA (European Nucleotide Archive), DRA (DDBJ Sequence Read Archive)) provide sequences in this format. • De facto standard • CSFasta + Quality file • Only used by SOLiD sequencers • Similar to a FASTA file except that sequences are described in color space.
    CSFasta file:
    >SRR038985.100 VAB_AT1deg1_51_269_F3
    T10303011231130321000333001323122221
    >SRR038985.200 VAB_AT1deg1_78_430_F3
    T03102101012320213012132121333132011
    Quality file:
    >SRR038985.100 VAB_AT1deg1_51_269_F3
    0 20 23 21 26 20 21 23 21 20 24 25 26 20 23 19 17 27 26 10 16 16 19 23 19 26 28 9 22 18 21 25 25 23 2 20
    >SRR038985.200 VAB_AT1deg1_78_430_F3
    0 7 19 26 26 24 8 27 29 23 23 21 21 24 26 19 11 21 25 14 10 19 21 21 25 20 28 20 20 15 23 8 25 23 11 25 17
  • 18. Color Space • Format unique to ABI SOLiD. • Each number represents a pair of adjacent bases • Each nucleotide is read twice • A single spot detection miss may change the downstream sequence. • Some software does not support this format. • Examples: T10303011 decodes to TGGCCGGTG; T10203011 decodes to TGGAATTGT • [Figure 1: SOLiD Color Space Code, a 1st Base × 2nd Base color table. Source: ABI White Paper, "Color Space Analysis in the SOLiD System: the Theory, Advantages and Solutions"]
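The decoding rule behind the two examples on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the SOLiD software itself: the function name `decode_colorspace` and the table name `TRANSITIONS` are made up for this sketch, and the table assumes the standard two-base code (0 keeps the base, 1 swaps A/C and G/T, 2 swaps A/G and C/T, 3 swaps A/T and C/G). The leading base is kept in the output, as on the slide.

```python
# Hypothetical helper: decode a SOLiD color-space read into nucleotides.
# TRANSITIONS[base][color] gives the next base (assumed standard 2-base code).
TRANSITIONS = {
    "A": "ACGT",
    "C": "CATG",
    "G": "GTAC",
    "T": "TGCA",
}


def decode_colorspace(read):
    """Decode e.g. 'T10303011' (leading base + color digits) into bases."""
    base = read[0]
    bases = [base]
    for color in read[1:]:
        # Each color depends on the previous base, so one wrong color
        # changes every base that follows it.
        base = TRANSITIONS[base][int(color)]
        bases.append(base)
    return "".join(bases)


print(decode_colorspace("T10303011"))  # TGGCCGGTG
print(decode_colorspace("T10203011"))  # TGGAATTGT
```

The two reads differ in a single color (3 versus 2), yet the decoded sequences differ at several bases, which is why a spot detection miss affects the downstream sequence.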
  • 19. FASTQ Format • One read consists of four lines: name (starts with “@”), sequence, “+”, and quality score
    @SRR013343.216 :3:1:837:436
    GCGTGGTATAGGAGGCGGAACGGGCGGTTGGCGGTT
    +
    I6IIII*II*II+I:+&I)I'&%&%,+0>+'I''$G
    @SRR013343.217 :3:1:974:526
    GCGCATGAGTGGCTTGACTCGTATGCGGATTCCTTC
    +
    I@II6I<I/III;II+)I*II*DI*I?')+*+8/%8
    @SRR013343.218 :3:1:755:341
    GTGGAGTAGGTTAGTTGCGGATCGTATGCCGTCTTC
    +
    IIIIIIIIIIAIIIIII<II6?II3/AD26=:-9I'
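As an illustration of the four-line record layout, the following Python sketch reads FASTQ records from any iterable of lines. The function name `read_fastq` is hypothetical, and the sketch assumes well-formed input in which every sequence and quality string sits on a single line, as in the reads shown on this slide.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) tuples from FASTQ-formatted lines.

    Assumes each record is exactly four lines: '@name', sequence, '+', quality.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        sequence = next(it).rstrip("\n")
        next(it)  # the '+' separator line
        quality = next(it).rstrip("\n")
        yield header[1:], sequence, quality


example = [
    "@SRR013343.216 :3:1:837:436\n",
    "GCGTGGTATAGGAGGCGGAACGGGCGGTTGGCGGTT\n",
    "+\n",
    "I6IIII*II*II+I:+&I)I'&%&%,+0>+'I''$G\n",
]
for name, sequence, quality in read_fastq(example):
    print(name, len(sequence), len(quality))
```

For a real file, pass an open file object instead of the list, for example `read_fastq(open("reads.fastq"))` with a file name of your own.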
  • 20. PHRED quality encoding • $Q = -10 \log_{10} P \Leftrightarrow P = 10^{-Q/10}$, where P is the probability that the base call is wrong • Q=20: 99% accuracy, Q=30: 99.9% accuracy • The quality value scale is slightly different between PHRED and Illumina/SOLiD results • Encoded in FASTQ and SAM as a quality string in which Q = ASCII value - 33 • For Illumina 1.3+, the encoding was changed to Q = ASCII value - 64. • ASCII codes:
    !  33    '  39    -  45    3  51    9  57    ?  63    ...
    "  34    (  40    .  46    4  52    :  58    @  64    ...
    #  35    )  41    /  47    5  53    ;  59    A  65    ...
    $  36    *  42    0  48    6  54    <  60    B  66    ...
    %  37    +  43    1  49    7  55    =  61    C  67    ...
    &  38    ,  44    2  50    8  56    >  62    D  68    ...
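The relation between a quality character, its PHRED score Q and the error probability P can be checked with a short Python sketch. The function names `phred_scores` and `error_probability` are made up for this illustration; the `offset` argument is 33 for the standard encoding and 64 for the Illumina 1.3+ encoding mentioned on the slide.

```python
def phred_scores(quality, offset=33):
    """Convert a quality string into a list of PHRED scores (Q values)."""
    return [ord(char) - offset for char in quality]


def error_probability(q):
    """Probability that a base call with PHRED score q is wrong."""
    return 10 ** (-q / 10)


scores = phred_scores("I6II*")
print(scores)  # [40, 21, 40, 40, 9]
for q in (20, 30):
    print(q, error_probability(q))
```

Decoding a file with the wrong offset shifts every score by 31, so it is worth checking which encoding a data set uses before filtering reads by quality.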
  • 21. [Workflow] Sequencer's Output (Sequence Format) + Genome Sequence → Mapping Program (BWA, Bowtie, etc.) → Mapping Result (Output Format) → Visualization
  • 22. SAM Format • Sequence Alignment / Map format • Simple tab-delimited text file • Standardized alignment output format • Modern alignment tools support this format • BAM format is the binary version of SAM format.
    @HD VN:1.0
    @SQ SN:chr20 LN:62435964
    @RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
    @RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
    read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195 AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< NM:i:1 RG:Z:L1
    read_28701_28881_323b 147 chr20 28834 30 35M = 28701 -168 ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA <<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<< MF:i:18 RG:Z:L2
  • 23. Overview <QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> [<TAG>:<VTYPE>:<VALUE> [...]] read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195 AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< NM:i:1 RG:Z:L1 23
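To make the column layout concrete, here is a small Python sketch that splits one alignment line into the eleven mandatory fields named on this slide. The function name `parse_sam_line` and the `TAGS` key are hypothetical, and the sketch assumes the fields are tab-separated, as the SAM format requires.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INTEGER_FIELDS = ("FLAG", "POS", "MAPQ", "MPOS", "ISIZE")


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of named fields."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for field in INTEGER_FIELDS:
        record[field] = int(record[field])
    record["TAGS"] = columns[11:]  # optional TAG:VTYPE:VALUE entries
    return record


line = "\t".join([
    "read_28833_29006_6945", "99", "chr20", "28833", "20", "10M1D25M",
    "=", "28993", "195", "AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG",
    "<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<", "NM:i:1", "RG:Z:L1",
])
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], record["CIGAR"], record["TAGS"])
```

Header lines, which start with “@”, are not alignment records and should be skipped before calling a parser like this one.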
  • 24. Flag • Bitwise notation: computer friendly (but not human friendly :) • 16 = 0x0010: mapped to the reverse strand • 4 = 0x0004: unmapped • 0 = 0x0000: mapped to the forward strand
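Because the flag is a sum of bits, it is usually tested with a bitwise AND rather than compared with a single number. The Python sketch below checks the two bits listed on this slide; the function names are made up for illustration, and the flags 99 and 147 are taken from the SAM example earlier in the deck.

```python
UNMAPPED = 0x4         # bit set when the read is unmapped
REVERSE_STRAND = 0x10  # bit set when the read maps to the reverse strand


def is_unmapped(flag):
    """True if the unmapped bit (0x4) is set in the SAM flag."""
    return bool(flag & UNMAPPED)


def strand(flag):
    """Return '-' if the reverse-strand bit (0x10) is set, else '+'."""
    return "-" if flag & REVERSE_STRAND else "+"


for flag in (0, 4, 16, 99, 147):
    print(flag, is_unmapped(flag), strand(flag))
```

A flag such as 99 has several other bits set (paired-end information), which is why comparing the flag with 0 or 16 alone would misclassify it.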
  • 25. CIGAR • Shows the alignment result compactly • 8M9D7M • 8 bp match, 9 bp deletion (reference bases that are absent from the read), and then 7 bp match • The 4th column “POS” indicates the reference position where the read starts.
    read               CATATGCG---------ATATGGA
                       ||||||||         |||||||
    reference GATGCTAAGCATATGCGAGGCATGCCATATGGATG
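A CIGAR string can be split into (length, operation) pairs with one regular expression. The sketch below uses the string 10M1D25M from the SAM example earlier in the deck; the function names are hypothetical, and the sets of operations that consume reference or read bases follow the SAM specification.

```python
import re

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_READ = set("MIS=X")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_PATTERN.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def read_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


print(parse_cigar("10M1D25M"))     # [(10, 'M'), (1, 'D'), (25, 'M')]
print(reference_span("10M1D25M"))  # 36
print(read_length("10M1D25M"))     # 35
```

The read length of 35 agrees with the 35-base sequence of that SAM record, while the alignment covers 36 reference bases because of the one-base deletion.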
  • 26. Summary • No standard tools for analyzing NGS data • Q&A sites are good resources • SeqAnswers.com • biostar.stackexchange.com • Many algorithms and software packages have been developed. • See http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html • Most of them work with the UNIX command line • Few analysis tools have a GUI • Galaxy (Free, requires server setup) • BioScope (Only available with SOLiD sequencer)
  • 27. Unix Commands • [Workflow] Sequencer's Output (Sequence Format) + Genome Sequence → Mapping Program (BWA, Bowtie, etc.) → Mapping Result (Output Format) → Visualization • These steps are performed with UNIX commands
  • 28. Preparation • An NGS procedure generates many files. • Even in this lecture, we will generate 50 files. • We use the directory generated by extracting “ngslec.zip.” • Extract the zip file in your home directory. • To move to the directory, type the following commands in Terminal
    $ cd ngslec
    $ pwd
    /Users/YOUR_DIRECTORY/ngslec/
  • 29. Use “Terminal” • The Operating System (OS) handles everything that happens on a computer. • Reading files, mouse clicks, displaying characters, ... • We can use the OS functions through the application “Terminal” on a UNIX OS • Applications > Utilities > Terminal • UNIX: Linux, IBM AIX, Sun OS, Mac OS X • This excludes Windows and Mac OS 9 or earlier • In the terminal, we can use shell commands. • Applications consist of a sequence of shell commands. • A complicated program is made of a set of tiny programs. • We start by learning how to use tiny programs, and then how to combine them. • [Figure: layers Kernel, Shell, Terminal]
  • 30. Command and Arguments $ rm -r arg1 arg2 • (A) Command (Order): run a command called “rm” • (B), (C) and (D) Arguments: a space character separates the command from its arguments and the arguments from each other • (B) Arguments that change sub functions of the command are called “Options.” Options start with “-” or “--” • (C) First argument. Options are not counted when numbering arguments. • (D) Second argument.
  • 31. Example: date command • Input “date” + [Return] to show current time • With option “-u”, “date” command shows Coordinated Universal time. • If you misspell command, terminal says “command not found.” • Commands (and file names) are case sensitive on UNIX except Mac OS X. 31
  • 32. File System • You may always use this system through “Finder.” In this lecture, we will use this from “Terminal.” • Tree structure rooted by “/” • USB memories and DVDs are also managed through file system. / usr Volume bin lib pics USB zurich 32
  • 33. Directories and Files • Current directory • Directory in which you are working • You can check it with the “pwd” command. • Home directory • Root (top) of your personal directory • Denoted by “~” or “$HOME” • When your current directory is “/Users/sesejun” • pwd command shows /Users/sesejun • /usr/lib indicates * • usr/lib indicates ** • “.” is equal to “/Users/sesejun” • “..” is equal to “/Users” • ../../usr/lib is equal to “/usr/lib” • [Figure: directory tree. “/” contains usr (with bin and lib, marked *) and Users; Users contains sesejun, which contains usr/lib (marked **)]
  • 34. cd: Change Directory • cd destination-dir • move your current directory to destination-dir • When you omit (unset) arguments, move to home dir. jsmbp:~ sesejun$ pwd /Users/sesejun jsmbp:~ sesejun$ cd /usr/ jsmbp:/usr sesejun$ pwd /usr jsmbp:/usr sesejun$ cd lib jsmbp:/usr/lib sesejun$ pwd /usr/lib jsmbp:/usr/lib sesejun$ cd /usr/bin/ jsmbp:/usr/bin sesejun$ pwd /usr/bin jsmbp:/usr/bin sesejun$ cd jsmbp:~ sesejun$ pwd /Users/sesejun jsmbp:~ sesejun$ 34
  • 35. ls (LiSt): Show List of Files • Show current directory files when setting no arguments • Important options • -a: Show all files (Files starting from “.” do not appear when we do not set this option) • -l: Show detail information of files • -h: Show file size in human friendly format (usually used with option “-l”) • $ ls Desktop Music largefile $ ls -l drwx------+ 8 sesejun staff 272 5 16 00:09 Desktop drwx------+ 3 sesejun staff 102 10 27 2010 Movies -rw-r--r-- 1 sesejun staff 4181139 5 16 08:20 largefile $ ls -lh drwx------+ 8 sesejun staff 272B 5 16 00:09 Desktop drwx------+ 3 sesejun staff 102B 10 27 2010 Movies -rw-r--r-- 1 sesejun staff 4.0M 5 16 08:20 largefile 35
  • 36. cp: Copy Files • cp [options] source-file ... directory • cp [options] source-file new-file • Options: • Copy text1.txt to text2.txt $ cp text1.txt text2.txt • Copy text1.txt and text2.txt in “tmp” directory $ cp text1.txt text2.txt tmp/ $ ls tmp text1.txt text2.txt 36
  • 37. mv: Move files • Also used to change file names • mv [options] source-file ... directory • mv [options] old-path new-path • Change filename text1.txt to text2.txt $ mv text1.txt text2.txt • Move text1.txt and text2.txt into tmp directory $ mv text1.txt text2.txt tmp/ $ ls tmp $ ls tmp/ text1.txt text2.txt 37
  • 38. rm (ReMove): Delete files • Options: • -r: Remove all the files in directory • -i: Confirm before removing each file. • Delete text1.txt and text2.txt jsmbp:~ sesejun$ rm text1.txt text2.txt • Delete all the files within tmp directory • Note: These files are “really” removed. They never go to “Trash.” We cannot use undo. jsmbp:~/test sesejun$ ls tmp jsmbp:~/test sesejun$ ls tmp/ text1.txt text2.txt jsmbp:~/test sesejun$ rm -r tmp/ jsmbp:~/test sesejun$ ls jsmbp:~/test sesejun$ 38
  • 39. Exercise (1) • Run commands • Run date and date -u, and check the results. • Run command “cal” What is the result? • Change directory • Run examples in page “cd” • Check make and remove directory • Open your login name directory in Finder. • Move your home directory in Terminal. • Just open terminal. • Run ls and compare the result with Finder result. 39
  • 40. Note • Commands and messages in Terminal are written in “Courier Font” • Lines starting with “#” are comment lines. You do not need to type them in Terminal. • Lines whose last character is “\” continue on the next line. Enter such lines as one line. • You can run commands with “cut and paste.” • When you do, the double quotation (“) character may cause trouble because slides use a different character type. Retyping the double quotation solves the problem. • Bar (|) can be input by Alt + 7. • In Terminal, you can show the history of your commands by pressing the up cursor key. • The “Tab” key may complete your command or filename.
  • 41. cat (conCATenate) • cat [options] file ... • Original usage is file concatenation. • Details are shown later • Sometimes this command is used to show the contents of a file. • Options: • -n: show line numbers
    $ cat text1.txt
    How are you ?
    $ cat text2.txt
    Hello!
    Thank you!
    Good Bye!
    $ cat text1.txt text2.txt
    How are you ?
    Hello!
    Thank you!
    Good Bye!
    $ cat -n text2.txt
    1 Hello!
    2 Thank you!
    3 Good Bye!
  • 42. head, tail (Show first or last part of file) • head [-n num] file ... • Show first 10 lines • -n num: show first num lines • tail [-n num] file ... • Show last 10 lines • -n num: show last num lines • By setting +num, you can see the file from the num-th line to the last line. • Because NGS files are large, these commands are frequently used. • Most editors cannot open NGS files.
    $ cat text2.txt
    Hello!
    Thank you!
    Good Bye!
    $ head -n2 text2.txt
    Hello!
    Thank you!
    $ tail -n2 text2.txt
    Thank you!
    Good Bye!
    $ tail -n+3 text2.txt
    Good Bye!
  • 43. less • less <filename> • Show files interactively • Space: Next page • ‘b’: Previous page • ‘q’: Quit • ‘/’ + [word]: search [word] and go to first matched place. The word is highlighted. • To move next place, press ‘n.’ • Frequently used to check contents of (large) file like FastA file 43
  • 44. cut -Show columns- • cut [options] file ... • Show selected columns • Options: • -f <list of nums>: Show <list of nums>-th columns. We can use the -d option to set the separator between columns. The default separator is “\t” (Tab). • -c <list of nums>: Show <list of nums>-th characters. • Examples of “list of nums” • 1,3,5: 1st, 3rd and 5th columns • 1-5: From 1st to 5th columns • 1,3,5-: 1st, 3rd and from 5th to last columns. • This command is also frequently used to handle NGS files.
  • 45. Sort • sort [options] file ... • Arrange file contents in alphabetical order • Options: • -r: reverse order • -n: order by numerical value • -k POS: order according to the POS-th column. By default, columns are separated by whitespace. We can change the delimiter with the “-t” option.
    $ cat text2.txt
    Hello!
    Thank you!
    Good bye!
    $ sort text2.txt
    Good bye!
    Hello!
    Thank you!
    $ sort -r text2.txt
    Thank you!
    Hello!
    Good bye!
  • 46. Examples of cut and sort
    $ cat nums.tab
    11.2 13.2
    10.9 7.7
    15.2 7.0
    9.4 10.9
    8.8 9.1
    $ cut -f1 nums.tab
    11.2
    10.9
    15.2
    9.4
    8.8
    $ cut -f1 -d . nums.tab
    11
    10
    15
    9
    8
    $ cut -c1-3 nums.tab
    11.
    10.
    15.
    9.4
    8.8
    $ sort -n nums.tab
    8.8 9.1
    9.4 10.9
    10.9 7.7
    11.2 13.2
    15.2 7.0
    $ sort -n -k2 nums.tab
    15.2 7.0
    10.9 7.7
    8.8 9.1
    9.4 10.9
    11.2 13.2
    $ sort nums.tab
    10.9 7.7
    11.2 13.2
    15.2 7.0
    8.8 9.1
    9.4 10.9
  • 47. Exercise (2) • Create two files “text1.txt” and “text2.txt” • Run the cat, head and tail commands according to the examples. • Create the file “nums.tab” • The character between numbers (columns) is a “tab.” • Test the cut and sort commands according to the examples.
  • 48. Redirect (>) • command > file • Save command result into “file.” • Overwrite contents of file. • The following command save the result of “sort -n nums.tab” into “nums_sort.tab” • command >> file • Add command result to “file.” $ sort -n nums.tab > nums_sort.tab $ sort -n nums.tab >> nums_sort.tab 48
  • 49. Pipe (|) • command1 | command2 • Run command2 with command1’s result as its input
    $ sort -n nums.tab
    8.8 9.1
    9.4 10.9
    10.9 7.7
    11.2 13.2
    15.2 7.0
    $ sort -n nums.tab | cat -n
    1 8.8 9.1
    2 9.4 10.9
    3 10.9 7.7
    4 11.2 13.2
    5 15.2 7.0
    $ sort -n nums.tab | cat -n | head -n2
    1 8.8 9.1
    2 9.4 10.9
    “$ sort -n nums.tab | cat -n” produces the same result as
    $ sort -n nums.tab > nums_sort.tab
    $ cat -n nums_sort.tab
  • 50. Commands used with pipe • sort, cut • less • wc [options] file... • Word Count • Show number of lines, words and characters. $ sort nums.tab | less $ wc nums.tab 5 10 45 nums.tab #lines #words #chrs $ wc -l nums.tab 5 nums.tab Show only number of lines 50
  • 51. gzip and bzip2 • Source codes and sample datasets are provided as tar and gzip/bzip2 files. • For a single file, only gzip/bzip2 is used. • “tar” can generate a single file containing files and folders. • gzip/bzip2 can compress a file • gzip is the most frequently used. A bzip2 file is usually smaller than the gzip file.
    $ ls -lh chr21.fa.gz
    -rw-r--r-- 1 sesejun sesejun 12M May 20 15:09 chr21.fa.gz
    $ gzip -d chr21.fa.gz
    # Decompress chr21.fa.gz and generate chr21.fa.
    $ ls -lh chr21.fa
    -rw-r--r-- 1 sesejun sesejun 47M May 20 15:09 chr21.fa
    $ bzip2 chr21.fa
    # Compress
    $ ls -lh chr21.fa.bz2
    -rw-r--r-- 1 sesejun sesejun 9.7M May 20 15:09 chr21.fa.bz2
  • 52. tar (Tape ARchive) • Generate a single file containing files and folders. • Frequently used with gzip/bzip2 • Remember the following idioms! • We will use this to install programs to analyze NGS data.
    with gzip
    1. $ gzip -dc file.tar.gz | tar xvf -
    2. $ tar zxvf file.tar.gz
    with bzip2
    1. $ bzip2 -dc file.tar.bz2 | tar xvf -
    2. $ tar jxvf file.tar.bz2
    # GNU tar and BSD tar accept the j option; if your tar does not, use idiom 1.
  • 53. grep (g/re/p) • grep [options] pattern file ... • Print lines matching the pattern • Options: • -v: print non-matching lines • -e <regular expression>: select lines with a regular expression • Regular expression • Specific pattern to express a character sequence • ^: The beginning of line • $: The end of line • Supported by most programming languages. Very useful to handle various formats including DNA/Protein sequences.
    $ cat nums.tab
    11.2 13.2
    10.9 7.7
    15.2 7.0
    9.4 10.9
    8.8 9.1
    $ grep "7" nums.tab
    10.9 7.7
    15.2 7.0
    $ grep -v "7" nums.tab
    11.2 13.2
    9.4 10.9
    8.8 9.1
    $ grep -e "^1" nums.tab
    11.2 13.2
    10.9 7.7
    15.2 7.0
  • 54. Exercise (3) • Use “TAIR10_chr1.fas” • A.thaliana chromosome 1 sequence • Select the annotation lines from FASTA format. • FASTA format • A line starting with “>” is the annotation of a sequence. • The lines following the annotation contain the nucleotide or amino acid sequence. • To select the annotations, select lines starting with “>” • Count the number of nucleotides in (Multi) FASTA format • Lines containing nucleotides do not start with “>” • Number of nucleotides = number of characters • Use the “wc” command • Note that the end of each line contains a “Return” character
    >gi|29028877|gb|BT005883|U23535
    ATGGAAAGCAAAGGAAGAATCCATCCATCTCATCATCATATGAGGCGTCCTCTTCCAGGTCCCGGTGGCTGTATAGCGCA
    TCCGGAGACTTTCGGTAATCACGGTGCTATACCACCTTCTGCTGCTCAAGGTGTGTATCCTTCCTTCAACATGTTACCTC
    CACCTGAAGTTATGGAGCAAAAGTTTGTGGCACAACACGGGGAATTACAGAGACTTGCTATAGAGAATCAGAGACTTGGT
  • 55. Let’s start NGS analysis! • Dataset • TAIR 10 genome (A.thaliana) • 1/100 scale SOLiD RNA-Seq read sets • Filenames: tha_reads.csfasta & tha_reads_QV.qual • SRR038985: 41,117,124 reads, 1,439,099,340 bp • http://trace.ddbj.nig.ac.jp/DRASearch/experiment?acc=SRX018529 • Filenames: lyr_reads.csfasta & lyr_reads_QV.qual • SRR038987: 41,340,154 reads, 1,446,905,390 bp • http://trace.ddbj.nig.ac.jp/DRASearch/experiment?acc=SRX018531 • 1/10 scale Roche 454 Read Set (SRR020799)
    $ grep -e "^>" tha_reads.csfasta | wc -l
    411171
  • 56. Installing BWA • In this lecture, we skip this procedure because our computers do not have the “gcc” command to compile C programs. • Download BWA • http://bio-bwa.sourceforge.net/ • bwa-0.5.8c.tar.bz2 is on the USB memory. Copy the file. • Extract the file • Move into the BWA directory • Compile the source programs • Make the alias name “bwa” for the bwa-0.5.8c directory
    # $ curl -O http://switch.dl.sourceforge.net/project/bio-bwa/bwa-0.5.8c.tar.bz2
    # $ bzip2 -dc bwa-0.5.8c.tar.bz2 | tar xvf -
    # ...filenames...
    # $ ln -s bwa-0.5.8c bwa # Simplify the directory name
    # $ cd bwa
    # $ make
    # ...compile messages...
    # $ cd .. # back to working directory
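The counting rule in this exercise (skip annotation lines, do not count the line-end character) can also be written as a short Python sketch. It is only an illustration alongside the grep and wc approach the exercise asks for; the function name `count_fasta` is made up, and the two-record input is a toy example rather than data from the lecture.

```python
def count_fasta(lines):
    """Return (number of sequences, number of nucleotides) for FASTA lines."""
    n_sequences = 0
    n_nucleotides = 0
    for line in lines:
        line = line.rstrip("\r\n")  # the line-end character is not a base
        if line.startswith(">"):
            n_sequences += 1        # annotation line
        else:
            n_nucleotides += len(line)
    return n_sequences, n_nucleotides


toy = [
    ">seq1 toy example\n",
    "ATGGAAAGCA\n",
    "AAGG\n",
    ">seq2 toy example\n",
    "TCCGGA\n",
]
print(count_fasta(toy))  # (2, 20)
```

With wc alone, the character count includes one extra character per line, which is the point of the note about the “Return” character.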
  • 57. Prepare A.thaliana Genome • Download chromosomes from TAIR site • http://www.arabidopsis.org/ • Find URLs by selecting “Download” tab > Sequences > whole_chromosomes • Each file includes one chromosome on current version. • TAIR10_chr1.fas, TAIR10_chr2.fas, TAIR10_chr3.fas, TAIR10_chr4.fas, TAIR10_chr5.fas, TAIR10_chrC.fas, TAIR10_chrM.fas • Because of limited server and network capacity, distributed these files with USB or web site for this lecture. • Concatenate these chromosomes except chloroplast and mitochondria into single file 57
  • 58. Prepare A.thaliana Genome (commands)
    # We skip this process
    #$ curl -O "ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/TAIR10_chr[1-5].fas"
    ## 1-5 means consecutive numbers from 1 to 5.
    ## We do not use chloroplast and mitochondria genomes.
    # Instead of the download, we use the files in USB.
    # The files are in your working directory.
    # Check it with the command below.
    $ ls TAIR10*
    TAIR10_chr1.fas TAIR10_chr3.fas TAIR10_chr5.fas
    TAIR10_chr2.fas TAIR10_chr4.fas
    # Concatenate all chromosomes into a single file
    $ cat TAIR10_chr1.fas TAIR10_chr2.fas TAIR10_chr3.fas TAIR10_chr4.fas TAIR10_chr5.fas > TAIR10_chr_all.fas
    # Check the result
    $ grep -e "^>" TAIR10_chr_all.fas
    >Chr1 CHROMOSOME dumped from ADB: Jun/20/09 14:53; last updated: 2009-02-02
    >Chr2...
    # You can find 5 chromosomes’ annotations
  • 59. Run BWA • Make an index of the genome sequence • For SOLiD reads, the “-c” option is required. • This process is needed just once as long as you use the same genome (it does not depend on the read sequences). • Convert the reads’ color space into the BWA-specific format • You don’t need this process for Illumina reads. • Illumina sequencers produce FastQ format files, and most alignment software can handle them directly. • Map reads against the genome sequence • If you use Illumina, the -I option may be required. Check your Illumina version. • The indexing and mapping steps may take a long time. This lecture’s toy data is 1/100 scale. Real data will require more than two hours.
    $ ./bwa/bwa index -c TAIR10_chr_all.fas
    # running messages. Takes more than 3 mins.
    $ python csfasta2fastq.py --bwa tha_reads > tha_reads.bwa
    $ ./bwa/bwa aln -c TAIR10_chr_all.fas tha_reads.bwa > tha_reads.sai
    # messages...about 1min. Alignment phase.
  • 60. Run BWA (continued) • Convert mapping result into SAM format. • You have to use “sampe” instead of “samse” for paired end experiment to put mate pair information into SAM format. • That’s all! Check the contents of sam file with less command. • How many reads can be mapped against genome? $ ./bwa/bwa samse TAIR10_chr_all.fas tha_reads.sai tha_reads.bwa > tha_reads.sam # messages. Generate summary of alignment. # If you have paired ended reads, you can use sampe instead of samse. $ less tha_reads.sam # Press “q” to quit less command. # Next page is “space” 60
  • 61. Inside of SAM file
    # Chromosome (mapped database) information
    @SQ SN:Chr1 LN:30427671
    @SQ SN:Chr2 LN:19698289
    @SQ SN:Chr3 LN:23459830
    @SQ SN:Chr4 LN:18585056
    @SQ SN:Chr5 LN:26975502
    # Used program and its variables
    @PG ID:bwa PN:bwa VN:0.5.9-r16
    # Mapped read in forward direction on Chr5
    SRR038985.100 0 Chr5 22828962 37 33M * 0 0 GCCGGTGATGTAATCAAAATATTTGCTACTCTT WZYTWWTW] YVUOW]OEKNUUX]PJSRY][63 XT:A:U CM:i:0 X0:i:1 X1:i:0 XM:i: 1 XO:i:0 XG:i:0 MD:Z:33
    SRR038985.200 0 Chr3 14197678 0 33M * 0 0 ACCTGGTTGATCCTGCCAGTAGTCATATGCTTG X]]KN]] YWUX]XIKYRCHSUYX[[SNQJL[MO XT:A:R CM:i:0 X0:i:2 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:33 XA:Z:Chr2,+3707,33M,0;
    # Unmapped read
    SRR038985.300 4 * 0 0 * * 0 0 AAACTGCGGGGTCTCACTTTTTTGGGTTTGGGGT 124,/08/5&6-&,(;/4+ %7,+5.:1',*;8:&
  • 62. Exercise (4) • Run BWA • Compare file size of csfasta + qual files with generated SAM file. • Which is larger? How much disk space we need to analyze? • Check the details of SAM file • Format details are described in http:// samtools.sourceforge.net/SAM1.pdf • How many reads are mapped onto chromosomes. • Select lines containing “Chr” # use grep • Then, count the number of lines # use wc • Calculate ratio of mapped reads to total reads. 62
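As a cross-check for the grep and wc approach in this exercise, the mapped ratio can also be computed from the FLAG column. The Python sketch below is an illustration under stated assumptions: the function name `mapped_ratio` is made up, the three alignment lines are shortened toy records rather than real output, and a read counts as mapped when bit 0x4 of its flag is not set.

```python
def mapped_ratio(lines):
    """Return (mapped reads, total reads) from SAM-formatted lines."""
    mapped = 0
    total = 0
    for line in lines:
        if line.startswith("@"):
            continue  # header lines such as @SQ also contain "Chr"
        columns = line.rstrip("\n").split("\t")
        flag = int(columns[1])
        total += 1
        if not flag & 0x4:  # bit 0x4 set means the read is unmapped
            mapped += 1
    return mapped, total


toy_sam = [
    "@SQ\tSN:Chr1\tLN:30427671\n",
    "read1\t0\tChr5\t22828962\t37\t33M\t*\t0\t0\tACGT\tIIII\n",
    "read2\t16\tChr3\t14197678\t0\t33M\t*\t0\t0\tACGT\tIIII\n",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
]
mapped, total = mapped_ratio(toy_sam)
print(mapped, total, mapped / total)
```

Skipping the header lines matters here: a plain search for “Chr” also matches the @SQ lines, so the grep count is slightly larger than the number of mapped reads.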
  • 63. Problems • The mapped read ratio may be much lower than expected. • Genome quality is (probably) high. • Various problems • Wet problems • Protocols and reagents • Mitochondria and chloroplast. • Dry problems • We used all sequences. We may need to remove low quality reads. • Sequence quality at the 3’-end is low. We might trim these sequences. • We did not take care of reads on splice junctions. • We did not change any parameters in BWA. The parameters might not be suitable for our reads. • No one has a versatile result. • Note!!! The mapped ratio of current RNA-Seq reads is (much) higher than this result.