Sequencing data analysis
Workshop – part 2 / mapping to a reference genome



                      Outline

            Previously in this workshop…

      Mapping to a reference genome – the steps

    Mapping to a reference genome – the workshop




                  Maté Ongenaert
Previously in this workshop…
Introduction – the real cost of sequencing
Previously in this workshop…
  The workflow of NGS data analysis
                            Data analysis

                 Raw machine reads… What’s next?

                Preprocessing (machine/technology)
                 - adaptors, indexes, conversions,…
                 - machine/technology dependent

              Reads with associated qualities (universal)
                              - FASTQ
                            - QC check

            Depending on application (generally applicable)
        - ‘de novo’ assembly of genome (bacterial genomes,…)
         - Mapping to a reference genome → mapped reads
                          - SAM/BAM/…

             High-level analysis (specific for application)
                            - SNP calling
                           - Peak calling
Previously in this workshop…
                                     Main data formats
                                     Raw sequence reads:

- Represent the sequence ~ FASTA
  >SEQUENCE_IDENTIFIER
  GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT


- Extension: represent the quality, per base ~ FASTQ – Q for quality
Quality score ~ Phred score, stored as one ASCII character per base: ASCII code = Phred + 33 (Sanger encoding)
  @SEQUENCE_IDENTIFIER
  GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
  +
  !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65



- Machine and platform independent and compressed: SRA (NCBI)
Get the original FASTQ file using the SRA Toolkit (NCBI)
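
As an illustration of the Phred + 33 encoding described above, here is a minimal Python sketch that decodes the quality line of the FASTQ example. The function names are chosen for this example only; the relation between a Phred score Q and the error probability (10^(-Q/10)) is the standard Phred definition.

```python
def phred33_to_scores(quality_string):
    """Convert a Sanger (Phred+33) quality string into integer Phred scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(phred):
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# Quality line taken from the FASTQ example on this slide
quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred33_to_scores(quality)

print(scores[:5])          # Phred scores of the first five bases
print(max(scores))         # best base quality in this read
print(error_probability(20))
```

The first character "!" has ASCII code 33 and therefore stands for the lowest possible quality (Phred 0).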
Previously in this workshop…
                                Main data formats
- Now moving to a common file format → SAM / BAM (Sequence Alignment/Map)
- BAM: binary (read: computer-readable, indexed, compressed) ‘form’ of SAM

DESCRIPTION OF THE 11 FIELDS IN THE ALIGNMENT SECTION

# QNAME: template name
# FLAG: bitwise flag
# RNAME: reference name
# POS: mapping position
# MAPQ: mapping quality
# CIGAR: CIGAR string
# RNEXT: reference name of the mate/next fragment
# PNEXT: position of the mate/next fragment
# TLEN: observed template length
# SEQ: fragment sequence
# QUAL: ASCII of Phred-scale base quality+33

#Headers
@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45

#Alignment block
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
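
To make the FLAG and CIGAR fields of the alignment block more concrete, the following Python sketch decodes a bitwise flag and computes how many reference bases a CIGAR string covers. The bit values follow the SAM specification; the label names and function names are only descriptive choices for this example.

```python
import re

# Bit values from the SAM specification; the labels are illustrative names
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
}


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def reference_span(cigar):
    """Number of reference bases covered by an alignment.

    Only M, D, N, = and X consume the reference; I, S, H and P do not.
    """
    operations = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(length) for length, op in operations if op in "MDN=X")


# r001 from the example: FLAG 163, POS 7, CIGAR 8M2I4M1D3M
print(decode_flag(163))
span = reference_span("8M2I4M1D3M")
print(span, "reference bases, last aligned position:", 7 + span - 1)
```

For r001 this gives a properly paired second read whose mate is on the reverse strand, covering reference positions 7 to 22.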
Previously in this workshop…
                                         Main data formats
- BED files (location / annotation / scores): Browser Extensible Data
Used for mapping / annotation / peak locations; extension: bigBed (binary)
FIELDS USED:
# chr
# start
# end
# name
# score
# strand

track   name=pairedReads description="Clone Paired Reads" useScore=1
#chr    start end name score strand
chr22   1000 5000 cloneA 960 +
chr22   2000 6000 cloneB 900 -


- BEDGraph files (location, combined with score)
Used to represent peak scores
track type=bedGraph name="BedGraph Format" description="BedGraph format"
visibility=full color=200,100,0 altColor=0,100,200 priority=20
#chr start    end      score
chr19 59302000 59302300 -1.0
chr19 59302300 59302600 -0.75
chr19 59302600 59302900 -0.50
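
A small Python sketch of how BED and bedGraph lines like the ones above could be read. It relies on the UCSC convention that BED-style coordinates are 0-based with an exclusive end, so the interval length is simply end minus start. The function name and the dictionary keys are chosen for this example.

```python
def parse_bed(lines):
    """Parse BED or bedGraph lines into dictionaries, skipping header lines."""
    records = []
    for line in lines:
        line = line.strip()
        # track/browser lines and comments carry display settings, not data
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        start, end = int(fields[1]), int(fields[2])
        records.append({
            "chrom": fields[0],
            "start": start,          # 0-based
            "end": end,              # exclusive
            "length": end - start,
            "extra": fields[3:],     # name/score/strand (BED) or score (bedGraph)
        })
    return records


bed_lines = [
    'track name=pairedReads description="Clone Paired Reads" useScore=1',
    "#chr start end name score strand",
    "chr22 1000 5000 cloneA 960 +",
    "chr22 2000 6000 cloneB 900 -",
]
for record in parse_bed(bed_lines):
    print(record["chrom"], record["start"], record["end"], record["length"], record["extra"])
```

The same function reads the bedGraph rows, where the only extra column is the score.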
Previously in this workshop…
                                       Main data formats
- WIG files (location / annotation / scores): wiggle
Used to visualize or summarize data, in most cases count data or normalized count
data (RPKM) – extension: bigWig – binary version (often used in GEO for ChIP-seq peaks)




browser position chr19:59304200-59310700
browser hide all

#150 base wide bar graph at arbitrarily spaced positions,
#threshold line drawn at y=11.76
#autoScale off viewing range set to [0:25]
#priority = 10 positions this as the first graph

track type=wiggle_0 name="variableStep" description="variableStep format"
visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255
yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
59304701 10.0
59304901 12.5
59305401 15.0
59305601 17.5
59305901 20.0
59306081 17.5
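
The variableStep block above can be expanded into explicit intervals. The sketch below assumes the UCSC wiggle convention that positions are 1-based and that each value covers "span" bases starting at the given position; the function name is chosen for this example, and the multi-line track definition is left out of the input.

```python
def parse_variable_step(lines):
    """Expand wiggle variableStep data into (chrom, start, end, value) tuples.

    Start and end are 1-based and inclusive.
    """
    chrom, span = None, 1
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if line.startswith("variableStep"):
            # e.g. "variableStep chrom=chr19 span=150"
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            continue
        position, value = line.split()
        start = int(position)
        intervals.append((chrom, start, start + span - 1, float(value)))
    return intervals


wig_lines = [
    "variableStep chrom=chr19 span=150",
    "59304701 10.0",
    "59304901 12.5",
    "59305401 15.0",
]
for interval in parse_variable_step(wig_lines):
    print(interval)
```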
Previously in this workshop…
                                    Main data formats
- GFF format (General Feature Format) or GTF
Used for annotation of genetic / genomic features – such as all coding genes in Ensembl
Often used in downstream analysis to assign annotation to regions / peaks / …
FIELDS USED:

# seqname (the name of the sequence)
# source (the program that generated this feature)
# feature (the name of this type of feature – for example: exon)
# start (the starting position of the feature in the sequence)
# end (the ending position of the feature)
# score (a score between 0 and 1000)
# strand (valid entries include '+', '-', or '.')
# frame (if the feature is a coding exon, frame should be a number between
0-2 that represents the reading frame of the first base. If the feature is
not a coding exon, the value should be '.'.)
# group (all lines with the same group are linked together into a single
item)

track name=regulatory description="TeleGene(tm)    Regulatory Regions"
#chr  source   feature  start   end     score strand frame group
chr22 TeleGene enhancer 1000000 1001000 500        + . touch1
chr22 TeleGene promoter 1010000 1010100 900        + . touch1
chr22 TeleGene promoter 1020000 1020000 800        - . touch2
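
As a sketch of how the nine GFF fields listed above map onto a record, the following Python example parses the lines of the TeleGene track. Unlike BED, GFF coordinates are 1-based with an inclusive end, so the feature length is end - start + 1. Real GFF files are tab-separated; splitting on any whitespace is a simplification that only works here because no field contains spaces. The class and function names are chosen for this example.

```python
from collections import namedtuple

GffRecord = namedtuple(
    "GffRecord",
    "seqname source feature start end score strand frame group",
)


def parse_gff_line(line):
    """Parse one GFF line into a GffRecord (coordinates are 1-based, inclusive)."""
    fields = line.split(None, 8)  # at most 9 fields; the group may contain spaces
    if len(fields) != 9:
        raise ValueError("expected 9 GFF fields, got %d" % len(fields))
    seqname, source, feature, start, end, score, strand, frame, group = fields
    return GffRecord(seqname, source, feature, int(start), int(end),
                     score, strand, frame, group.strip())


def feature_length(record):
    """Length in bases; both start and end belong to the feature."""
    return record.end - record.start + 1


record = parse_gff_line("chr22 TeleGene enhancer 1000000 1001000 500 + . touch1")
print(record.feature, record.strand, feature_length(record))
```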
Previously in this workshop…
                                     Main data formats
- VCF format (Variant Call Format)
For SNP representation
Previously in this workshop…
                                  Main data formats
- http://genome.ucsc.edu/FAQ/FAQformat.html

- UCSC browser data formats, including the most commonly used formats that are
  accepted and widely used

- In addition, ENCODE data formats (narrowPeak / broadPeak)
Mapping to a reference genome
                                      The workflow
Mapping:

Aligning the raw sequence reads to a reference genome by using an indexing strategy and
aligning algorithm, taking into account the quality scores and with specific conditions

- Raw sequence reads with quality scores: FASTQ
- Reference genome: FASTA files can be downloaded (UCSC/Ensembl)

- Sequence reads <> reference genome: alignment
- To perform an efficient alignment, an indexing strategy is used
- For instance (BWA/Bowtie): FM indexes (based on the Burrows-Wheeler transform) on the
  reference genome and/or the sequence reads

- Specific conditions: single-end or paired-end; how many mismatches allowed; trade-off
  speed/accuracy/specificity; local re-alignment afterwards for improved indel calling; …

>> Result: mapped sequence reads: chr / start / end / quality >> SAM file (>> BAM)
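
To give an idea of what an FM index based on the Burrows-Wheeler transform does, here is a toy Python sketch of exact matching by backward search. It is a teaching example only: it builds the suffix array by naive sorting and keeps full count tables in memory, which is not how BWA or Bowtie are implemented, and it does not handle mismatches, gaps or quality scores. All names are chosen for this example.

```python
def bwt_and_suffix_array(text):
    """Return the Burrows-Wheeler transform and suffix array of text + '$'."""
    text += "$"  # sentinel, sorts before A/C/G/T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    return bwt, suffix_array


def build_fm_index(bwt):
    """Build the two tables used by backward search.

    c_table[c]: number of characters in the text that are smaller than c
    occ[c][i]:  number of occurrences of c in bwt[:i]
    """
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return c_table, occ


def find_exact(pattern, bwt, suffix_array):
    """Return sorted 0-based positions of exact matches of pattern."""
    c_table, occ = build_fm_index(bwt)
    lo, hi = 0, len(bwt)  # half-open range of matching suffix-array rows
    for ch in reversed(pattern):  # extend the match one base to the left
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[i] for i in range(lo, hi))


reference = "GATTACAGATTACA"  # hypothetical mini 'genome'
bwt, sa = bwt_and_suffix_array(reference)
print(bwt)
print(find_exact("ATTA", bwt, sa))
```

The point of the index is that the search time depends on the read length, not on the genome length; the suffix array entries then translate the matching rows back into genomic positions.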
Mapping to a reference genome
                                       The workflow
The reference genome

- Sequences (human, rat, mouse, …) can be downloaded from UCSC (Golden Path) or
  Ensembl
- Difficulty: download is in 2bit format (needs a converter) >> FASTA files (.fa)
- Need to be indexed by the mapping program you are going to use

- BWA: bwa index
- Bowtie: bowtie-build (pre-computed indexes available)

- BWA example:

bwa index [-p prefix] [-a algoType] [-c] <in.db.fasta>
Index database sequences in the FASTA format.

OPTIONS:
-c         Build color-space index. The input FASTA should be in nucleotide space.
-p STR     Prefix of the output database [same as db filename]
-a STR     Algorithm for constructing BWT index. Available options are:
           is      IS linear-time algorithm for constructing suffix array.
                   It requires 5.37N memory where N is the size of the database.
           bwtsw   Algorithm implemented in BWT-SW. This method works with the whole human genome
Mapping to a reference genome
                                     The workflow
The sequencing reads

- Sequence reads with quality scores: FASTQ files from the machine
- Depending on the mapping program, need to be indexed as well

- BWA: converts reads to SA coordinates (Suffix Array) based on the reference genome
  index
- Bowtie: not needed: indexing and aligning in one step

- BWA:
- Index reference genome
- Index sequence reads (INPUT: FASTQ and REF. GENOME) >> SA coordinates (OUTPUT:
  SAI)
- SA coordinates (INPUT: SAI/FASTQ and REF. GENOME) >> SAM/BAM (OUTPUT)
Mapping to a reference genome
                                       The workflow
aln        bwa aln [-n][-o][-e][-d][-i][-k][-l][-t][-cRN][-M][-O][-E][-q]
            <in.db.fasta> <in.query.fq> > <out.sai>

Find the SA coordinates of the input reads.
Maximum maxSeedDiff differences are allowed in the first seedLen subsequence and
maximum maxDiff differences are allowed in the whole sequence.

OPTIONS:
-n NUM     Maximum edit distance if the value is INT, or the fraction of missing alignments
           given a 2% uniform base error rate if the value is FLOAT
-o INT     Maximum number of gap opens
-e INT     Maximum number of gap extensions, -1 for k-difference mode
-d INT     Disallow a long deletion within INT bp towards the 3’-end
-i INT     Disallow an indel within INT bp towards the ends [5]
-l INT     Take the first INT subsequence as seed.
-k INT     Maximum edit distance in the seed
-t INT     Number of threads (multi-threading mode)
-M INT     Mismatch penalty
-O INT     Gap open penalty
-E INT     Gap extension penalty
-R INT     Proceed with suboptimal alignments
-c         Reverse query but not complement it
-N         Disable iterative search.
-q INT     Parameter for read trimming.
-I         The input is in the Illumina 1.3+ read format (quality equals ASCII-64)
-B INT     Length of barcode starting from the 5’-end.
-b         Specify the input read sequence file is the BAM format.
-0         When -b is specified, only use single-end reads in mapping.
-1         When -b is specified, only use the first read in a read pair in mapping
-2         When -b is specified, only use the second read in a read pair in mapping
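
The -I option exists because Illumina 1.3+ files store qualities as ASCII code = Phred + 64 instead of Phred + 33. As an illustration of the difference, this Python sketch converts such a quality string to the Sanger encoding. The function name is chosen for this example, and the check for characters below "@" is an added safeguard, not something BWA is claimed to do.

```python
def illumina64_to_sanger33(quality_string):
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger)."""
    converted = []
    for ch in quality_string:
        phred = ord(ch) - 64
        if phred < 0:
            # Characters below '@' cannot be Phred+64 encoded qualities
            raise ValueError("not a Phred+64 quality string: %r" % ch)
        converted.append(chr(phred + 33))
    return "".join(converted)


# 'h' is Phred 40 in Phred+64; in Phred+33 the same score is written as 'I'
print(illumina64_to_sanger33("hhhh@"))
```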
Mapping to a reference genome
                                       The workflow
samse      bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>
Generate alignments in the SAM format given single-end reads
Repetitive hits will be randomly chosen.

OPTIONS:
-n INT     Maximum number of alignments to output in the XA tag for reads paired properly.
-r STR     Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’


sampe      bwa sampe [-a][-o][-n][-N][-P] <in.db.fasta>
<in1.sai> <in2.sai> <in1.fq> <in2.fq> > <out.sam>
Generate alignments in the SAM format given paired-end reads.
Repetitive read pairs will be placed randomly.

OPTIONS:
-a INT     Maximum insert size for a read pair to be considered being mapped properly.
-o INT     Maximum occurrences of a read for pairing.
-P         Load the entire FM-index into memory to reduce disk operations
-n INT     Maximum number of alignments to output in the XA tag for reads paired properly
-N INT     Maximum number of alignments to output in the XA tag for discordant read pairs
-r STR     Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’
Mapping to a reference genome
                                       The workshop
Mapping using BWA

bwa-0.5.9 aln -t 4 /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.fastq > SRR058523.sai

bwa-0.5.9: BWA and its version
aln: alignment functionality of BWA
-t 4: use 4 threads (CPU cores) at the same time to speed up
/opt/genomes/GRCh37/index/bwa/GRCh37: location of the reference genome index
SRR058523.fastq: FASTQ file to align to the reference
>: redirects the output to a file
SRR058523.sai: the output file (SA index file)

Maps the input sequences (FASTQ) to the reference genome index → output: indexes of
 the reads (SA coordinates)

This is not yet a ‘real genomic mapping’, so a next step is needed…
Mapping to a reference genome
                                       The workshop
Mapping using BWA

bwa-0.5.9 samse /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.sai SRR058523.fastq |
samtools-0.1.18 view -bhSo PHF8-unsorted.bam -


bwa-0.5.9: BWA and its version
samse: single-end mapping and output to SAM format
/opt/genomes/GRCh37/index/bwa/GRCh37: location of the reference genome index
SRR058523.sai: the reads index
SRR058523.fastq: the raw reads and quality scores

On its own, this would output a SAM file (for instance with > SRR058523.sam)
But we don’t need the SAM file, we would like a BAM file → processing by samtools

| is the ‘pipe’ symbol: hands over the output from one command to the other
samtools-0.1.18: samtools and its version
view: the command to process SAM files
-bhSo: b = output BAM; h = print the headers; S = input is SAM; o = output file name
PHF8-unsorted.bam: output file name
-: read the input from standard input, i.e. the output handed over by the | symbol (end of second command)
Mapping to a reference genome
                                        The workshop
Mapping using BWA

bwa-0.5.9 aln -t 4 /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.fastq > SRR058523.sai

bwa-0.5.9 samse /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.sai SRR058523.fastq |
samtools-0.1.18 view -bhSo PHF8-unsorted.bam -

Two-step process in BWA

Next steps: process the BAM file → sort and index it (using samtools)

samtools-0.1.18 sort PHF8-unsorted.bam PHF8-sorted

Creates a sorted BAM file (PHF8-sorted.bam)
samtools-0.1.18 index PHF8-sorted.bam

Indexes the sorted BAM file (and creates a BAM index file – PHF8-sorted.bam.bai)
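
The four commands of this slide can be generated for any sample with a small helper. The Python sketch below only builds the command lines as text and does not run anything; the function name and the "sample" parameter are hypothetical, while the program names and options are the ones used on the slides (bwa 0.5.9, samtools 0.1.18). Newer samtools releases use a different "sort" syntax, so the third command is specific to this old version.

```python
import os
import shlex


def build_bwa_pipeline(index_prefix, fastq, sample, threads=4,
                       bwa="bwa-0.5.9", samtools="samtools-0.1.18"):
    """Return the shell commands for single-end mapping, sorting and indexing."""
    q = shlex.quote  # protect paths that contain spaces or special characters
    sai = os.path.splitext(fastq)[0] + ".sai"
    unsorted_bam = sample + "-unsorted.bam"
    sorted_prefix = sample + "-sorted"
    sorted_bam = sorted_prefix + ".bam"
    return [
        # step 1: reads -> SA coordinates
        f"{q(bwa)} aln -t {int(threads)} {q(index_prefix)} {q(fastq)} > {q(sai)}",
        # step 2: SA coordinates -> SAM, converted to BAM on the fly
        f"{q(bwa)} samse {q(index_prefix)} {q(sai)} {q(fastq)} | "
        f"{q(samtools)} view -bhSo {q(unsorted_bam)} -",
        # step 3: sort by coordinate (old syntax: output prefix without .bam)
        f"{q(samtools)} sort {q(unsorted_bam)} {q(sorted_prefix)}",
        # step 4: create the .bai index
        f"{q(samtools)} index {q(sorted_bam)}",
    ]


for command in build_bwa_pipeline("/opt/genomes/GRCh37/index/bwa/GRCh37",
                                  "SRR058523.fastq", "PHF8"):
    print(command)
```

Printing the commands first and running them only after checking them is a simple way to avoid overwriting files by accident.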
Mapping to a reference genome
                                         The workshop
BAM: what’s next?

So, now we have the sorted and indexed BAM file – what’s next?

This file is the starting point for all other analysis, depending on the application:

ChIP-seq: peak calling
SNP calling
RNA-seq: calculate gene-expression levels of the transcripts / find splice variants

What are the first things to do?
- Visualize it (IGV can load BAM files)
- First downstream analysis: QC and basic statistics (how many mapped reads, quality
  distribution, distribution across chromosomes,…)
Mapping to a reference genome
                                        The workshop
First downstream analysis

- QC and basic statistics (how many mapped reads, quality distribution, distribution
  across chromosomes, information on paired-end reads,…)

Samstat
/opt/samstat/samstat PHF8-sorted.bam



- Outputs an HTML file with statistics
Mapping to a reference genome
                                                The workshop
First downstream analysis

- QC and basic statistics (how many mapped reads, quality distribution, distribution
  across chromosomes, information on paired-end reads,…)

BamUtil (stats)

bam stats --in PHF8-sorted.bam --basic --phred --baseSum

Number of records read = 15732744

TotalReads(e6)           15.73
MappedReads(e6)          15.04
PairedReads(e6)          15.73
ProperPair(e6)           14.65
DuplicateReads(e6)       0.00
QCFailureReads(e6)       0.00

MappingRate(%)           95.59
PairedReads(%)           100.00
ProperPair(%)            93.11
DupRate(%)               0.00
QCFailRate(%)            0.00

TotalBases(e6)           802.37
BasesInMappedReads(e6)   766.95

Quality          Count
33               0
34               0
35               71373
36               0
37               0
38               203544
39               403649
40               921714
41               2081099
42               1974615
43               2285826
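
The summary part of this output is easy to read into a script. The sketch below parses "name value" lines and recomputes the mapping rate from the read counts; because the counts are printed in millions with two decimals, the recomputed percentage can differ slightly from the reported one. The function name is chosen for this example.

```python
def parse_key_value_stats(text):
    """Collect lines of the form '<name> <number>' into a dictionary."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        try:
            stats[parts[0]] = float(parts[1])
        except ValueError:
            pass  # e.g. the 'Quality Count' header line
    return stats


# A few lines copied from the output on this slide
summary = """TotalReads(e6)   15.73
MappedReads(e6) 15.04
ProperPair(e6)   14.65
MappingRate(%)   95.59
Quality          Count
"""
stats = parse_key_value_stats(summary)
recomputed = 100 * stats["MappedReads(e6)"] / stats["TotalReads(e6)"]
print("reported mapping rate:  ", stats["MappingRate(%)"])
print("recomputed mapping rate:", round(recomputed, 2))
```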
Mapping to a reference genome
                                       The workshop
First downstream analysis

- QC and basic statistics (how many mapped reads, quality distribution, distribution
  across chromosomes, information on paired-end reads,…)

Samtools
samtools-0.1.18 idxstats PHF8-sorted.bam

1      249250621        503714   0
2      243199373        345217   0
3      198022430        273477   0
4      191154276        229016   0
5      180915260        360339   0
6      171115067        257468   0
7      159138663        269704   0
8      146364022        242656   0
9      141213431        203505   0
10     135534747        237496   0
11     135006516        218116   0
12     133851895        231426   0
13     115169878        106831   0
14     107349540        119062   0
15     102531392        141351   0
16     90354753         183004   0
17     81195210         187024   0
18     78077248         86101    0
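
The columns of the idxstats output are: reference name, reference length, number of mapped reads and number of unmapped reads. As a sketch of a first sanity check, the Python example below normalizes the mapped reads by chromosome length; the function names are chosen for this example and the input is a few rows copied from above.

```python
def parse_idxstats(text):
    """Parse 'samtools idxstats' output into (name, length, mapped, unmapped)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, unmapped = line.split()
        rows.append((name, int(length), int(mapped), int(unmapped)))
    return rows


def reads_per_megabase(rows):
    """Mapped reads per million reference bases, per chromosome."""
    return {name: mapped / (length / 1e6)
            for name, length, mapped, _unmapped in rows if length > 0}


idxstats_output = """1      249250621        503714   0
13     115169878        106831   0
18     78077248         86101    0
"""
rows = parse_idxstats(idxstats_output)
for name, density in reads_per_megabase(rows).items():
    print(name, round(density, 1))
print("total mapped:", sum(mapped for _, _, mapped, _ in rows))
```

Large differences between chromosomes are not necessarily a problem: in ChIP-seq data they may simply reflect where the enriched regions are.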
Mapping to a reference genome
                                     The workshop
First downstream analysis

- Think about PCR duplicates → you may want to remove them (or set a ‘flag’ in the BAM
  file, indicating that a read is a duplicate)
- samtools rmdup or Picard MarkDuplicates

- Find out how these tools work and what other flags are used in BAM files
- Can you make statistics with the BAM flags?
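
One possible starting point for the last question is sketched below in Python: it counts a few flag bits over SAM text lines (for instance the output of "samtools view -h" on a BAM file). The bit values 0x2 (proper pair), 0x4 (unmapped) and 0x400 (duplicate) come from the SAM specification; the function name and the three example records are hypothetical.

```python
from collections import Counter


def flag_statistics(sam_lines):
    """Count mapped, unmapped, properly paired and duplicate reads."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:      # not a complete alignment record
            continue
        flag = int(fields[1])
        counts["total"] += 1
        counts["unmapped" if flag & 0x4 else "mapped"] += 1
        if flag & 0x2:
            counts["proper_pair"] += 1
        if flag & 0x400:
            counts["duplicate"] += 1
    return counts


def make_record(name, flag):
    """Build a minimal, hypothetical SAM line with the given FLAG."""
    return "\t".join([name, str(flag), "ref", "7", "30", "8M",
                      "*", "0", "0", "TTAGATAA", "*"])


example = ["@HD\tVN:1.3\tSO:coordinate",
           make_record("read1", 163),    # paired, proper pair
           make_record("read2", 4),      # unmapped
           make_record("read3", 1024)]   # marked as duplicate
stats = flag_statistics(example)
print(dict(stats))
print("duplicate rate: %.1f%%" % (100 * stats["duplicate"] / stats["total"]))
```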
Mapping to a reference genome
                                     The workshop
Mapping – now let’s start!

- Mapping is only the starting point for most downstream analysis tools
- Depends on the application and what you want to do:

    - Exome sequencing / whole genome sequencing: SNP calling (samtools): based on
      mapping quality / coverage → identification of SNPs (VCF output format)

    - ChIP-seq: peak calling: based on coverage of ChIP and input, enriched regions are
      identified (BED output, bedGraph and/or WIG files)

    - RNA-seq: assign reads to the transcripts, normalize (length of exon and number of
      reads in the sequencing library = RPKM) → (relative) expression levels →
      identification of differentially expressed genes
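
The RPKM normalization mentioned for RNA-seq (reads per kilobase of exon per million mapped reads) can be written down in a few lines. The Python sketch below uses the usual definition; the function name and the numbers in the example are hypothetical.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads Per Kilobase of feature per Million mapped reads.

    RPKM = reads * 10^9 / (feature length in bp * total mapped reads)
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on 2,000 bp of exons, 10 million mapped reads
print(rpkm(500, 2000, 10_000_000))
```

Dividing by the feature length makes long and short transcripts comparable, and dividing by the library size makes samples with different sequencing depth comparable.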