Deep Seq Data Analysis
Theoretical training
Christophe.antoniewski@upmc.fr
http://artbio.fr
Mouse Genetics
January 21, 2016, 13:30–15:00
Sequencing Technologies
Latest commercialized Sequencing Technology
Sequencing-by-pH-variations in ION TORRENT
Sequencing Technologies : Quantitative Facts
Sequencing Technologies : Focus on Illumina technology
Deep sequencing applications
High throughput sequencing of DNA or RNA provides Qualitative (sequence) and Quantitative (number of reads) information
Stranded RNAseq library
20-30nt RNA gel purification
Small RNA library (Biases)
Library “Bar coding”
ChIPseq library preparation (Non Directional)
What can I do with my sequence reads ?
Flowchart of a sequencing project
• Platform Selection: What am I going to sequence? For what analysis? Technical biases and limitations. Specific benefits (read length, single or paired ends, number of reads).
• Library Preparation: Whole genome, whole exome, target enrichment. Size selection. Stranded/unstranded? Amplification. Single Cell Protocol.
• Sequencing: Length of the read. Single or paired ends. Number of lanes (depth of sequencing).
• Quality Control: Adapter clipping. Quality trimming. Contaminants and sequencing errors. Biases in GC content.
• Alignment: Bowtie, BWA… (P Flicek & E Birney, Nature Methods 2009)
• Assembly: Velvet, Oases, Trinity, SOAP, SSAKE… (Zhang W, Chen J, et al. (2011), PLoS ONE 6(3))
• Visualization & Statistics: R, Matlab & Open Source software tools
  • Normalization (library comparison)
  • Peak finding (Binding sites, Breakpoints, etc…)
  • Differential Calling (expression, variants, etc)
Think about the number of replicates
Basic Material for mining sequencing data
Connect to our server
$ ssh lbcd41.snv.jussieu.fr
$ mkdir <mydir>
$ cd <mydir>
What does this big fastq file contain?
mouse@GED-Server:~/raw_data$ more GKG-13.fastq
@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1 Header
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA Sequence
+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1 Header
bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh Sequence Quality (ASCII encoded)
@HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
]B]VWaaaaaagggfggggggcggggegdgfgeggbab
@HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA
+HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
aB^^afffffhhhhhhhhhhhhhhhhhhhhhhhchhhh
@HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
aBa^ddeeehhhhhhhhhhhhhhhhghhhhhhhefff
@HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT
+HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
aB^^eeeeegcggfffffffcfffgcgcfffffR^^]
@HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC
+HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
aBaaaeeeeehhhhhhhhhhhhfgfhhgfhhhhgga^^
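The four-line layout shown above (header, sequence, `+` header, ASCII-encoded quality) can be read programmatically. The following is a minimal sketch, not part of the original slides: the names `FastqRecord`, `read_fastq` and `phred_scores` are hypothetical, and the quality offset of 64 is an assumption based on the characters seen in this file (letters such as `h` would give implausibly high scores with an offset of 33).

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines (stops at the first empty line)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' line, which repeats the header
        quality = handle.readline().rstrip("\n")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 64) -> List[int]:
    """Decode ASCII-encoded qualities; offset=64 is an assumption for this file."""
    return [ord(char) - offset for char in quality]


# First record of GKG-13.fastq, copied from the slide.
EXAMPLE = (
    "@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1\n"
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA\n"
    "+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1\n"
    "bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh\n"
)

record = next(read_fastq(io.StringIO(EXAMPLE)))
print(record.name)
print(len(record.sequence), "nt")
print(phred_scores(record.quality)[:5])
```

On the real file, `open("GKG-13.fastq")` would replace the `io.StringIO` handle.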
How many sequence reads are in my file?
→ wc -l <path/to/my/file>
mouse@GED-Server:~/raw_data$ wc -l GKG-13.fastq
25703828 GKG-13.fastq
mouse@GED-Server:~/raw_data$ grep -c -e "^@" GKG-13.fastq
6425957
in python interpreter (integer division):
>>> 25703828 // 4
6425957
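Both counts agree here, but they do not measure the same thing: the line count divided by four relies on the record layout, whereas counting lines that start with `@` can overcount, because `@` is also a legal quality character. A small sketch of the difference follows; the function names are hypothetical, and the second demo record is made up so that its quality line starts with `@`.

```python
import io
from typing import TextIO


def count_reads(handle: TextIO) -> int:
    """Number of reads = number of lines / 4 (same idea as wc -l, then divide)."""
    line_count = sum(1 for _ in handle)
    if line_count % 4:
        raise ValueError("line count is not a multiple of 4: truncated file?")
    return line_count // 4


def count_at_lines(handle: TextIO) -> int:
    """Same idea as grep -c '^@': counts every line starting with '@'."""
    return sum(1 for line in handle if line.startswith("@"))


# Hypothetical two-read file; the second quality line starts with '@'.
DEMO = (
    "@read1\nACGTACGT\n+read1\nhhhhhhhh\n"
    "@read2\nACGTACGT\n+read2\n@hhhhhhh\n"
)

print(count_reads(io.StringIO(DEMO)))     # reads, from the line count
print(count_at_lines(io.StringIO(DEMO)))  # lines starting with '@'
print(25703828 // 4)                      # the calculation from the slide
```

When the two numbers differ, the line-count approach is the one to trust.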
Do my sequence reads contain the adapter?
→ cat <path/file> | grep CTGTAGG | wc -l
→ grep -c "CTGTAGG" <path/file>
mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | grep CTGTAGG | wc -l
6355061
mouse@GED-Server:~/raw_data$ grep -c "CTGTAGG" GKG-13.fastq
6355061
6 355 061 out of 6 425 957 sequences … not bad (98.9%)
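The proportion can be recomputed from the two counts, and the same check can be restricted to sequence lines only (the grep commands look at every line of the file, headers and qualities included). This is a sketch under those assumptions; `adapter_counts` and `percent` are hypothetical helper names, and the six reads are the ones displayed on the earlier slide.

```python
from typing import Iterable, Tuple

ADAPTER_START = "CTGTAGG"  # first 7 nt of the 3' adapter CTGTAGGCACCATCAAT


def adapter_counts(sequences: Iterable[str]) -> Tuple[int, int]:
    """Return (reads containing the adapter start, total reads)."""
    total = 0
    with_adapter = 0
    for sequence in sequences:
        total += 1
        if ADAPTER_START in sequence:
            with_adapter += 1
    return with_adapter, total


def percent(part: int, whole: int) -> float:
    return round(100 * part / whole, 1)


# The six reads shown in the fastq excerpt.
SLIDE_READS = [
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC",
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC",
    "TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT",
    "GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC",
]

print(adapter_counts(SLIDE_READS))
print(percent(6355061, 6425957), "% of all reads, from the grep counts")
```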
My 3’ adapter: CTGTAGGCACCATCAAT
By contrast:
mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | grep ATCTCGT | wc -l
308
A more advanced example of combining Unix commands
mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | perl -ne 'print if /^[ATGCN]{22}CTGTAGG/' | wc -l
• cat GKG-13.fastq: outputs the content of a file, line by line
• |: the output is passed to the input of the next command
• perl -ne: the perl interpreter is called with -ne options (loop & execute)
• 'print if /^[ATGCN]{22}CTGTAGG/': inline perl code; /^[ATGCN]{22}CTGTAGG/ is the regular expression
• |: the output is passed to the input of the next command
• wc -l: wc with -l option counts the lines
Result: 1 675 469 22nt long reads with 3’ flanking CTGTAGG adapter sequence
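For readers more comfortable with Python, the same filter can be sketched with the `re` module. This is an illustrative equivalent of the perl one-liner, not the command used in the course; `count_22nt_reads` is a hypothetical name and the input is the six reads from the fastq excerpt rather than the full file.

```python
import re
from typing import Iterable

# Same regular expression as in the perl one-liner:
# exactly 22 characters among A, T, G, C, N at the start of the line,
# immediately followed by the beginning of the adapter.
PATTERN = re.compile(r"^[ATGCN]{22}CTGTAGG")


def count_22nt_reads(lines: Iterable[str]) -> int:
    """Count the lines matching the pattern (what `wc -l` does after the filter)."""
    return sum(1 for line in lines if PATTERN.match(line))


SLIDE_READS = [
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",  # 22 nt before the adapter
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC",  # 30 nt before the adapter
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",  # 22 nt
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC",  # 30 nt
    "TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT",  # 21 nt
    "GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC",  # 16 nt
]

print(count_22nt_reads(SLIDE_READS))
```

Changing `{22}` to another length, or to a range such as `{20,24}`, gives the count for other read sizes.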
Clipping adapter sequences
Unix Operating Systems already contain powerful native tools for sequence analyses
cat GKG-13.fastq | perl -ne 'if (/^(.+CTGTAGG)/) {print "$1\n"}' | more
Final command line clipper:
mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | perl -ne 'if (/^([GATC]{18,})CTGTAGG/) {$count++; print ">$count\n"; print "$1\n"}' > clipped_GKG13.fasta
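The logic of the final clipper (keep the read part located before the adapter if it is at least 18 nt of G, A, T or C, and number the kept reads in a fasta file) can be sketched in Python as follows. This is only an illustration of the same regular expression; `clip_reads` is a hypothetical name, and the first two demo reads are made-up variants of slide reads in which the N was replaced by G. Note that `[GATC]` excludes N, so reads containing an N before the adapter are discarded.

```python
import re
from typing import Iterable, Iterator

# Same pattern as the perl clipper: 18 or more G/A/T/C, then the adapter start.
CLIP = re.compile(r"^([GATC]{18,})CTGTAGG")


def clip_reads(lines: Iterable[str]) -> Iterator[str]:
    """Yield fasta entries (>counter, clipped sequence) for the matching lines."""
    count = 0
    for line in lines:
        match = CLIP.match(line)
        if match:
            count += 1
            yield f">{count}\n{match.group(1)}\n"


DEMO_READS = [
    "TGGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",  # hypothetical: 22 nt insert, kept
    "GGGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC",  # hypothetical: 16 nt insert, too short
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",  # slide read with an N, discarded
]

fasta = "".join(clip_reads(DEMO_READS))
print(fasta, end="")
```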
Sequence Quality Control
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
FastQC, GUI version
http://bowtie-bio.sourceforge.net/
Bowtie aligns reads on indexed genomes
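A bowtie alignment (command lines)
mouse@GED-Server:~/instructor$ bowtie ../genomes/Dmel_r5.49 -f clipped_GKG13.fasta -v 1 -k 1 -p 6 --al droso_matched_GKG-13.fa --un unmatched_GKG13.fa -S > GKG13_bowtie_output.sam
• ../genomes/Dmel_r5.49: base name of the bowtie index
• -f clipped_GKG13.fasta: the reads are in a fasta file
• -v 1: at most 1 mismatch per alignment
• -k 1: report at most 1 alignment per read
• -p 6: use 6 threads
• --al droso_matched_GKG-13.fa: write the aligned reads to this file
• --un unmatched_GKG13.fa: write the unaligned reads to this file
• -S: output the alignments in SAM format
• > GKG13_bowtie_output.sam: redirect the alignments to this file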
# reads processed: 5930851
# reads with at least one reported alignment: 4992296 (84.18%)
# reads that failed to align: 938555 (15.82%)
Reported 4992296 alignments to 1 output stream(s)
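The summary printed by bowtie can be checked for consistency: aligned plus unaligned reads should equal the number of processed reads, and the percentages follow from the counts. The sketch below parses the three lines shown above; `parse_summary` is a hypothetical helper, not part of bowtie.

```python
import re
from typing import Dict

# Summary lines copied from the bowtie run above.
SUMMARY = """# reads processed: 5930851
# reads with at least one reported alignment: 4992296 (84.18%)
# reads that failed to align: 938555 (15.82%)
Reported 4992296 alignments to 1 output stream(s)
"""


def parse_summary(text: str) -> Dict[str, int]:
    """Map each '# label: count' line to its integer count."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"# (.+?): (\d+)", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


counts = parse_summary(SUMMARY)
processed = counts["reads processed"]
aligned = counts["reads with at least one reported alignment"]
failed = counts["reads that failed to align"]

print(aligned + failed == processed)
print(round(100 * aligned / processed, 2), round(100 * failed / processed, 2))
```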
The genome index is built beforehand with bowtie-build:
mouse@GED-Server:~/genomes$ bowtie-build Dmel_r5.49.fa Dmel_r5.49
Bowtie outputs
deepseq$ ls -laht
-rw-r--r-- 1 deepseq staff 351M Mar 24 17:46 GKG13_bowtie_output.tabulated
-rw-r--r-- 1 deepseq staff 156M Mar 24 17:46 droso_matched_GKG-13.fa
-rw-r--r-- 1 deepseq staff 28M Mar 24 17:46 unmatched_GKG13.fa
SAM alignment : $ more GKG13_bowtie_output.sam
Aligned reads: $ more droso_matched_GKG-13.fa
Unaligned reads: $ more unmatched_GKG13.fa
SAM - BAM
Formats
Raw sequence: Fastq (quality), Fasta (w/o quality)
Aligned sequence: Sam, Bam
• Sorted
• Indexed
• Compressed
Genome annotation: GFF, GTF
GFF - GTF
Pileup Format
seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
seq1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
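Each pileup line gives the reference name, the position, the reference base, the depth, the read bases and the base qualities. In the read-base column, `.` and `,` are matches to the reference on the forward and reverse strands, a letter is a mismatch, `^` followed by one character marks the start of a read, `$` marks the end of a read, and `+n`/`-n` introduce indels. A minimal sketch of how such a line could be summarized follows; `summarize_bases` and `parse_pileup_line` are hypothetical names and the code assumes whitespace-separated columns as displayed above.

```python
import re
from collections import Counter
from typing import Tuple


def summarize_bases(ref_base: str, bases: str) -> Counter:
    """Count the bases observed at one position of a pileup line."""
    # Remove read-start markers ('^' plus one mapping-quality character)
    # before removing the read-end markers ('$').
    cleaned = re.sub(r"\^.", "", bases).replace("$", "")
    counts: Counter = Counter()
    i = 0
    while i < len(cleaned):
        char = cleaned[i]
        if char in "+-":
            # Indel: sign, length in digits, then that many inserted/deleted bases.
            j = i + 1
            while j < len(cleaned) and cleaned[j].isdigit():
                j += 1
            i = j + int(cleaned[i + 1:j])
            continue
        if char in ".,":
            counts[ref_base.upper()] += 1  # match to the reference
        else:
            counts[char.upper()] += 1      # mismatch (or '*' for a deleted base)
        i += 1
    return counts


def parse_pileup_line(line: str) -> Tuple[str, int, str, int, Counter]:
    name, position, ref_base, depth, bases, _qualities = line.split()
    return name, int(position), ref_base, int(depth), summarize_bases(ref_base, bases)


LINE_277 = "seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<"
name, position, ref_base, depth, counts = parse_pileup_line(LINE_277)
print(name, position, ref_base, depth, dict(counts))
```

Positions where a non-reference base accounts for a sizeable share of the depth are the candidates a variant caller would examine further.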
Next week, we will perform an NGS analysis using the Galaxy framework.
We will speak about Accessibility, Reproducibility and Transparency.
Please have a look at http://galaxyproject.org/
You can register and try it.
Also, connect to http://lbcd41.snv.jussieu.fr with
login: (to be communicated)
password: (to be communicated)
and register (Menu “user” → “register”) with your email address.

Pasteur deep seq_analysis_theory_2016
