Pasteur deep seq_analysis_theory_2016

Deep Seq Data Analysis
Theoretical training
Christophe.antoniewski@upmc.fr
http://artbio.fr
Mouse Genetics
January 21, 2016, 13:30–15:00

Latest commercialized Sequencing Technology
Sequencing by pH variations in Ion Torrent

Sequencing Technologies : Quantitative Facts

Sequencing Technologies : Focus on Illumina technology

Deep sequencing applications
High throughput sequencing of DNA or RNA provides Qualitative (sequence) and Quantitative (number of reads) information

20-30nt RNA gel purification
Small RNA library (Biases)
Library “Bar coding”

ChIPseq library preparation
(Non Directional)

What can I do with my sequence reads ?

Flowchart of a sequencing project
What am I going to sequence? For what analysis? Think about the number of replicates.
Platform Selection
• Technical biases and limitations
• Specific benefits (read length, single or paired ends, number of reads)
Library Preparation
• Whole genome, whole exome, target enrichment
• Size selection
• Stranded/unstranded?
• Amplification
• Single-cell protocol
Sequencing
• Length of the read
• Single or paired ends
• Number of lanes (depth of sequencing)
Quality Control
• Adapter clipping
• Quality trimming
• Contaminants and sequencing errors
• Biases in GC content
Alignment / Assembly
• Alignment: Bowtie, BWA… (P Flicek & E Birney, Nature Methods 2009)
• Assembly: Velvet, Oases, Trinity, SOAP, SSAKE… (Zhang W, Chen J, et al. (2011), PLoS ONE 6(3))
Visualization & Statistics
• R, MATLAB & open source software tools
• Normalization (library comparison)
• Peak finding (binding sites, breakpoints, etc.)
• Differential calling (expression, variants, etc.)

Basic Material for mining sequencing data

Connect to our server
$ ssh lbcd41.snv.jussieu.fr
$ mkdir <mydir>
$ cd <mydir>

What does this big fastq file contain?
mouse@GED-Server:~/raw_data$ more GKG-13.fastq
@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1 Header
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA Sequence
+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1 Header
bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh Sequence Quality (ASCII encoded)
@HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
]B]VWaaaaaagggfggggggcggggegdgfgeggbab
@HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA
+HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
aB^^afffffhhhhhhhhhhhhhhhhhhhhhhhchhhh
@HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
aBa^ddeeehhhhhhhhhhhhhhhhghhhhhhhefff
@HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT
+HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
aB^^eeeeegcggfffffffcfffgcgcfffffR^^]
@HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC
+HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
aBaaaeeeeehhhhhhhhhhhhfgfhhgfhhhhgga^^
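
Each read in the listing above occupies four lines: header, sequence, a "+" line repeating the header, and the ASCII-encoded quality string. The four-line layout can be read with a few lines of Python. This is only a sketch: the function names are illustrative, and the quality offset of 64 is an assumption (it is suggested by characters such as "h" and "B", typical of older Illumina pipelines, but it is not stated on the slide).

```python
from itertools import islice

def read_fastq(lines):
    """Yield (name, sequence, quality) tuples from an iterable of FASTQ lines."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return  # end of input (or a truncated last record)
        header, sequence, _plus, quality = record
        yield header[1:], sequence, quality  # drop the leading "@"

def phred_scores(quality, offset=64):
    """Decode a quality string; offset=64 is an assumption (Phred+64)."""
    return [ord(ch) - offset for ch in quality]

example = [
    "@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1\n",
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA\n",
    "+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1\n",
    "bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh\n",
]

for name, seq, qual in read_fastq(example):
    print(name, len(seq), phred_scores(qual)[:5])
```

With a real file, `read_fastq(open("GKG-13.fastq"))` would stream the records without loading the whole file in memory.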

How many sequence reads are in my file?
→ wc -l <path/to/my/file>
mouse@GED-Server:~/raw_data$ wc -l GKG-13.fastq
25703828 GKG-13.fastq
mouse@GED-Server:~/raw_data$ grep -c -e "^@" GKG-13.fastq
6425957
In the Python interpreter (4 lines per read):
>>> 25703828 // 4
6425957
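
Both counts agree here, but "@" is also a legal quality character in FASTQ, so counting lines that start with "@" can overestimate the number of reads in other files. Dividing the line count by four does not have this problem. A small sketch with a made-up four-line-per-read toy input (the read names and qualities below are invented for illustration):

```python
def count_reads(lines):
    """Number of reads = number of lines divided by 4."""
    n_lines = sum(1 for _ in lines)
    return n_lines // 4

def count_at_lines(lines):
    """Equivalent of grep -c '^@': counts every line starting with '@'."""
    return sum(1 for line in lines if line.startswith("@"))

# Invented toy data: the quality line of read1 happens to start with "@".
toy = [
    "@read1", "ACGT", "+", "@hhh",
    "@read2", "TTGA", "+", "hhhh",
]

print(count_reads(toy))     # number of reads
print(count_at_lines(toy))  # number of lines starting with "@"
```

On this toy input the two functions disagree, which is the reason to prefer the line count.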

Do my sequence reads contain the adapter?
→ cat <path/file> | grep CTGTAGG | wc -l
→ grep -c "CTGTAGG" <path/file>
mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | grep CTGTAGG | wc -l
6355061
mouse@GED-Server:~/raw_data$ grep -c "CTGTAGG" GKG-13.fastq
6355061
6 355 061 out of 6 425 957 sequences
… not bad (98.9%)
My 3’ adapter: CTGTAGGCACCATCAAT
mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | grep ATCTCGT| wc -l
308
By contrast, an unrelated 7-nt sequence (ATCTCGT) is found in only 308 lines.
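
The grep commands above search every line of the file, headers and quality strings included. A minimal Python sketch that restricts the search to the sequence lines and reports the percentage is given below; the function name is illustrative, and the input is assumed to be a strict four-line-per-read FASTQ.

```python
def adapter_counts(fastq_lines, adapter="CTGTAGG"):
    """Return (reads containing the adapter, total reads).

    Only the second line of each 4-line record (the sequence) is searched.
    """
    total = with_adapter = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # sequence line
            total += 1
            if adapter in line:
                with_adapter += 1
    return with_adapter, total

# Percentage computed from the counts reported on the slide.
print(round(100 * 6355061 / 6425957, 1))
```

For a real file, `adapter_counts(open("GKG-13.fastq"))` would return both numbers in a single pass.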

A more advanced example of combining Unix commands
mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | perl -ne 'print if /^[ATGCN]{22}CTGTAGG/' | wc -l
• cat GKG-13.fastq : outputs the content of a file, line by line
• | : the output is passed to the input of the next command
• perl -ne : the perl interpreter is called with the -ne options (loop & execute)
• 'print if /^[ATGCN]{22}CTGTAGG/' : inline perl code, here a regular expression
• | : the output is passed to the input of the next command
• wc -l : wc with the -l option counts the lines
Result: 1 675 469 22-nt long reads with a 3’ flanking CTGTAGG adapter sequence
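
The same idea can be extended from the single 22-nt class to the whole size distribution of the sequences that precede the adapter. The sketch below is an illustration, not the method used on the slide: it takes the first occurrence of the adapter in each read, which differs slightly from the anchored perl regular expression when the adapter motif occurs more than once.

```python
from collections import Counter

def insert_length_distribution(sequences, adapter="CTGTAGG"):
    """Count reads by the length of the sequence found before the adapter."""
    lengths = Counter()
    for seq in sequences:
        pos = seq.find(adapter)  # -1 when the adapter is absent
        if pos != -1:
            lengths[pos] += 1
    return lengths

# Four reads taken from the fastq listing shown earlier.
reads = [
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC",
    "TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT",
    "GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC",
]

for length, n in sorted(insert_length_distribution(reads).items()):
    print(length, n)
```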

Clipping adapter sequences
Unix Operating Systems already contain powerful native tools for sequence analyses
mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | perl -ne 'if (/^(.+CTGTAGG)/) {print "$1\n"}' | more
Final command line clipper:
mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | perl -ne 'if (/^([GATC]{18,})CTGTAGG/) {$count++; print ">$count\n"; print "$1\n"}' > clipped_GKG13.fasta
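
For comparison, the clipping step can also be sketched in Python with the same regular expression: keep reads made of at least 18 unambiguous bases followed by the adapter, and write them as numbered FASTA records. The function name is illustrative, and unlike the perl one-liner this version only looks at the sequence line of each four-line record.

```python
import re

# Same pattern as in the perl clipper: at least 18 nt of A/C/G/T, then the adapter.
CLIP = re.compile(r"^([GATC]{18,})CTGTAGG")

def clip_reads(fastq_lines):
    """Yield FASTA records (">n" header + clipped sequence) for matching reads."""
    count = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:  # skip header, "+" and quality lines
            continue
        match = CLIP.match(line)
        if match:
            count += 1
            yield ">%d\n%s\n" % (count, match.group(1))

# Invented toy records: one clippable read, one too short, one containing N.
toy = [
    "@r1", "TGGAACTTCATACCGTGCTCTCCTGTAGGCACCATCAA", "+", "h" * 38,
    "@r2", "ACGTACGTCTGTAGGCACCATCAAT", "+", "h" * 25,
    "@r3", "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA", "+", "h" * 38,
]

print("".join(clip_reads(toy)), end="")
```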

Sequence Quality Control
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
FastQC, GUI version

http://bowtie-bio.sourceforge.net/
Bowtie aligns reads on indexed genomes

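A bowtie alignment (command lines)
Step 1, index the genome (done once):
mouse@GED-Server:~/genomes$ bowtie-build Dmel_r5.49.fa Dmel_r5.49
Step 2, align the clipped reads:
mouse@GED-Server:~/instructor$ bowtie ../genomes/Dmel_r5.49 -f clipped_GKG13.fasta -v 1 -k 1 -p 6 --al droso_matched_GKG-13.fa --un unmatched_GKG13.fa -S > GKG13_bowtie_output.sam
• ../genomes/Dmel_r5.49 : base name of the bowtie index
• -f clipped_GKG13.fasta : the input reads are in fasta format
• -v 1 : at most 1 mismatch per alignment
• -k 1 : report at most 1 alignment per read
• -p 6 : use 6 threads
• --al droso_matched_GKG-13.fa : write the reads that aligned to this file
• --un unmatched_GKG13.fa : write the reads that failed to align to this file
• -S : output in SAM format
• > GKG13_bowtie_output.sam : redirect the alignments to this file
Bowtie report:
# reads processed: 5930851
# reads with at least one reported alignment: 4992296 (84.18%)
# reads that failed to align: 938555 (15.82%)
Reported 4992296 alignments to 1 output stream(s)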
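
When the same alignment has to be repeated on several libraries, the command line can be assembled in a script rather than retyped. The sketch below only rebuilds the argument list used above and redirects standard output to the SAM file; the function names are illustrative, and it assumes that the bowtie executable is on the PATH when `run_bowtie` is actually called.

```python
import subprocess

def bowtie_command(index, reads_fasta, aligned, unaligned,
                   mismatches=1, threads=6):
    """Return the bowtie argument list used on the slide, as a Python list."""
    return [
        "bowtie", index,
        "-f", reads_fasta,        # fasta input
        "-v", str(mismatches),    # maximum number of mismatches
        "-k", "1",                # at most one alignment reported per read
        "-p", str(threads),       # number of threads
        "--al", aligned,          # reads that aligned
        "--un", unaligned,        # reads that failed to align
        "-S",                     # SAM output on standard output
    ]

def run_bowtie(command, sam_path):
    """Run bowtie and write its standard output to sam_path."""
    with open(sam_path, "w") as sam:
        subprocess.run(command, stdout=sam, check=True)

cmd = bowtie_command("../genomes/Dmel_r5.49", "clipped_GKG13.fasta",
                     "droso_matched_GKG-13.fa", "unmatched_GKG13.fa")
print(" ".join(cmd))
# run_bowtie(cmd, "GKG13_bowtie_output.sam")  # requires bowtie to be installed
```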

Bowtie outputs
deepseq$ ls -laht
-rw-r--r-- 1 deepseq staff 351M Mar 24 17:46 GKG13_bowtie_output.tabulated
-rw-r--r-- 1 deepseq staff 156M Mar 24 17:46 droso_matched_GKG-13.fa
-rw-r--r-- 1 deepseq staff 28M Mar 24 17:46 unmatched_GKG13.fa
SAM alignment : $ more GKG13_bowtie_output.sam
Aligned reads: $ more droso_matched_GKG-13.fa
Unaligned reads: $ more unmatched_GKG13.fa
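
A SAM file is a tab-separated text file: header lines start with "@", and each alignment line carries the read name, a bit flag, the reference name, the 1-based position and further fields. In the flag, bit 4 means "unmapped" and bit 16 means "aligned to the reverse strand". A minimal parsing sketch follows; the two alignment lines are invented for illustration and are not taken from GKG13_bowtie_output.sam.

```python
def parse_sam_line(line):
    """Extract a few fields from one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "read": fields[0],
        "mapped": not (flag & 4),             # bit 4: read is unmapped
        "strand": "-" if flag & 16 else "+",  # bit 16: reverse strand
        "chrom": fields[2],
        "pos": int(fields[3]),                # 1-based leftmost position
        "sequence": fields[9],
    }

def parse_sam(lines):
    """Yield parsed alignments, skipping the '@' header lines."""
    for line in lines:
        if not line.startswith("@"):
            yield parse_sam_line(line)

# Invented example lines (hypothetical read names and coordinates).
toy_sam = [
    "@HD\tVN:1.0\tSO:unsorted\n",
    "1\t16\t2L\t1000\t255\t22M\t*\t0\t0\tGAGAGCACGGTATGAAGTTCCA\tIIIIIIIIIIIIIIIIIIIIII\n",
    "2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTACGTACGTACGT\tIIIIIIIIIIIIIIIIIIII\n",
]

for aln in parse_sam(toy_sam):
    print(aln["read"], aln["mapped"], aln["strand"], aln["chrom"], aln["pos"])
```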

Formats
Raw sequence: Fastq (with quality), Fasta (w/o quality)
Aligned sequence: Sam, Bam
• Sorted
• Indexed
• Compressed
Genome annotation: GFF, GTF

GFF - GTF

Pileup Format
seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
seq1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
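
Each pileup line above gives the sequence name, the 1-based position, the reference base, the read depth, the read bases and the base qualities. In the read-bases column, "." and "," are matches on the forward and reverse strand, a letter is a mismatch, "^" marks a read start (followed by one mapping-quality character) and "$" marks a read end. A hedged sketch of how this column can be summarized is shown below; the function names are illustrative.

```python
from collections import Counter

def count_pileup_bases(bases):
    """Summarize the read-bases column of one pileup line."""
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":          # read start: skip "^" and the mapping-quality char
            i += 2
        elif c == "$":        # read end marker
            i += 1
        elif c in "+-":       # indel: sign, length, then the inserted/deleted bases
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            counts["indel"] += 1
            i = j + int(bases[i + 1:j])
        elif c in ".,":       # match on forward (.) or reverse (,) strand
            counts["match"] += 1
            i += 1
        else:                 # mismatch (or "*" deletion placeholder)
            counts[c.upper()] += 1
            i += 1
    return counts

def parse_pileup_line(line):
    """Split one whitespace-separated pileup line into named fields."""
    name, pos, ref, depth, bases, quals = line.split()
    return {"name": name, "pos": int(pos), "ref": ref,
            "depth": int(depth), "counts": count_pileup_bases(bases)}

line = "seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+"
print(parse_pileup_line(line))
```

At position 273 the summary separates the reads that agree with the reference T from the single read carrying an A.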

Next week, we will perform an NGS analysis using the Galaxy framework.
We will talk about Accessibility, Reproducibility and Transparency.
Please have a look at http://galaxyproject.org/
You can register and try it.
Also, connect to http://lbcd41.snv.jussieu.fr with
login: (to be communicated)
password: (to be communicated)
AND
Register (Menu “user” → “register”) with your email address
