Wednesday
• SSH key and virtual machine
• Theoretical course : annotation in genomes
• Practical course de novo : RepeatModeler and TEdenovo
• Practical course masking : RepeatMasker and TEannot
SSH keys
Unix: ssh-keygen
Mac: ssh-keygen -t rsa
Windows: use PuTTYgen
Create account
at https://biosphere.france-bioinformatique.fr/
Request to join the group « bioinfo_te_2020 »
Use your public SSH key, starting with « ssh-rsa… »
Detection of repetitive elements
in assembled genomes
Florian Maumus
BioinfoTE CNRS thematic school
28 Sept. – 2 Oct., Fréjus, France
The C-value paradox
• Organism complexity does not correlate with increasing C-value
• Similar organisms can show large differences in C-values
• The amount of protein-coding DNA can be a minor fraction of genomic DNA
The C-value paradox
Genome size variations can be explained by:
• The size of intergenic regions
• The size of regulatory regions
• The size of introns
• The presence of pseudogenes
• WGD and polyploidization events
• The amount of repetitive DNA
• The amount of genomic dark matter
The Repeatome includes:
Transposable elements
Endogenous viruses
Tandem repeats
Ribozymes
Genes
…
Repeat complement = Repeatome
Pioneer approach to quantify
repetitive fraction in genomes
Roy J. Britten
1919 - 2012
• Reassociation kinetic experiments
• The rate at which a particular sequence will
reassociate is proportional to the number of
times it is found in the genome.
E. coli DNA
Calf DNA
Britten & Kohne, 1968
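The reassociation principle above is commonly summarised with ideal second-order kinetics. The expression below is the standard textbook form rather than something shown on the slide; the symbols C, C_0, k and t are introduced here for illustration only.

```latex
% C    : concentration of single-stranded DNA remaining at time t
% C_0  : initial concentration of single-stranded DNA
% k    : reassociation rate constant of the sequence class considered
\frac{C}{C_0} = \frac{1}{1 + k\,C_0\,t},
\qquad
C_0 t_{1/2} = \frac{1}{k}
```

Under this idealisation, a sequence class with a larger rate constant reaches half-reassociation at a lower C_0 t value, which is how highly repeated sequences separate from single-copy DNA on a reassociation curve.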
• Address the diversity and evolution of transposable elements
• Genome masking for gene prediction
• Understand genome structure & composition
• Address genome evolution
• Investigate gene regulation & transcription networks
• Qualify genomic landscapes (e.g. epigenetic marks, 3D organization)
Why is it important to identify repeats
and repeat-derived elements in genomes?
Burst and Decay
Complexity of the repeatome
Maumus & Quesneville
Current Opinion in Plant Biology 2016
Maize
2.3 Gb genome
Single run of de novo detection
=> 85% repeats
Human
3.2 Gb genome
Decades of annotation
to reach 50% repeats
Different history, different challenges
What are we looking for ?
How can we define a transposable element
from a genomic point of view?
• Autonomous TEs encode the proteins necessary to their own duplication
• TEs have specific structures (e.g. terminal repeats)
• TEs are repetitive in a genome (multicopy)
• We know all kinds of TEs
=> NO, the TE complement is defined by a continuum ranging from
functional copies to genomic dark matter.
=> In most genomes, some repetitive elements remain unclassified and we
have no idea what they represent
A - Library-based searches
• banks of transposable element sequence data
B - De novo
• Signature-based
• K-mer-based (strict, extended, cloud)
• High scoring pairs (BLAST-based)
Bioinformatic approaches to identify
repetitive elements in genomes
(interspersed)
! Not a « de novo » approach
For instance, use an existing repeat library if you need to annotate a human
genome
• TE libraries are available for many organisms
• They can be produced from manual curation and/or by automated search
• They can be more or less exhaustive
• They can contain host gene sequences and false positives
• They can be high quality for model species like human or thale cress
• There are different types of databases (general, TE type, host type)
TE sequence databases
Widely used repeat databases : Repbase
Host-specific databases
Type-specific databases
2.0 repeat databases : Dfam (HMM profiles)
Signature-based approaches
E. Lerat, Heredity, 2010
A large choice of tools
…
Finds specific TEs only
Burst and Decay
• K-mers are used in different ways for repeat detection: count, anchors,
and clouds
• K-mers: oligonucleotides of length « k », e.g. ATGC is a 4-mer
• Assuming a random sequence with equal base frequencies, the
probability of finding a given k-mer at a given position is 1/4^k
• To determine a reasonable oligo length (k) for analyses of genomes of
different lengths (n), we can use the formula “k = log4 (n) + 1”
• K=15 for A. thaliana (120 Mb); K=16 for H. sapiens (3 Gb)
• For instance: 3 Gb / 4^16 ≈ 0.7 expected occurrences of a given 16-mer
• By chance alone, the probability of finding 10 occurrences of the same
16-mer is therefore very low
K-mers
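The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and the Poisson approximation used for the "10 occurrences" estimate is an assumption that is not stated on the slide.

```python
import math


def suggested_k(genome_size: int) -> float:
    """Unrounded k from the slide's rule of thumb: k = log4(n) + 1."""
    return math.log(genome_size, 4) + 1


def expected_occurrences(genome_size: int, k: int) -> float:
    """Expected count of one given k-mer in a random genome: n / 4^k."""
    return genome_size / 4 ** k


def poisson_tail(mean: float, at_least: int) -> float:
    """P(X >= at_least) for a Poisson variable with the given mean.

    Assumption: k-mer counts in a random sequence are treated as Poisson.
    """
    below = sum(
        math.exp(-mean) * mean ** i / math.factorial(i) for i in range(at_least)
    )
    return 1.0 - below


if __name__ == "__main__":
    human = 3_000_000_000  # 3 Gb, as on the slide
    mean_16mer = expected_occurrences(human, 16)
    print("unrounded k for 3 Gb:", round(suggested_k(human), 2))
    print("expected occurrences of a 16-mer:", round(mean_16mer, 2))
    print("P(at least 10 occurrences by chance):", poisson_tail(mean_16mer, 10))
```

A k-mer seen ten times or more in such a genome is therefore unlikely to be a chance event, which is what makes simple counting informative.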
• Rationale: Repeats are identical upon integration (duplication event)
• Popular tools: Tallymer, Jellyfish, DSK
• How does it work? K-mers are simply counted and repetitive k-mers are
mapped onto the genome sequence
• Pros: Very fast; helps gauge the extent of the first-layer repeatome
• Cons: Works well only with very recent repeats
…TAGGGTACGTGATGATCCGTAGCTAGCCTAGCTAAAGTCCCGATTTAGC…
…TAGGGTACGTGATGATCCGTAGCTAGCCTAGCTAAAGTCCCGATTTAGC…
…TAGGGTACGTGATGATCCGTAGCTAGCCTAGCTAAAGTCCCGATTTAGC…
…TAGGGTACGTGAAGATCCGTAGCTAGCCTAGCTATAGTCCCGATTTAGC…
…TAGGGTACGAGATGATCCGTAGCTAGCCTAGCTAAAGTCCCCATTTAGC…
…TAGGGTACGTGATGATCCGTAGCTAGCCTAGCTAAAGTCCCGATTTAGC…
K-mers - Simple
time
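The count-then-map idea can be sketched in a few lines of Python. This is a toy illustration, not how Tallymer, Jellyfish or DSK are implemented: it holds everything in memory and ignores the reverse strand. The function names and the `min_count` threshold are hypothetical.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer of the sequence (forward strand only)."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repetitive_positions(sequence: str, k: int, min_count: int) -> list[int]:
    """0-based start positions whose k-mer occurs at least min_count times."""
    seq = sequence.upper()
    counts = count_kmers(seq, k)
    return [
        i for i in range(len(seq) - k + 1) if counts[seq[i:i + k]] >= min_count
    ]


if __name__ == "__main__":
    toy = "ACGTACGTTTGACGTA"
    print(count_kmers(toy, 4).most_common(2))
    print(repetitive_positions(toy, 4, min_count=2))
```

Because only exact k-mer matches are counted, a single substitution in a copy removes up to k overlapping k-mers from the tally, which is why this approach favours very recent repeats.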
• Rationale: Repeats conserve k-mers that are untouched by random
mutations
• Popular tools: RepeatScout
• How does it work? Repetitive k-mers are detected and used as anchors to begin
the alignment of flanking sequences. The process stops when the
alignment score stops increasing, and a consensus sequence is derived
• Pros: Fast; provides a library of repetitive elements; filters low-complexity
sequences; works well with short repeats (100-200 bp)
• Cons: Consensus sequences are often fragmented
…TAGGGTACGTGAAGATCCGTAGCTAGCCTAGCTATAGTCCCGATTTAGC…
…TAGGGTACGAGATGATCCGTAGCTAGCCTAGCTAAAGTCCCCATTTAGC…
…TAGGGTACGTGATGATCCGTAGCTAGCCTAGCTAAAGTCCCGATTTAGC…
K-mers - Extended
• Rationale: k-mers that are affected by random mutations can be very similar
to other repetitive k-mers. A high local density of these k-mers indicates
sequences that potentially result from duplication
• Popular tools: P-clouds (De Koning et al.)
• How does it work? Repetitive k-mers are detected and used to build clouds
together with similar k-mers presenting 1-3 SNPs. It takes either full
genomes or known copies to build the seed.
• Pros: Fast, deep repeatome annotation
• Cons: Risk of high false positive rate, fragmented annotation
In-cloud k-mers
Duplication
K-mers - Clouds
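The cloud idea can be illustrated with a small Python sketch. It is a deliberate simplification and not the P-clouds algorithm itself: a cloud here is just the set of frequent "core" k-mers plus any observed k-mer within a few mismatches of a core k-mer. The names `build_cloud`, `core_min` and `max_mismatches` are hypothetical.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length k-mers."""
    return sum(x != y for x, y in zip(a, b))


def build_cloud(counts: Counter, core_min: int, max_mismatches: int) -> set[str]:
    """Core k-mers (count >= core_min) plus observed k-mers close to a core k-mer.

    Assumes all k-mers in `counts` have the same length.
    """
    core = {kmer for kmer, n in counts.items() if n >= core_min}
    cloud = set(core)
    for kmer in counts:
        if kmer not in core and any(
            hamming(kmer, c) <= max_mismatches for c in core
        ):
            cloud.add(kmer)
    return cloud


if __name__ == "__main__":
    observed = Counter({"ACGTAC": 5, "ACGTAG": 1, "TTTTTT": 1})
    print(sorted(build_cloud(observed, core_min=3, max_mismatches=1)))
```

Letting low-count k-mers into the cloud is what allows older, mutated copies to be annotated; it is also the source of the false-positive risk listed above, since unrelated k-mers can sit within a few mismatches of a core k-mer by chance.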
• Rationale: Different copies from a repeat family share sequence similarity
over long evolutionary timescales. A given copy should produce BLAST hits
against cognate copies.
• Popular tools: Piler, Recon, Silix
• How does it work? The genome is compared to itself using BLASTn. High-scoring
pairs are grouped into clusters on the basis of similarity and coverage.
Sequences from each cluster are aligned together to derive a consensus
sequence.
• Pros: Provides a library of potentially full-length repetitive elements that
accounts for more or less conserved families
• Cons: Clustering can be a very long process, sometimes endless
…TAGGGTACGTGAAGATCCGTAGCTAGCCTAGCTATAGTCCCGATTTAGC…
||||||||| || ||||||||||||||||||||| |||||| |||||||
…TAGGGTACGAGATGATCCGTAGCTAGCCTAGCTAAAGTCCCCATTTAGC…
High-scoring pairs (BLAST-based)
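The grouping of high-scoring pairs into clusters can be sketched as single-linkage clustering with a union-find structure. This is only an illustration of the principle: Piler, Recon and Grouper apply their own, more elaborate criteria (coverage, boundaries, copy number). The copy identifiers and the function name `cluster_pairs` are hypothetical.

```python
def cluster_pairs(pairs: list[tuple[str, str]]) -> list[list[str]]:
    """Single-linkage clustering: copies joined by any chain of HSPs share a cluster."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        # Register unseen items as their own root, then walk up with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters: dict[str, set[str]] = {}
    for item in list(parent):
        clusters.setdefault(find(item), set()).add(item)
    # Largest clusters first, members sorted for reproducible output.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: (-len(c), c))


if __name__ == "__main__":
    hsps = [("copy1", "copy2"), ("copy2", "copy3"), ("copy4", "copy5")]
    for members in cluster_pairs(hsps):
        print(members)
```

Single linkage is the simplest choice but chains unrelated copies together as soon as one spurious hit connects them, which is one reason real tools add similarity and coverage filters before clustering.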
High-scoring pairs - pipelines
• Rationale: Combining different tools produces more exhaustive repeat libraries
• Popular tools:
• RepeatModeler: RepeatScout + Recon (+ LTR tool in v2)
• TEdenovo: Grouper + Piler + Recon + RepeatScout (+ LTR tool)
• How does it work? Several detection tools are launched sequentially or in parallel.
Redundancy between consensus sequences is removed afterwards.
• Pros: Provides a library of potentially full-length repetitive elements that
accounts for most repeat families
• Cons: Usually takes a genome subset as input
Maumus & Quesneville, 2014
Extensive combination : Pirate
Extensive combination : EDTA
Consensus sequences: useful artefacts
• Simple clustering programs are based on sequence identity, not relationships
• COSEG is a program which automatically identifies repeat subfamilies using
significant co-segregating ( 2-3 bp ) mutations.
• AnTE is a probabilistic approach that infers the likelihood of each copy for
being an ancestral (master) copy
Wacholder et al. 2014
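To make concrete why a consensus is a "useful artefact", here is a minimal majority-rule consensus over copies that are already aligned. It is a sketch under stated assumptions: the input rows have equal length, "-" marks a gap, and ties are resolved by the first base encountered. It is not the method used by COSEG, AnTE or any pipeline named in these slides, and `majority_consensus` is a hypothetical name.

```python
from collections import Counter


def majority_consensus(aligned: list[str]) -> str:
    """Most frequent character per alignment column; gap-majority columns are dropped."""
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned):
        base, _ = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


if __name__ == "__main__":
    copies = ["ACGTA", "ACCTA", "ACGTT"]
    print(majority_consensus(copies))
```

The resulting sequence may match none of the input copies exactly: it is built column by column from sequence identity and carries no information about which copies descend from which.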
• Find the right tool for each question you are addressing
• Find an acceptable trade-off between
• Sensitivity
• Specificity
• Speed
• Quality
• Few benchmarks are available, they are often not exhaustive (tools)
• Benchmark definition is also loose and variable (data, organisms)
• Compare and/or combine different tools
• Carefully check the settings
• Perform manual curation or at least look over the outputs
• Don’t expect to reach perfection with any tool(s)
Take home
Hands on popular tools
Practical work
• RepeatModeler (tetools package)
• TEdenovo (REPET package)
Our beast today will be the diatom
Phaeodactylum tricornutum (stramenopile)
Genome size = 27 Mb
Hands on popular tools : RepeatModeler
How it works:
• It performs a first layer of detection with RepeatScout
• Then it performs successive runs of Recon using masked genomic subsets
of increasing size to identify more divergent repeats
• The new version also proposes to run an LTR prediction step
• It proposes a classification tool « RepeatClassifier »
Hands on popular tools : RepeatModeler
Process overview
Round #1: RepeatScout => RS library
Rounds #2 to #n (increasing sample size): TRF mask, RepeatMasker, Recon => Recon library
RS library + Recon library => combine => RModeler library
Hands on popular tools : RepeatModeler
The magic numbers in the RepeatModeler Perl script
Hands on popular tools : RepeatModeler
• Two command lines:
BuildDatabase -name Pt Pt.fa
RepeatModeler -database Pt -pa 4 > rmod.out 2>&1 &
sort -k 2,2nr RM_21.FriSep180904562020/round-1/sampleDB-1.fa.lfreq
RepeatModeler – Round-1 (=RepeatScout)
Build k-mer frequency table
RepeatModeler – Round-1 (=RepeatScout)
Build consensus, retrieve copies, filter, and refine consensus for each family
Output library = consensi-refined.fa
Price et al. 2005
RepeatModeler – Round-2 (= BLAST + Recon)
Main steps are as follows:
• BLAST batches against each other
• build families of hits with Recon
• infer consensus
• refine consensus
more RM_21.FriSep180904562020/round-2/msps.out
RepeatModeler – Round-2 (= BLAST + Recon)
See the BLAST output; it will be the input for Recon
After the last round of Recon, the two consensus libraries are combined
Output library = Pt-families.fa
With this example, we get:
• 143 consensus sequences from RepeatScout
• 4 consensus sequences from Recon
RepeatModeler – Round-2 (= BLAST + Recon)
Finishing
RepeatClassifier
• Homology-based classification module
• Compares the consensus generated by the various tools to
• RepeatMasker Repeat Protein DB
• RepeatMasker libraries (e.g. Dfam and/or RepBase)
RepeatClassifier -consensi Pt-families.fa
Hands on popular tools : TEdenovo
How it works:
• It performs a single run with the same input based on all-by-all BLAST
• Several steps are parallelized (Slurm or SGE)
• It uses Recon, Piler and Grouper to build repeat families
• It offers to run RepeatScout (v2.2)
• It also offers to run an LTR prediction step
• It includes a classification tool: the REPET classification utility (PASTEC)
Hands on popular tools : TEdenovo
Process overview
The magic numbers in the TEdenovo configuration file (TEdenovo.cfg)
Hands on popular tools : TEdenovo
Before starting, the script PreProcess.py can help prepare your sample
PreProcess.py -S 1 -i ABQD01.1.fsa_nt -v 3
PreProcess.py
In this case, we will just run step 1 to make sure that the headers won’t be a source
of errors
See the stats:
more ABQD01.1.fsa_nt_CheckFasta/ABQD01.1.fsa_nt.stats
See what happened to the headers (quite ugly but let's move on):
grep '>' ABQD01.1.fsa_nt_CheckFasta/ABQD01.1.fsa_nt.formated
Before starting, edit the configuration file « TEdenovo.cfg »
MySQL database credentials
Job manager
Unique identifier (name of fasta file)
Working directory
Essential
banks
Hands on popular tools : TEdenovo
Ready to launch!
nohup launch_TEdenovo.py -P Pt -C TEdenovo.cfg -f MCL >& TEdenovo.log &
This command will launch the 8 steps of TEdenovo sequentially
Hands on popular tools : TEdenovo
Step 1: Genomic sequences are cut and grouped into batches
Cmd= TEdenovo.py -P Pt -C TEdenovo.cfg -S 1
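What "cut and grouped into batches" means can be sketched as follows. This is an illustration of the idea only, not REPET's code: the chunk length, overlap and batch size are set in TEdenovo.cfg, and the values and function names used here are hypothetical.

```python
def cut_into_chunks(sequence: str, chunk_length: int, overlap: int) -> list[tuple[int, int, str]]:
    """Cut a sequence into overlapping chunks; returns (start, end, chunk), 1-based inclusive."""
    if not 0 <= overlap < chunk_length:
        raise ValueError("overlap must be >= 0 and smaller than chunk_length")
    if not sequence:
        return []
    step = chunk_length - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_length, len(sequence))
        chunks.append((start + 1, end, sequence[start:end]))
        if end == len(sequence):
            break
        start += step
    return chunks


def make_batches(chunks: list, chunks_per_batch: int) -> list[list]:
    """Group consecutive chunks into batches of at most chunks_per_batch."""
    return [
        chunks[i:i + chunks_per_batch]
        for i in range(0, len(chunks), chunks_per_batch)
    ]


if __name__ == "__main__":
    toy_chunks = cut_into_chunks("ABCDEFGHIJ", chunk_length=4, overlap=1)
    for batch in make_batches(toy_chunks, chunks_per_batch=2):
        print(batch)
```

Cutting the genome this way lets the all-by-all comparison of the next step be split into independent jobs, one per pair of batches.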
Hands on popular tools : TEdenovo
Step 2: All-by-all BLAST of batches
Cmd= TEdenovo.py -P Pt -C TEdenovo.cfg -S 2 -s Blaster
Jobs are launched in parallel; check job status with squeue
Have a look at the output file:
less Pt_Blaster/Pt.align.not_over.filtered
Hands on popular tools : TEdenovo
Step 3: Clustering of HSPs (High-Scoring Pairs) with Grouper, Piler and Recon
Cmd=
TEdenovo.py -P Pt -C TEdenovo.cfg -S 3 -s Blaster -c Grouper
TEdenovo.py -P Pt -C TEdenovo.cfg -S 3 -s Blaster -c Piler
TEdenovo.py -P Pt -C TEdenovo.cfg -S 3 -s Blaster -c Recon
Have a look at the output files:
grep -c '>' Pt_Blaster_Grouper/Pt_Blaster_Grouper_10elem_20seq.fa
449
grep -c '>' Pt_Blaster_Piler/Pt_Blaster_Piler.map.filtered-10-20.flank_size0.fa
11
grep -c '>' Pt_Blaster_Recon/Pt_Blaster_Recon.map.filtered-10-20.flank_size0.fa
416
Piler is very stringent!
Hands on popular tools : TEdenovo
Step 4: Align HSPs for each cluster and build consensus
Cmd=
TEdenovo.py -P Pt -C TEdenovo.cfg -S 4 -s Blaster -c Grouper -m Map
TEdenovo.py -P Pt -C TEdenovo.cfg -S 4 -s Blaster -c Piler -m Map
TEdenovo.py -P Pt -C TEdenovo.cfg -S 4 -s Blaster -c Recon -m Map
Have a look at the output files:
grep -c '>' Pt_Blaster_Grouper_Map/Pt_Blaster_Grouper_Map_consensus.fa
27
grep -c '>' Pt_Blaster_Piler_Map/Pt_Blaster_Piler_Map_consensus.fa
1
grep -c '>' Pt_Blaster_Recon_Map/Pt_Blaster_Recon_Map_consensus.fa
26
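The `grep -c '>'` checks above count FASTA header lines. When sequence lengths are also of interest, a small Python helper does both. This is a convenience sketch with hypothetical function names; it reads any text handle, so the file paths from the slides can be passed through `open(...)`.

```python
import io


def count_fasta_records(handle) -> int:
    """Number of lines starting with '>' (one per FASTA record)."""
    return sum(1 for line in handle if line.startswith(">"))


def fasta_lengths(handle) -> dict[str, int]:
    """Map each full header line (without '>') to its sequence length."""
    lengths: dict[str, int] = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


if __name__ == "__main__":
    # Toy input; with a real library, use open("some_consensus.fa") instead.
    toy = ">cons1\nACGTAC\nGT\n>cons2\nGGCC\n"
    print(count_fasta_records(io.StringIO(toy)))
    print(fasta_lengths(io.StringIO(toy)))
```

Comparing record counts and length distributions between the Grouper, Piler and Recon outputs gives a quick feel for how differently the three clustering methods behave on the same BLAST results.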
PASTEC: the REPET Classification Utility
Input: consensus library
Evidence searched:
• TR search
• Tandem Repeat Finder
• BLASTx and tBLASTx (Repbase)
• Pfam hmm, GyDB hmm
• rDNA, tRNA, host genes
Summary of evidences => proposed classification:
• Consensus 1: termLTRs; 0.12% TR; Bx: AtGypsy; Btx: none; profiles: IN, RT => LTR retro
• Consensus 2: none; 0.32% TR; Bx: none; Btx: none; profiles: LRR => Host gene
• Consensus 3: none; 0.23% TR; Bx: none; Btx: none; profiles: none => Unclassified
Hands on popular tools : TEdenovo
Step 5: Detect features in consensus sequences
TEdenovo.py -P Pt -C TEdenovo.cfg -S 5 -s Blaster -c GrpRecPil -m Map
Have a look at the output files:
• HMM profiles
• tBLASTx hits vs Repbase
Hands on popular tools : TEdenovo
Step 6: Classify, combine and remove redundancy
TEdenovo.py -P Pt -C TEdenovo.cfg -S 6 -s Blaster -c GrpRecPil -m Map
Have a look at the output files:
• « .classif »
• « .classif_stats.txt »
• « Pt_sim_denovoLibTEs.fa »
Hands on popular tools : TEdenovo
Step 7: Filter unwanted consensus
TEdenovo.py -P Pt -C TEdenovo.cfg -S 7 -s Blaster -c GrpRecPil -m Map
Step 8: Build groups of related consensus (families)
TEdenovo.py -P Pt -C TEdenovo.cfg -S 8 -s Blaster -c GrpRecPil -m Map -f MCL
Have a look at the output file « ._statsPerCluster.tab »
=> 3 clusters were found by MCL
TEdenovo done!
TEdenovo directory: contents labelled Step 1, Step 2, Step 3, Step 4, Steps 5 & 6, Step 7 and Step 8
• Rationale: Use sequence libraries to annotate all homologous portions in
genomes
• Popular tools:
• RepeatMasker, CENSOR, BLASTER
• TEannot: RepeatMasker + CENSOR + BLASTER
• How does it work?
• BLAST-based homology search between library and genome with
different parameters and sensitivity
• RepeatMasker also offers an HMM profile mode
• CENSOR offers a tBLASTx mode
• Pros: Provides a whole-genome annotation
• Cons: Risk of false positives (e.g. low-complexity sequences)
Genome Annotation
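Once a whole-genome annotation exists, the usual first question is what fraction of the genome it covers. Annotations from different library entries often overlap, so intervals have to be merged before summing. The sketch below assumes 1-based inclusive coordinates (as in GFF) on a single sequence; the function names and the toy numbers are hypothetical.

```python
def covered_bases(intervals: list[tuple[int, int]]) -> int:
    """Bases covered by at least one interval (1-based, inclusive coordinates)."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # No overlap with the running interval: close it and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the running interval if needed.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def covered_fraction(intervals: list[tuple[int, int]], sequence_length: int) -> float:
    """Fraction of a sequence of the given length covered by the intervals."""
    return covered_bases(intervals) / sequence_length


if __name__ == "__main__":
    hits = [(1, 10), (5, 15), (20, 25)]
    print(covered_bases(hits))
    print(covered_fraction(hits, 100))
```

Summing raw hit lengths without merging would count overlapping hits twice and overestimate repeat content, which matters when comparing libraries or tools.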
Hands on popular tools : RepeatMasker
Parameters
Hands on popular tools : RepeatMasker
Run
RepeatMasker -nolow -no_is -pa 8 -dir . -gff -lib Pt_sim_denovoLibTEs.fa Pt.fa > rm.out 2>&1 &
Main output files:
• Pt_rm.fa.tbl
• Pt_rm.fa.out.gff
• Pt_rm.fa.masked
Hands on popular tools : TEannot
Overview
Hands on popular tools : TEannot
Before starting: edit the configuration file and check the parameters
Pt.fa; Pt_refTEs.fa
Hands on popular tools : TEannot
Run
launch_TEannot.py -P Pt -C TEannot.cfg -S 1234578
Step 1: Prepare batches and database
Step 2: Align refTEs against each chunk and against randomized chunks
Step 3: Filter and combine hits (Pt_TEdetect_rnd/threshold.tmp)
Step 4: Search for simple sequence repeats (SSRs)
Step 5: Merge SSR annotations
Step 6: Run BLASTx and tBLASTx with complementary libraries
Step 7: Remove spurious hits and perform long-join procedure
Step 8: Generate GFF3 output
Hands on popular tools : TEannot
Postprocess: PostAnalyzeTELib.py
export REPET_USER=orepet
export REPET_PW=repet_pw
export REPET_DB=repet
PostAnalyzeTELib.py -a 3 -p Pt_chr_allTEs_nr_noSSR_join_path -s Pt_refTEs_seq -g 24612623
Output file « .stats »
Done!
TEannot directory: contents labelled Step 1; Step 2, Step 3 filtering and Step 7; Steps 4 & 5; Step 8; Step 2 random and Step 3 thresholds; PostAnalyzeTELib
More TE resources
Compilation by Tyler Elliott: 340 tools !!

Lecture on the annotation of transposable elements

  • 1. Wednesday • SSH key and virtual machine • Theoretical course: annotation in genomes • Practical course, de novo: RepeatModeler and TEdenovo • Practical course, masking: RepeatMasker and TEannot
  • 2. SSH keys Unix: ssh-keygen Mac: ssh-keygen -t rsa Windows: use PuTTYgen
  • 3. Create an account at https://biosphere.france-bioinformatique.fr/ Request to join the group « bioinfo_te_2020 » Use your public SSH key, starting with « ssh-rsa… »
  • 4. Detection of repetitive elements in assembled genomes Florian Maumus BioinfoTE CNRS thematic school 28 Sept. – 2 Oct., Fréjus, France
  • 5. The C-value paradox • Organism complexity does not correlate with increasing C-value • Similar organisms can show large differences in C-values • The amount of protein-coding DNA can be a minor fraction of genomic DNA
  • 6. The C-value paradox Genome size variations can be explained by: • The size of intergenic regions • The size of regulatory regions • The size of introns • The presence of pseudogenes • WGD and polyploidization events • The amount of repetitive DNA • The amount of genomic dark matter
  • 7. The Repeatome includes: Transposable elements Endogenous viruses Tandem repeats Ribozymes Genes … Repeat complement = Repeatome
  • 8. Pioneering approach to quantify the repetitive fraction in genomes Roy J. Britten (1919-2012) • Reassociation kinetics experiments • The rate at which a particular sequence reassociates is proportional to the number of times it is found in the genome. Figure labels: E. coli DNA, calf DNA (Britten & Kohne, 1968)
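The proportionality stated on slide 8 is commonly written as the ideal second-order reassociation equation. This is a textbook form added here for orientation, not something shown in the slides; the symbols C (single-stranded DNA concentration at time t), C_0 (initial concentration) and k (rate constant) are introduced for illustration.

```latex
\frac{C}{C_0} = \frac{1}{1 + k\,C_0\,t},
\qquad
C_0\,t_{1/2} = \frac{1}{k}
```

Under this idealized model, a sequence family with a larger rate constant k (that is, more copies per genome) reaches half-reassociation at a smaller C_0 t, which is why repeats reassociate first.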
  • 9. Why is it important to identify repeats and repeat-derived elements in genomes? • Address the diversity and evolution of transposable elements • Genome masking for gene prediction • Understand genome structure & composition • Address genome evolution • Investigate gene regulation & transcription networks • Qualify genomic landscapes (e.g. epigenetic marks, 3D organization)
  • 11. Complexity of the repeatome Maumus & Quesneville Current Opinion in Plant Biology 2016
  • 12. Maize, 2.3 Gb genome: a single run of de novo detection => 85% repeats. Human, 3.2 Gb genome: decades of annotation to reach 50% repeats. Different history, different challenges
  • 13. What are we looking for? How can we define a transposable element from a genomic point of view? • Autonomous TEs encode the proteins necessary for their own duplication • TEs have specific structures (e.g. terminal repeats) • TEs are repetitive in a genome (multicopy) • We know all kinds of TEs => NO, the TE complement is defined by a continuum ranging from functional copies to genomic dark matter. • In most genomes, some repetitive elements remain unclassified and we have no idea what they represent
  • 14. Bioinformatic approaches to identify (interspersed) repetitive elements in genomes A - Library-based searches • Banks of transposable element sequence data B - De novo • Signature-based • K-mer-based (strict, extended, cloud) • High-scoring pairs (BLAST-based)
  • 15. TE sequence databases ! Not a « de novo » approach. For instance, use an existing repeat library if you need to annotate a human genome • TE libraries are available for many organisms • They can be produced by manual curation and/or by automated search • They can be more or less exhaustive • They can contain host gene sequences and false positives • They can be of high quality for model species like human or thale cress • There are different types of databases (general, TE type, host type)
  • 16. Widely used repeat databases: Repbase
  • 19. 2.0 repeat databases: Dfam (HMM profiles)
  • 21. A large choice of tools … Finds specific TEs only. E. Lerat, Heredity, 2010
  • 23. K-mers • K-mers are used in different ways for repeat detection: count, anchors, and clouds • K-mers: oligonucleotides of length « k », e.g. ATGC is a 4-mer • Assuming a random distribution and equal base frequencies, the probability of finding a given k-mer at a given position is 1/4^k • To determine a reasonable oligo length (k) for genomes of different lengths (n), we can use the formula « k = log4(n) + 1 » • K=15 for A. thaliana (120 Mb); K=16 for H. sapiens (3 Gb) • For instance, the expected number of occurrences of a given 16-mer in 3 Gb is 3 Gb / 4^16 ≈ 0.7 • The probability of finding 10 occurrences of the same 16-mer by chance is very low
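The arithmetic on slide 23 can be checked with a few lines of Python. This is a minimal sketch under the slide's own assumptions (random sequence, equal base frequencies); the function names are hypothetical and the rule-of-thumb value is returned unrounded because the slide does not state a rounding convention.

```python
import math


def expected_occurrences(genome_size: int, k: int) -> float:
    """Expected count of one specific k-mer in a random genome of this size."""
    return genome_size / 4 ** k


def rule_of_thumb_k(genome_size: int) -> float:
    """Unrounded value of the slide's rule k = log4(n) + 1."""
    return math.log(genome_size, 4) + 1


if __name__ == "__main__":
    # Genome sizes and k values are the ones quoted on the slide.
    for name, size, k in [("A. thaliana", 120_000_000, 15),
                          ("H. sapiens", 3_000_000_000, 16)]:
        print(name,
              "rule of thumb:", round(rule_of_thumb_k(size), 2),
              "expected occurrences at k =", k, ":",
              round(expected_occurrences(size, k), 2))
```

For the human case this reproduces the slide's 3 Gb / 4^16 ≈ 0.7: a given 16-mer is expected less than once by chance, so a 16-mer seen many times is a candidate repeat.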
  • 24. K-mers - Simple time • Rationale: Repeats are identical upon integration (duplication event) • Popular tools: Tallymer, Jellyfish, DSK • How does it work? K-mers are simply counted and repetitive k-mers are mapped onto the genome sequence • Pros: Very fast, helps to gauge the extent of the first-layer repeatome • Cons: Works well only with very recent repeats …TAGGGTACGTGATGATCCGTAGCTAGCCTAGCTAAAGTCCCGATTTAGC… …TAGGGTACGTGATGATCCGTAGCTAGCCTAGCTAAAGTCCCGATTTAGC… …TAGGGTACGTGATGATCCGTAGCTAGCCTAGCTAAAGTCCCGATTTAGC… …TAGGGTACGTGAAGATCCGTAGCTAGCCTAGCTATAGTCCCGATTTAGC… …TAGGGTACGAGATGATCCGTAGCTAGCCTAGCTAAAGTCCCCATTTAGC… …TAGGGTACGTGATGATCCGTAGCTAGCCTAGCTAAAGTCCCGATTTAGC…
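The count-and-map idea on slide 24 can be sketched in plain Python. This is a toy illustration, not how Tallymer, Jellyfish or DSK are implemented: it holds everything in memory, looks at one strand only, and the function names and the `min_count` threshold are assumptions made for the example.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer on the forward strand."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repetitive_positions(seq: str, k: int, min_count: int = 2) -> list[int]:
    """Start positions (0-based) of k-mers seen at least min_count times."""
    seq = seq.upper()
    counts = count_kmers(seq, k)
    return [i for i in range(len(seq) - k + 1)
            if counts[seq[i:i + k]] >= min_count]


if __name__ == "__main__":
    toy = "ATGCATGC"
    print(count_kmers(toy, 4).most_common(1))
    print(repetitive_positions(toy, 4))
```

A single mutation breaks up to k overlapping k-mers, which is one way to see why exact counting works best on very recent repeats.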
  • 25. K-mers - Extended • Rationale: Repeats conserve k-mers that are untouched by random mutations • Popular tools: RepeatScout • How does it work? Repetitive k-mers are detected and used as anchors to begin the alignment of flanking sequences. The process stops when the alignment score stops increasing, and a consensus sequence is derived • Pros: Fast, provides a library of repetitive elements; filters low-complexity sequences; works well with short repeats (100-200 bp) • Cons: Consensus sequences are often fragmented …TAGGGTACGTGAAGATCCGTAGCTAGCCTAGCTATAGTCCCGATTTAGC… …TAGGGTACGAGATGATCCGTAGCTAGCCTAGCTAAAGTCCCCATTTAGC… …TAGGGTACGTGATGATCCGTAGCTAGCCTAGCTAAAGTCCCGATTTAGC…
  • 26. K-mers - Clouds • Rationale: k-mers that are affected by random mutations can be very similar to other repetitive k-mers. A high local density of these k-mers indicates sequences that potentially result from duplication • Popular tools: P-clouds (De Koning et al.) • How does it work? Repetitive k-mers are detected and used to build clouds together with similar k-mers presenting 1-3 SNPs. It takes either full genomes or known copies to build the seed. • Pros: Fast, deep repeatome annotation • Cons: Risk of high false positive rate, fragmented annotation Figure labels: In-cloud k-mers; Duplication
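To make the cloud idea on slide 26 concrete, here is a deliberately simplified sketch: frequent k-mers form a core, and any observed k-mer within one substitution of a core k-mer joins the cloud. It is not the P-clouds algorithm itself (which uses several thresholds and layers); the function names and the `core_min` threshold are assumptions for illustration.

```python
from collections import Counter


def hamming_neighbors(kmer: str, alphabet: str = "ACGT") -> set[str]:
    """All k-mers that differ from kmer at exactly one position."""
    neighbors = set()
    for i, base in enumerate(kmer):
        for other in alphabet:
            if other != base:
                neighbors.add(kmer[:i] + other + kmer[i + 1:])
    return neighbors


def build_cloud(counts: Counter, core_min: int = 3) -> set[str]:
    """Core = k-mers seen >= core_min times; cloud = core plus observed 1-SNP neighbors."""
    core = {kmer for kmer, n in counts.items() if n >= core_min}
    cloud = set(core)
    for kmer in core:
        cloud |= {n for n in hamming_neighbors(kmer) if n in counts}
    return cloud


if __name__ == "__main__":
    observed = Counter({"ACGT": 3, "ACGA": 1, "TTTT": 1})
    print(sorted(build_cloud(observed)))
```

Regions of the genome densely covered by in-cloud k-mers would then be reported as putative repeats, which also hints at where the false positives mentioned on the slide come from.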
  • 27. High-scoring pairs (BLAST-based) • Rationale: Different copies from a repeat family share sequence similarity over long evolutionary timescales. A given copy should produce BLAST hits against cognate copies. • Popular tools: Piler, Recon, Silix • How does it work? The genome is compared to itself using BLASTn. High-scoring pairs are grouped into clusters on the basis of similarity and coverage. Sequences from each cluster are aligned together to derive a consensus sequence. • Pros: Provides a library of potentially full-length repetitive elements that accounts for more or less conserved families • Cons: Clustering can be a very long process, sometimes endless …TAGGGTACGTGAAGATCCGTAGCTAGCCTAGCTATAGTCCCGATTTAGC… ||||||||| || ||||||||||||||||||||| |||||| ||||||| …TAGGGTACGAGATGATCCGTAGCTAGCCTAGCTAAAGTCCCCATTTAGC…
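The grouping step on slide 27 can be illustrated with single-linkage clustering over pairs of matching copies. This is only a sketch of the general idea: Piler, Recon and Silix apply their own similarity and coverage criteria, which are not reproduced here, and the copy identifiers are made up.

```python
def cluster_pairs(pairs):
    """Group copy identifiers linked by high-scoring pairs (single linkage, union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return sorted((sorted(members) for members in clusters.values()),
                  key=lambda members: members[0])


if __name__ == "__main__":
    # Hypothetical HSPs between genomic copies, already filtered by score.
    hsps = [("copy1", "copy2"), ("copy2", "copy3"), ("copy4", "copy5")]
    print(cluster_pairs(hsps))
```

Single linkage chains families together through any shared hit, so real tools add coverage and boundary checks before building a consensus from each cluster.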
  • 28. High-scoring pairs - pipelines • Rationale: Combining different tools produces more exhaustive repeat libraries • Popular tools: • RepeatModeler: RepeatScout + Recon (+ LTR tool in v2) • TEdenovo: Grouper + Piler + Recon + RepeatScout (+ LTR tool) • How does it work? Several detection tools are launched sequentially or in parallel. Redundancy between consensus sequences is removed afterwards. • Pros: Provides a library of potentially full-length repetitive elements that accounts for most repeat families • Cons: Usually takes a genome subset as input Maumus & Quesneville, 2014
  • 32. Consensus sequences: useful artefacts • Simple clustering programs are based on sequence identity, not relationships • COSEG is a program which automatically identifies repeat subfamilies using significant co-segregating (2-3 bp) mutations. • AnTE is a probabilistic approach that infers the likelihood of each copy being an ancestral (master) copy Wacholder et al. 2014
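For readers who have not built one, a consensus in its simplest form is a per-column majority vote over aligned copies. The sketch below assumes the copies are already aligned to equal length; it is not the consensus procedure of any tool named in the slides, and the function name is hypothetical.

```python
from collections import Counter


def majority_consensus(aligned: list[str]) -> str:
    """Most frequent character in each alignment column (ties: first seen wins)."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be non-empty and aligned to equal length")
    consensus = []
    for column in zip(*aligned):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


if __name__ == "__main__":
    copies = ["ACGT", "ACGA", "ACGT"]
    print(majority_consensus(copies))
```

Because the vote is taken column by column, the result may match no single genomic copy, which is one sense in which a consensus is an artefact, and it flattens the subfamily structure that COSEG is designed to recover.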
  • 33. Take home • Find the right tool for each question you are addressing • Find an acceptable trade-off between • Sensitivity • Specificity • Speed • Quality • Few benchmarks are available, and they often do not cover all tools • Benchmark definition is also loose and variable (data, organisms) • Compare and/or combine different tools • Carefully check the settings • Perform manual curation or at least look over the outputs • Don't expect to reach perfection with any tool(s)
  • 34. Hands on popular tools (practical work) • RepeatModeler (tetools package) • TEdenovo (REPET package) Our beast today will be the diatom Phaeodactylum tricornutum (stramenopile) Genome size = 27 Mb
  • 35. Hands on popular tools: RepeatModeler How it works: • It performs a first layer of detection with RepeatScout • Then it performs successive runs of Recon on masked genomic subsets of increasing size to identify more divergent repeats • The new version also offers an LTR prediction step • It provides a classification tool « RepeatClassifier »
  • 36. Hands on popular tools: RepeatModeler Process overview (flowchart) Round #1: RepeatScout => RS library. Rounds #2, #3 … #n, with increasing sample size: TRF mask => RepeatMasker => Recon => Recon library. The RS library and the Recon libraries are combined into the RepeatModeler library
  • 37. Hands on popular tools: RepeatModeler The magic numbers in the RepeatModeler perl script
  • 38. Hands on popular tools: RepeatModeler Two command lines: • BuildDatabase -name Pt Pt.fa • RepeatModeler -database Pt -pa 4 1>& rmod.out &
  • 39. RepeatModeler - Round-1 (= RepeatScout) Build k-mer frequency table: sort -k 2,2nr RM_21.FriSep180904562020/round-1/sampleDB-1.fa.lfreq
  • 40. RepeatModeler - Round-1 (= RepeatScout) Build consensus, retrieve copies, filter, and refine consensus for each family. Output library = consensi-refined.fa (Price et al. 2005)
  • 41. RepeatModeler - Round-2 (= BLAST + Recon) Main steps are as follows: • BLAST batches against each other • Build families of hits with Recon • Infer consensus • Refine consensus
  • 42. RepeatModeler - Round-2 (= BLAST + Recon) See the BLAST output; it will be the input for Recon: more RM_21.FriSep180904562020/round-2/msps.out
  • 43. RepeatModeler - Round-2 (= BLAST + Recon) Finishing: after the last round of Recon, the two consensus libraries are combined. Output library = Pt-families.fa. With this example, we get: • 143 consensus sequences from RepeatScout • 4 consensus sequences from Recon
  • 44. RepeatClassifier • Homology-based classification module • Compares the consensus sequences generated by the various tools to: • RepeatMasker Repeat Protein DB • RepeatMasker libraries (e.g. Dfam and/or RepBase) Command: RepeatClassifier -consensi Pt-families.fa
  • 45. Hands on popular tools: TEdenovo How it works: • It performs a single run with the same input, based on all-by-all BLAST • Several steps are parallelized (slurm or SGE) • It uses Recon, Piler and Grouper to build repeat families • It offers to run RepeatScout • It also offers to run an LTR prediction step • It includes a classification tool, PASTEC (the REPET Classification Utility)
  • 46. Hands on popular tools: TEdenovo Process overview (figure labels: + RepeatScout (v2.2); REPET Classification utility)
  • 47. Hands on popular tools: TEdenovo The magic numbers in the TEdenovo configuration file (TEdenovo.cfg)
  • 48. Before starting, the script Preprocess.py can help prepare your sample
  • 49. Preprocess.py In this case, we will just run step 1 to make sure that the headers won't be a source of errors: PreProcess.py -S 1 -i ABQD01.1.fsa_nt -v 3 • See the stats: more ABQD01.1.fsa_nt_CheckFasta/ABQD01.1.fsa_nt.stats • See what happened to the headers (quite ugly but let's move on): grep '>' ABQD01.1.fsa_nt_CheckFasta/ABQD01.1.fsa_nt.formated
  • 50. Before starting, edit the configuration file « TEdenovo.cfg » (screenshot labels: MySQL database credentials; job manager; unique identifier (name of FASTA file); working directory)
  • 51. Before starting, edit the configuration file « TEdenovo.cfg » (screenshot label: essential banks)
  • 52. Before starting, edit the configuration file « TEdenovo.cfg »
  • 53. Hands on popular tools: TEdenovo Ready to launch! This command will launch the 8 steps of TEdenovo sequentially: nohup launch_TEdenovo.py -P Pt -C TEdenovo.cfg -f MCL >& TEdenovo.log &
  • 54. Hands on popular tools: TEdenovo Step 1: Genomic sequences are cut and grouped into batches Cmd= TEdenovo.py -P Pt -C TEdenovo.cfg -S 1
  • 55. Hands on popular tools: TEdenovo Step 2: All-by-all BLAST of batches. Jobs are launched in parallel; check job status with squeue Cmd= TEdenovo.py -P Pt -C TEdenovo.cfg -S 2 -s Blaster Have a look at the output file: less Pt_Blaster/Pt.align.not_over.filtered
  • 56. Hands on popular tools: TEdenovo Step 3: Clustering of HSPs (High-Scoring Pairs) with Grouper, Piler and Recon Cmd= • TEdenovo.py -P Pt -C TEdenovo.cfg -S 3 -s Blaster -c Grouper • TEdenovo.py -P Pt -C TEdenovo.cfg -S 3 -s Blaster -c Piler • TEdenovo.py -P Pt -C TEdenovo.cfg -S 3 -s Blaster -c Recon Have a look at the output files: • grep -c '>' Pt_Blaster_Grouper/Pt_Blaster_Grouper_10elem_20seq.fa => 449 • grep -c '>' Pt_Blaster_Piler/Pt_Blaster_Piler.map.filtered-10-20.flank_size0.fa => 11 • grep -c '>' Pt_Blaster_Recon/Pt_Blaster_Recon.map.filtered-10-20.flank_size0.fa => 416 Piler is very stringent!
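The `grep -c '>'` checks on slide 56 count header lines. A small Python equivalent can report sequence lengths at the same time, which helps when comparing the output of the three clustering tools. This is a generic FASTA reader written for illustration, not part of REPET; the function name is hypothetical, and records sharing an identical header would be merged.

```python
def fasta_lengths(path: str) -> dict[str, int]:
    """Map each FASTA header (without '>') to the length of its sequence."""
    lengths: dict[str, int] = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:]
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


if __name__ == "__main__":
    import sys

    # Usage: python fasta_lengths.py some_library.fa
    if len(sys.argv) > 1:
        lengths = fasta_lengths(sys.argv[1])
        print("records:", len(lengths))
        print("total bases:", sum(lengths.values()))
```

Unlike `grep -c '>'`, which counts any line containing the character, this counts only lines that start with it.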
  • 57. Hands on popular tools: TEdenovo Step 4: Align HSPs for each cluster and build consensus Cmd= • TEdenovo.py -P Pt -C TEdenovo.cfg -S 4 -s Blaster -c Grouper -m Map • TEdenovo.py -P Pt -C TEdenovo.cfg -S 4 -s Blaster -c Piler -m Map • TEdenovo.py -P Pt -C TEdenovo.cfg -S 4 -s Blaster -c Recon -m Map Have a look at the output files: • grep -c '>' Pt_Blaster_Grouper_Map/Pt_Blaster_Grouper_Map_consensus.fa => 27 • grep -c '>' Pt_Blaster_Piler_Map/Pt_Blaster_Piler_Map_consensus.fa => 1 • grep -c '>' Pt_Blaster_Recon_Map/Pt_Blaster_Recon_Map_consensus.fa => 26
  • 58. PASTEC: the REPET Classification Utility (diagram) Input: consensus library. Searches: TR search (Tandem Repeat Finder); BLASTx and tBLASTx (Repbase); Pfam hmm, GyDB hmm; rDNA, tRNA, host genes. Summary of evidence => proposed classification: • Consensus 1: termLTRs; 0.12% TR; Bx: AtGypsy; Btx: none; profiles: IN, RT => LTR retro • Consensus 2: none; 0.32% TR; Bx: none; Btx: none; profiles: LRR => Host gene • Consensus 3: none; 0.23% TR; Bx: none; Btx: none; profiles: none => Unclassified
  • 59. Hands on popular tools: TEdenovo Step 5: Detect features in consensus sequences Cmd= TEdenovo.py -P Pt -C TEdenovo.cfg -S 5 -s Blaster -c GrpRecPil -m Map Have a look at the output files: • HMM profiles • tBLASTx hits vs Repbase
  • 60. Hands on popular tools: TEdenovo Step 6: Classify, combine and remove redundancy Cmd= TEdenovo.py -P Pt -C TEdenovo.cfg -S 6 -s Blaster -c GrpRecPil -m Map • Have a look at the output file « .classif »
  • 61. Hands on popular tools: TEdenovo Step 6: Classify, combine and remove redundancy Cmd= TEdenovo.py -P Pt -C TEdenovo.cfg -S 6 -s Blaster -c GrpRecPil -m Map • Have a look at the output file « .classif_stats.txt »
  • 62. Hands on popular tools: TEdenovo Step 6: Classify, combine and remove redundancy Cmd= TEdenovo.py -P Pt -C TEdenovo.cfg -S 6 -s Blaster -c GrpRecPil -m Map • Have a look at the output file « Pt_sim_denovoLibTEs.fa » …
  • 63. Hands on popular tools: TEdenovo Step 7: Filter unwanted consensus Cmd= TEdenovo.py -P Pt -C TEdenovo.cfg -S 7 -s Blaster -c GrpRecPil -m Map Step 8: Build groups of related consensus (families) Cmd= TEdenovo.py -P Pt -C TEdenovo.cfg -S 8 -s Blaster -c GrpRecPil -m Map -f MCL Have a look at the output file « ._statsPerCluster.tab » => 3 clusters were found by MCL. TEdenovo done!
  • 64. TEdenovo directory Step 1 Step 2 Step 3 Step 4 Step 5&6 Step 7 Step 8
  • 66. Genome Annotation • Rationale: Use sequence libraries to annotate all homologous portions in genomes • Popular tools: • RepeatMasker, CENSOR, BLASTER • TEannot: RepeatMasker + CENSOR + BLASTER • How does it work? • BLAST-based homology search between library and genome with different parameters and sensitivity • RepeatMasker also offers an HMM profile mode • CENSOR offers a tBLASTx mode • Pros: Provides a whole-genome annotation • Cons: Risk of false positives (e.g. low-complexity sequences)
  • 67. Hands on popular tools: RepeatMasker Parameters
  • 68. Hands on popular tools: RepeatMasker Run: RepeatMasker -nolow -no_is -pa 8 -dir . -gff -lib Pt_sim_denovoLibTEs.fa Pt.fa 1>& rm.out & Main output files: • Pt_rm.fa.tbl • Pt_rm.fa.out.gff • Pt_rm.fa.masked
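A GFF annotation such as the one listed on slide 68 can be summarized as the number of genome bases covered by at least one annotation. The sketch below assumes standard tab-separated GFF columns (sequence name in column 1, 1-based inclusive start and end in columns 4 and 5) and merges overlapping or adjacent hits so that no base is counted twice. The function names are hypothetical, and the genome size used to turn the count into a fraction has to be supplied by the user.

```python
def read_gff_intervals(path: str):
    """Yield (seqid, start, end) from a GFF file, skipping comments and blank lines."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            yield fields[0], int(fields[3]), int(fields[4])


def covered_bases(intervals) -> int:
    """Total bases covered after merging overlapping or adjacent intervals per sequence."""
    by_seq = {}
    for seqid, start, end in intervals:
        by_seq.setdefault(seqid, []).append((start, end))
    total = 0
    for spans in by_seq.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
    return total


if __name__ == "__main__":
    toy = [("chr1", 1, 10), ("chr1", 5, 20), ("chr1", 30, 40), ("chr2", 1, 5)]
    print(covered_bases(toy))
```

With a real file, `covered_bases(read_gff_intervals(path)) / genome_size` gives the annotated fraction, a figure that can be compared with the summary RepeatMasker writes to its `.tbl` file.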
  • 69. Hands on popular tools: TEannot Overview
  • 70. Hands on popular tools: TEannot Before starting: edit the configuration file and check the parameters Pt.fa; Pt_refTEs.fa
  • 71. Hands on popular tools: TEannot Run: launch_TEannot.py -P Pt -C TEannot.cfg -S 1234578 • Step 1: Prepare batches and database • Step 2: Align refTEs against each chunk and against randomized chunks • Step 3: Filter and combine hits (Pt_TEdetect_rnd/threshold.tmp) • Step 4: Search for simple sequence repeats (SSRs) • Step 5: Merge SSR annotations • Step 6: Run BLASTx and tBLASTx with complementary libraries • Step 7: Remove spurious hits and perform long-join procedure • Step 8: Generate GFF3 output
  • 72. Hands on popular tools: TEannot Postprocess: PostAnalyzeTELib.py • export REPET_USER=orepet • export REPET_PW=repet_pw • export REPET_DB=repet • PostAnalyzeTELib.py -a 3 -p Pt_chr_allTEs_nr_noSSR_join_path -s Pt_refTEs_seq -g 24612623 Output file « .stats » Done!
  • 73. TEannot directory Step 1 Step 2, Step 3 filtering, Step 7 Step 4 & 5 Step 8 Step 2 random and Step 3 thresholds PostAnalyzeTELib
  • 74. More TE resources Compilation by Tyler Elliott: 340 tools!