Lecture on the annotation of transposable elements at the CNRS school "BioinfoTE" in 2020 (Fréjus, France). https://bioinfote.sciencesconf.org/
ORGANIZING COMMITTEE
Emmanuelle Lerat (LBBE – CNRS Université Lyon 1),
Anna-Sophie Fiston-Lavier (ISEM – Université de Montpellier)
Florian Maumus (URGI – INRAe Versailles)
François Sabot (DIADE – IRD Montpellier)
4. Detection of repetitive elements in assembled genomes
Florian Maumus
BioinfoTE CNRS thematic school
28 Sept. – 2 Oct., Fréjus, France
5. The C-value paradox
• Organism complexity does not correlate with increasing C-value
• Similar organisms can show large differences in C-values
• The amount of protein-coding DNA can be a minor fraction of genomic DNA
6. The C-value paradox
Genome size variations can be explained by:
• The size of intergenic regions
• The size of regulatory regions
• The size of introns
• The presence of pseudogenes
• WGD and polyploidization events
• The amount of repetitive DNA
• The amount of genomic dark matter
8. Pioneering approach to quantify the repetitive fraction of genomes
Roy J. Britten (1919 - 2012)
• Reassociation kinetics experiments
• The rate at which a particular sequence will
reassociate is proportional to the number of
times it is found in the genome.
Figure: reassociation curves for E. coli DNA and calf DNA
Britten & Kohne, 1968
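The proportionality stated above can be illustrated with the standard second-order reassociation model, C/C0 = 1 / (1 + k·C0t), which is not written on the slide and is assumed here. In this minimal sketch, the function names and the `base_rate` parameter are hypothetical; the only point is that a sequence present in more copies has a larger rate constant and therefore reassociates at a lower C0t.

```python
def fraction_single_stranded(cot, rate_constant):
    """Fraction of DNA still single-stranded at a given C0t value.

    Assumes ideal second-order kinetics: C/C0 = 1 / (1 + k * C0t).
    """
    return 1.0 / (1.0 + rate_constant * cot)


def rate_for_copies(copy_number, base_rate):
    """Rate constant assumed proportional to copy number (hypothetical helper)."""
    return copy_number * base_rate


def half_cot(rate_constant):
    """C0t value at which half of the DNA has reassociated (C0t 1/2 = 1/k)."""
    return 1.0 / rate_constant


# A 1000-copy repeat reaches half reassociation at a 1000-fold lower C0t
# than a single-copy sequence with the same base rate.
single = half_cot(rate_for_copies(1, base_rate=0.001))
repeat = half_cot(rate_for_copies(1000, base_rate=0.001))
print(single, repeat)
```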
9. Why is it important to identify repeats and repeat-derived elements in genomes?
• Address the diversity and evolution of transposable elements
• Mask genomes for gene prediction
• Understand genome structure & composition
• Address genome evolution
• Investigate gene regulation & transcription networks
• Qualify genomic landscapes (e.g. epigenetic marks, 3D organization)
11. Complexity of the repeatome
Maumus & Quesneville
Current Opinion in Plant Biology 2016
12. Maize
2.3 Gb genome
Single run of de novo detection
=> 85% repeats
Human
3.2 Gb genome
Decades of annotation
to reach 50% repeats
Different history, different challenges
13. What are we looking for?
How can we define a transposable element
from a genomic point of view?
• Autonomous TEs encode the proteins necessary for their own duplication
• TEs have specific structures (e.g. terminal repeats)
• TEs are repetitive in a genome (multicopy)
• We know all kinds of TEs
=> NO, the TE complement is defined by a continuum ranging from
functional copies to genomic dark matter.
In most genomes, some repetitive elements remain unclassified and we
have no idea what they represent
14. Bioinformatic approaches to identify (interspersed) repetitive elements in genomes
A - Library-based searches
• Banks of transposable element sequence data
B - De novo
• Signature-based
• K-mer-based (strict, extended, cloud)
• High-scoring pairs (BLAST-based)
15. TE sequence databases
! Not a « de novo » approach
For instance, use an existing repeat library if you need to annotate a human genome
• TE libraries are available for many organisms
• They can be produced by manual curation and/or by automated search
• They can be more or less exhaustive
• They can contain host gene sequences and false positives
• They can be of high quality for model species like human or thale cress
• There are different types of databases (general, TE type, host type)
23. K-mers
• K-mers are used in different ways for repeat detection: counts, anchors, and clouds
• K-mers: oligonucleotides of length « k », e.g. ATGC is a 4-mer
• Assuming a random sequence with equal base frequencies, the probability of finding a given k-mer at a given position is 1/4^k
• To determine a reasonable oligo length (k) for genomes of different lengths (n), we can use the formula “k = log4(n) + 1”
• K=15 for A. thaliana (120 Mb); K=16 for H. sapiens (3 Gb)
• For instance, a given 16-mer is expected about 3 Gb/4^16 = 0.7 times in a random 3 Gb genome
• The probability of finding 10 occurrences of the same 16-mer by chance is very low
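The arithmetic on this slide can be checked with a short sketch. The function names are hypothetical, and the Poisson approximation used for the "10 occurrences" statement is an assumption added here (it treats positions as independent, with equal base frequencies). For 120 Mb and 3 Gb the formula returns roughly 14.4 and 16.7, so the K values quoted above are nearby integers rather than exact outputs.

```python
import math


def suggested_k(genome_size):
    """k = log4(n) + 1, as given on the slide (returns a non-integer value)."""
    return math.log(genome_size, 4) + 1


def expected_occurrences(genome_size, k):
    """Expected count of one given k-mer in a random genome of this size."""
    return genome_size / 4 ** k


def prob_at_least(lam, min_count, n_terms=200):
    """P(X >= min_count) for X ~ Poisson(lam), summed term by term.

    The tail is summed directly to avoid the rounding error of 1 - P(X < min_count).
    """
    term = math.exp(-lam) * lam ** min_count / math.factorial(min_count)
    total = 0.0
    for i in range(min_count, min_count + n_terms):
        total += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return total


lam = expected_occurrences(3_000_000_000, 16)
print(round(lam, 2))           # about 0.7, as on the slide
print(prob_at_least(lam, 10))  # chance of >= 10 copies of one 16-mer
```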
24. K-mers - Simple
• Rationale: Repeats are identical upon integration (duplication event)
• Popular tools: Tallymer, Jellyfish, DSK
• How does it work? K-mers are simply counted and repetitive k-mers are mapped onto the genome sequence
• Pros: Very fast, helps estimate the extent of the first-layer repeatome
• Cons: Works well only with very recent repeats
…TAGGGTACGTGATGATCCGTAGCTAGCCTAGCTAAAGTCCCGATTTAGC…
…TAGGGTACGTGATGATCCGTAGCTAGCCTAGCTAAAGTCCCGATTTAGC…
…TAGGGTACGTGATGATCCGTAGCTAGCCTAGCTAAAGTCCCGATTTAGC…
…TAGGGTACGTGAAGATCCGTAGCTAGCCTAGCTATAGTCCCGATTTAGC…
…TAGGGTACGAGATGATCCGTAGCTAGCCTAGCTAAAGTCCCCATTTAGC…
…TAGGGTACGTGATGATCCGTAGCTAGCCTAGCTAAAGTCCCGATTTAGC…
(figure: copies of a repeat shown along a time axis)
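The count-then-map idea can be sketched in a few lines. This is only an illustration of the principle, not how Tallymer, Jellyfish or DSK are implemented; the function names and the `min_count` parameter are hypothetical.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping k-mer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repeat_mask(seq, k, min_count=2):
    """Flag each base covered by at least one k-mer seen min_count times or more."""
    counts = count_kmers(seq, k)
    covered = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if counts[seq[i:i + k]] >= min_count:
            for j in range(i, i + k):
                covered[j] = True
    return covered


mask = repeat_mask("ACGTACGTTTGCA", k=4)
print("".join("R" if flag else "." for flag in mask))
```

A single mutation in a copy removes up to k repetitive k-mers around it, which is why this approach fades quickly as repeats age.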
25. K-mers - Extended
• Rationale: Repeats conserve k-mers that are untouched by random mutations
• Popular tools: RepeatScout
• How does it work? Repetitive k-mers are detected and used as anchors to begin the alignment of flanking sequences. The process stops when the alignment score stops increasing, and a consensus sequence is derived
• Pros: Fast; provides a library of repetitive elements; filters low-complexity sequences; works well with short repeats (100-200 bp)
• Cons: Consensus sequences are often fragmented
…TAGGGTACGTGAAGATCCGTAGCTAGCCTAGCTATAGTCCCGATTTAGC…
…TAGGGTACGAGATGATCCGTAGCTAGCCTAGCTAAAGTCCCCATTTAGC…
…TAGGGTACGTGATGATCCGTAGCTAGCCTAGCTAAAGTCCCGATTTAGC…
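The anchor-and-extend principle can be sketched as follows. This is a deliberately simplified stand-in, not RepeatScout's scoring scheme: it anchors on the most frequent k-mer and extends one column at a time by majority vote, stopping when agreement between copies drops. The function name and the `min_copies` and `min_agreement` parameters are hypothetical.

```python
from collections import Counter, defaultdict


def extend_consensus(seq, k=8, min_copies=3, min_agreement=0.6):
    """Build a consensus by extending the most frequent k-mer in both directions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    if not positions:
        return None
    seed, starts = max(positions.items(), key=lambda item: len(item[1]))
    if len(starts) < min_copies:
        return None

    def majority_base(offset):
        # Column of bases found at the same offset from every anchor occurrence.
        bases = [seq[s + offset] for s in starts if 0 <= s + offset < len(seq)]
        if len(bases) < min_copies:
            return None
        base, count = Counter(bases).most_common(1)[0]
        return base if count / len(bases) >= min_agreement else None

    consensus = seed
    offset = k
    while (base := majority_base(offset)) is not None:  # extend to the right
        consensus += base
        offset += 1
    offset = -1
    while (base := majority_base(offset)) is not None:  # extend to the left
        consensus = base + consensus
        offset -= 1
    return consensus
```

Because the vote tolerates a minority of mismatches, a copy carrying a point mutation still contributes to the consensus, unlike in the strict counting approach.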
26. K-mers - Clouds
• Rationale: k-mers that are affected by random mutations can be very similar to other repetitive k-mers. A high local density of these k-mers indicates sequences that potentially result from duplication
• Popular tools: P-clouds (De Koning et al.)
• How does it work? Repetitive k-mers are detected and used to build clouds together with similar k-mers differing by 1-3 substitutions. It takes either full genomes or known copies to build the seed.
• Pros: Fast, deep repeatome annotation
• Cons: Risk of a high false positive rate, fragmented annotation
(figure labels: duplication; in-cloud k-mers)
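A minimal sketch of the cloud idea, assuming a single core threshold and a neighbourhood of exactly one substitution. The actual P-clouds method uses several thresholds and a more elaborate expansion; the names `build_cloud` and `core_min_count` are hypothetical.

```python
from collections import Counter


def hamming_neighbors(kmer, alphabet="ACGT"):
    """Yield every k-mer that differs from kmer by exactly one substitution."""
    for i, base in enumerate(kmer):
        for other in alphabet:
            if other != base:
                yield kmer[:i] + other + kmer[i + 1:]


def build_cloud(seq, k, core_min_count):
    """Return (core, cloud): frequent k-mers, plus observed one-mismatch neighbours."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    core = {kmer for kmer, n in counts.items() if n >= core_min_count}
    cloud = set(core)
    for kmer in core:
        for neighbor in hamming_neighbors(kmer):
            if neighbor in counts:  # keep only neighbours present in the genome
                cloud.add(neighbor)
    return core, cloud


core, cloud = build_cloud("ACGTCCACGTGGACGTTTACCT", k=4, core_min_count=3)
print(sorted(core), sorted(cloud))
```

Genome positions densely covered by in-cloud k-mers would then be reported as candidate repeats, which is also where the false positives mentioned above come from.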
27. High-scoring pairs (BLAST-based)
• Rationale: Different copies from a repeat family share sequence similarity over long evolutionary timescales. A given copy should produce BLAST hits against cognate copies.
• Popular tools: Piler, Recon, Silix
• How does it work? The genome is compared to itself using BLASTn. High-scoring pairs are grouped into clusters on the basis of similarity and coverage. Sequences from each cluster are aligned together to derive a consensus sequence.
• Pros: Provides a library of potentially full-length repetitive elements that accounts for more or less conserved families
• Cons: Clustering can be a very long process, sometimes endless
…TAGGGTACGTGAAGATCCGTAGCTAGCCTAGCTATAGTCCCGATTTAGC…
 ||||||||| || ||||||||||||||||||||| |||||| |||||||
…TAGGGTACGAGATGATCCGTAGCTAGCCTAGCTAAAGTCCCCATTTAGC…
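The grouping step can be sketched as single-linkage clustering over pairs that pass identity and coverage cut-offs. This is a generic union-find illustration, not the algorithm used by Piler, Recon or Silix; the tuple layout and the `min_identity` and `min_coverage` thresholds are assumptions.

```python
def cluster_hsps(hsps, min_identity=80.0, min_coverage=0.8):
    """Group sequence ids linked by HSPs passing both thresholds.

    hsps: iterable of (query_id, subject_id, percent_identity, coverage).
    Returns clusters as sorted lists, largest cluster first.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for query, subject, identity, coverage in hsps:
        find(query)    # register both ids, even if the pair is rejected
        find(subject)
        if query != subject and identity >= min_identity and coverage >= min_coverage:
            parent[find(query)] = find(subject)

    clusters = {}
    for seq_id in list(parent):
        clusters.setdefault(find(seq_id), set()).add(seq_id)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: (-len(c), c))


pairs = [("a", "b", 95, 0.9), ("b", "c", 90, 0.85), ("d", "e", 99, 1.0), ("a", "d", 60, 0.9)]
print(cluster_hsps(pairs))
```

Single linkage chains families together through any one good pair, which is one reason real tools add stricter coverage rules before building a consensus from each cluster.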
28. High-scoring pairs - pipelines
• Rationale: Combining different tools produces more exhaustive repeat libraries
• Popular tools:
• RepeatModeler: RepeatScout + Recon (+ LTR tool in v2)
• TEdenovo: Grouper + Piler + Recon + RepeatScout (+ LTR tool)
• How does it work? Several detection tools are launched sequentially or in parallel.
Redundancy between consensus sequences is removed afterwards.
• Pros: Provides a library of potentially full length repetitive elements that
accounts for most repeat families
• Cons: Usually takes a genome subset as input
Maumus & Quesneville, 2014
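As a rough illustration of redundancy removal between consensus sequences, the sketch below drops any consensus that is exactly contained in a longer one, on either strand. Real pipelines rely on alignment identity and coverage rather than exact containment, so this is a simplified stand-in; `remove_contained` is a hypothetical name.

```python
def reverse_complement(seq):
    """Reverse complement of an A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def remove_contained(consensus):
    """Keep the names of consensus sequences not contained in a longer kept one.

    consensus: dict mapping name -> sequence.
    """
    kept = []
    # Visit longest sequences first so shorter ones are tested against them.
    for name, seq in sorted(consensus.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        forward = seq.upper()
        reverse = reverse_complement(forward)
        if any(forward in other or reverse in other for _, other in kept):
            continue
        kept.append((name, forward))
    return [name for name, _ in kept]


library = {"A": "ACGTTGCAAT", "B": "GTTGCA", "C": "TGCAAC", "D": "CCCC"}
print(remove_contained(library))
```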
32. Consensus sequences: useful artefacts
• Simple clustering programs are based on sequence identity, not relationships
• COSEG is a program that automatically identifies repeat subfamilies using significant co-segregating (2-3 bp) mutations.
• AnTE is a probabilistic approach that infers the likelihood of each copy being an ancestral (master) copy
Wacholder et al. 2014
33. Take home
• Find the right tool for each question you are addressing
• Find an acceptable trade-off between
• Sensitivity
• Specificity
• Speed
• Quality
• Few benchmarks are available, and they are often not exhaustive (tools)
• Benchmark definition is also loose and variable (data, organisms)
• Compare and/or combine different tools
• Carefully check the settings
• Perform manual curation or at least look over the outputs
• Don’t expect to reach perfection with any tool(s)
34. Hands on popular tools
Practical session
• RepeatModeler (tetools package)
• TEdenovo (REPET package)
Our beast today will be the diatom
Phaeodactylum tricornutum (stramenopile)
Genome size = 27 Mb
35. Hands on popular tools : RepeatModeler
How it works:
• It performs a first layer of detection with RepeatScout
• Then it performs successive runs of Recon on masked genomic subsets of increasing size to identify more divergent repeats
• The new version also proposes to run an LTR prediction step
• It proposes a classification tool « RepeatClassifier »
40. RepeatModeler – Round-1 (=RepeatScout)
Build consensus, retrieve copies, filter, and refine consensus for each family
Output library = consensi-refined.fa
Price et al. 2005
41. RepeatModeler – Round-2 (= BLAST + Recon)
Main steps are as follows:
• BLAST batches against each other
• build families of hits with Recon
• infer consensus
• refine consensus
43. RepeatModeler – Round-2 (= BLAST + Recon): Finishing
After the last round of Recon, the two libraries of consensus sequences are combined
Output library = Pt-families.fa
With this example, we get:
• 143 consensus sequences from RepeatScout
• 4 consensus sequences from Recon
44. RepeatClassifier
• Homology-based classification module
• Compares the consensus generated by the various tools to
• RepeatMasker Repeat Protein DB
• RepeatMasker libraries (e.g. Dfam and/or RepBase)
RepeatClassifier -consensi Pt-families.fa
45. Hands on popular tools : TEdenovo
How it works:
• It performs a single run with the same input based on all-by-all BLAST
• Several steps are parallelized (Slurm or SGE)
• It uses Recon, Piler and Grouper to build repeat families
• It proposes to run RepeatScout
• It also proposes to run an LTR prediction step
• It includes a classification tool « PASTEClassifier »
49. PreProcess.py
In this case, we will just run step 1 to make sure that the headers won’t be a source of errors:
PreProcess.py -S 1 -i ABQD01.1.fsa_nt -v 3
See the stats:
more ABQD01.1.fsa_nt_CheckFasta/ABQD01.1.fsa_nt.stats
See what happened to the headers (quite ugly but let's move on):
grep '>' ABQD01.1.fsa_nt_CheckFasta/ABQD01.1.fsa_nt.formated
50. Before starting, edit the configuration file « TEdenovo.cfg »
• MySQL database credentials
• Job manager
• Unique identifier (name of fasta file)
• Working directory
53. Hands on popular tools : TEdenovo
Ready to launch!
nohup launch_TEdenovo.py -P Pt -C TEdenovo.cfg -f MCL >& TEdenovo.log &
This command will launch the 8 steps of TEdenovo sequentially
54. Hands on popular tools : TEdenovo
Step 1: Genomic sequences are cut and grouped into batches
Cmd= TEdenovo.py -P Pt -C TEdenovo.cfg -S 1
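What "cut and grouped into batches" means can be sketched as below. This is not REPET code; the chunk length, overlap and batch size are illustrative parameters, and the values used by TEdenovo come from its configuration file.

```python
def cut_into_chunks(seq, chunk_length, overlap):
    """Return (start, end) coordinates of overlapping chunks covering seq."""
    if overlap >= chunk_length:
        raise ValueError("overlap must be smaller than chunk_length")
    if not seq:
        return []
    step = chunk_length - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_length, len(seq))
        chunks.append((start, end))
        if end == len(seq):
            break
        start += step
    return chunks


def group_into_batches(chunks, per_batch):
    """Group chunks into batches that can be processed as separate jobs."""
    return [chunks[i:i + per_batch] for i in range(0, len(chunks), per_batch)]


chunks = cut_into_chunks("A" * 25, chunk_length=10, overlap=2)
print(chunks)
print(group_into_batches(chunks, per_batch=2))
```

Overlapping chunks keep a repeat copy that straddles a cut point detectable in at least one chunk, and batches are what the job manager runs in parallel.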
55. Hands on popular tools : TEdenovo
Step 2: All-by-all BLAST of batches
Cmd= TEdenovo.py -P Pt -C TEdenovo.cfg -S 2 -s Blaster
Jobs are launched in parallel; check job status with squeue
Have a look at the output file:
less Pt_Blaster/Pt.align.not_over.filtered
56. Hands on popular tools : TEdenovo
Step 3: Clustering of HSPs (High-Scoring Pairs) with Grouper, Piler and Recon
Cmd=
TEdenovo.py -P Pt -C TEdenovo.cfg -S 3 -s Blaster -c Grouper
TEdenovo.py -P Pt -C TEdenovo.cfg -S 3 -s Blaster -c Piler
TEdenovo.py -P Pt -C TEdenovo.cfg -S 3 -s Blaster -c Recon
Have a look at the output files:
grep -c '>' Pt_Blaster_Grouper/Pt_Blaster_Grouper_10elem_20seq.fa
449
grep -c '>' Pt_Blaster_Piler/Pt_Blaster_Piler.map.filtered-10-20.flank_size0.fa
11
grep -c '>' Pt_Blaster_Recon/Pt_Blaster_Recon.map.filtered-10-20.flank_size0.fa
416
Piler is very stringent!
57. Hands on popular tools : TEdenovo
Step 4: Align HSPs for each cluster and build consensus
Cmd=
TEdenovo.py -P Pt -C TEdenovo.cfg -S 4 -s Blaster -c Grouper -m Map
TEdenovo.py -P Pt -C TEdenovo.cfg -S 4 -s Blaster -c Piler -m Map
TEdenovo.py -P Pt -C TEdenovo.cfg -S 4 -s Blaster -c Recon -m Map
Have a look at the output files:
grep -c '>' Pt_Blaster_Grouper_Map/Pt_Blaster_Grouper_Map_consensus.fa
27
grep -c '>' Pt_Blaster_Piler_Map/Pt_Blaster_Piler_Map_consensus.fa
1
grep -c '>' Pt_Blaster_Recon_Map/Pt_Blaster_Recon_Map_consensus.fa
26
59. Hands on popular tools : TEdenovo
Step 5: Detect features in consensus sequences
Have a look at the output files:
• HMM profiles
• tBLASTx hits vs Repbase
TEdenovo.py -P Pt -C TEdenovo.cfg -S 5 -s Blaster -c GrpRecPil -m Map
60. Hands on popular tools : TEdenovo
Step 6: Classify, combine and remove redundancy
• Have a look at the output file « .classif »
TEdenovo.py -P Pt -C TEdenovo.cfg -S 6 -s Blaster -c GrpRecPil -m Map
61. Hands on popular tools : TEdenovo
Step 6: Classify, combine and remove redundancy
• Have a look at the output file « .classif_stats.txt »
TEdenovo.py -P Pt -C TEdenovo.cfg -S 6 -s Blaster -c GrpRecPil -m Map
62. Hands on popular tools : TEdenovo
Step 6: Classify, combine and remove redundancy
• Have a look at the output file « Pt_sim_denovoLibTEs.fa »
TEdenovo.py -P Pt -C TEdenovo.cfg -S 6 -s Blaster -c GrpRecPil -m Map
…
63. Hands on popular tools : TEdenovo
Step 7: Filter unwanted consensus
TEdenovo.py -P Pt -C TEdenovo.cfg -S 7 -s Blaster -c GrpRecPil -m Map
Step 8: Build groups of related consensus (families)
TEdenovo.py -P Pt -C TEdenovo.cfg -S 8 -s Blaster -c GrpRecPil -m Map -f MCL
Have a look at the output file « ._statsPerCluster.tab »
=> 3 clusters were found by MCL
TEdenovo done!
66. Genome Annotation
• Rationale: Use sequence libraries to annotate all homologous portions in genomes
• Popular tools:
• RepeatMasker, CENSOR, BLASTER
• TEannot: RepeatMasker + CENSOR + BLASTER
• How does it work?
• BLAST-based homology search between library and genome with different parameters and sensitivity
• RepeatMasker also offers an HMM profile mode
• CENSOR proposes a tBLASTx mode
• Pros: Provides a whole-genome annotation
• Cons: Risk of false positives (e.g. low-complexity sequences)
70. Hands on popular tools : TEannot
Before starting: edit the configuration file and check the parameters
Pt.fa; Pt_refTEs.fa
71. Hands on popular tools : TEannot
Run
launch_TEannot.py -P Pt -C TEannot.cfg -S 1234578
Step 1: Prepare batches and database
Step 2: Align refTEs against each chunk and against randomized chunks
Step 3: Filter and combine hits (Pt_TEdetect_rnd/threshold.tmp)
Step 4: Search for simple sequence repeats (SSRs)
Step 5: Merge SSR annotations
Step 6: Run BLASTx and tBLASTx with complementary libraries
Step 7: Remove spurious hits and perform long-join procedure
Step 8: Generate GFF3 output
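Steps 2 and 3 suggest that hits on randomized chunks are used to set a score threshold for real hits. The sketch below is one plausible reading of that idea and not TEannot's actual code: the use of a percentile, its default value and the function names are assumptions.

```python
def score_threshold(random_scores, percentile=95):
    """Score at the given percentile of hits obtained on randomized chunks."""
    if not random_scores:
        return 0
    ordered = sorted(random_scores)
    # Nearest-rank percentile, computed with integer arithmetic.
    rank = -(-percentile * len(ordered) // 100)  # ceiling division
    index = min(len(ordered) - 1, max(rank - 1, 0))
    return ordered[index]


def filter_hits(hits, threshold):
    """Keep (name, score) hits scoring strictly above the threshold."""
    return [hit for hit in hits if hit[1] > threshold]


threshold = score_threshold(list(range(1, 101)))
print(threshold)
print(filter_hits([("hit1", 50), ("hit2", 300)], threshold))
```

Hits that score no better than what random sequence produces are discarded, which limits false positives from low-complexity matches.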