Lecture on the annotation of transposable elements

Wednesday
• SSH key and virtual machine
• Theoretical course : annotation in genomes
• Practical course de novo : RepeatModeler and TEdenovo
• Practical course masking : RepeatMasker and TEannot

SSH keys
Unix: ssh-keygen
Mac: ssh-keygen -t rsa
Windows: use puttygen

Create account
at https://biosphere.france-bioinformatique.fr/
Request to join the group « bioinfo_te_2020 »
Use your public ssh key, starting with « ssh-rsa… »

Detection of repetitive elements
in assembled genomes
Florian Maumus
BioinfoTE CNRS thematic school
28 Sept. – 2 Oct., Fréjus, France

The C-value paradox
• Organism complexity does not correlate with increasing C-value
• Similar organisms can show large differences in C-values
• The amount of protein-coding DNA can be a minor fraction of genomic DNA

The C-value paradox
Genome size variations can be explained by:
• The size of intergenic regions
• The size of regulatory regions
• The size of introns
• The presence of pseudogenes
• WGD and polyploidization events
• The amount of repetitive DNA
• The amount of genomic dark matter

The Repeatome includes:
Transposable elements
Endogenous viruses
Tandem repeats
Ribozymes
Genes
…
Repeat complement = Repeatome

Pioneer approach to quantify
repetitive fraction in genomes
Roy J. Britten
1919 - 2012
• Reassociation kinetic experiments
• The rate at which a particular sequence will
reassociate is proportional to the number of
times it is found in the genome.
E. coli DNA
Calf DNA
Britten & Kohne, 1968

• Address the diversity and evolution of transposable elements
• Genome masking for gene prediction
• Understand genome structure & composition
• Address genome evolution
• Investigate gene regulation & transcription networks
• Qualify genomic landscapes (e.g. epigenetic marks, 3D organization)
Why is it important to identify repeats
and repeat-derived elements in genomes?

Complexity of the repeatome
Maumus & Quesneville
Current Opinion in Plant Biology 2016

Maize
2.3 Gb genome
Single run of de novo detection
=> 85% repeats
Human
3.2 Gb genome
Decades of annotation
to reach 50% repeats
Different history, different challenges

What are we looking for ?
How can we define a transposable element
from a genomic point of view?
• Autonomous TEs encode the proteins necessary to their own duplication
• TEs have specific structures (e.g. terminal repeats)
• TEs are repetitive in a genome (multicopy)
• We know all kinds of TEs
=> NO, the TE complement is defined by a continuum ranging from
functional copies to genomic dark matter.
 In most genomes, some repetitive elements remain unclassified and we
have no idea what they represent

A - Library-based searches
• banks of transposable element sequence data
B - De novo
• Signature-based
• K-mer-based (strict, extended, cloud)
• High scoring pairs (BLAST-based)
Bioinformatic approaches to identify
repetitive elements in genomes
(interspersed)

! Not a « de novo » approach
For instance, use an existing repeat library if you need to annotate a human
genome
• TE libraries are available for many organisms
• They can be produced from manual curation and/or by automated search
• They can be more or less exhaustive
• They can contain host gene sequences and false positives
• They can be high quality for model species like human or thale cress
• There are different types of databases (general, TE type, host type)
TE sequence databases

Widely used repeat databases : Repbase

2.0 repeat databases : Dfam (HMM profiles)

Signature-based approaches
E. Lerat, Heredity, 2010

E. Lerat, Heredity, 2010
A large choice of tools
…
Finds specific TEs only

• K-mers are used in different ways for repeat detection: count, anchors,
and clouds
• K-mers: oligonucleotides of length « k », e.g. ATGC is a 4-mer
• Considering random distribution and equal base frequency, the
probability of finding a given k-mer at a given position is (1/4^k)
• To determine a reasonable oligo length (k) for analyses of different
length genomes (n), we can use the formula “k = log4 (n) + 1”
• K=15 for A. thaliana (120 Mb); K=16 for H. sapiens (3Gb)
• For instance: 3Gb/4^16= 0.7
• The probability of finding 10 occurrences of the same 16-mer is very
low
K-mers
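The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and because the slide does not say how the result of the formula is rounded, the raw value is returned.

```python
import math


def suggested_k(genome_size: int) -> float:
    """Raw value of k = log4(n) + 1, before any rounding."""
    return math.log(genome_size, 4) + 1


def expected_occurrences(genome_size: int, k: int) -> float:
    """Expected count of one given k-mer in a random sequence of
    genome_size bases with equal base frequencies: n / 4^k."""
    return genome_size / 4 ** k


# Genome sizes quoted on the slide
for name, n in [("A. thaliana", 120_000_000), ("H. sapiens", 3_000_000_000)]:
    print(name, "k (raw) =", round(suggested_k(n), 2))

# The slide's example: 3 Gb / 4^16
print(round(expected_occurrences(3_000_000_000, 16), 1))
```

With an expected count below 1 for any given 16-mer, a 16-mer seen many times in a 3 Gb genome is unlikely to be there by chance, which is the point of the last bullet.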

• Rationale: Repeats are identical upon integration (duplication event)
• Popular tools: Tallymer, JellyFish, DSK
• How it works? K-mers are simply counted and repetitive k-mers are
mapped onto the genome sequence
• Pros: Very fast, helps estimate the extent of the first-layer repeatome
• Cons: Works well only with very recent repeats
…TAGGGTACGTGATGATCCGTAGCTAGCCTAGCTAAAGTCCCGATTTAGC…
…TAGGGTACGTGAAGATCCGTAGCTAGCCTAGCTATAGTCCCGATTTAGC…
…TAGGGTACGAGATGATCCGTAGCTAGCCTAGCTAAAGTCCCCATTTAGC…
K-mers - Simple
time
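The "count, then map back" idea can be sketched in a few lines. This is a toy illustration on a made-up sequence, not how Tallymer, JellyFish or DSK are implemented; it counts forward-strand k-mers only and the function names are hypothetical.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer on the forward strand."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repetitive_positions(sequence: str, k: int, min_count: int) -> list[int]:
    """Start positions (0-based) of k-mers seen at least min_count times."""
    seq = sequence.upper()
    counts = count_kmers(seq, k)
    return [i for i in range(len(seq) - k + 1)
            if counts[seq[i:i + k]] >= min_count]


# Toy sequence (made up): the 6-mer ACGTAG occurs twice
genome = "TTACGTAGGCCACGTAGTT"
print(repetitive_positions(genome, k=6, min_count=2))
```

A single mutation inside a copy changes every k-mer overlapping it, which is why exact counting only works well for very recent repeats.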

• Rationale: Repeats conserve k-mers that are untouched by random
mutations
• Popular tools: RepeatScout
• How it works? Repetitive k-mers are detected and used as anchors to begin
the alignment of flanking sequences. The process stops when the
alignment score stops increasing and a consensus sequence is derived
• Pros: Fast, provides a library of repetitive elements; filters low complexity
sequences; works well with short repeats (100-200bp)
• Cons: Consensus sequences are often fragmented
K-mers - Extended

• Rationale: k-mers that are affected by random mutations can be very similar
to other repetitive k-mers. A high local density of these k-mers indicates
sequences that potentially result from duplication
• Popular tools: P-clouds (De Koning et al.)
• How it works? Repetitive k-mers are detected and used to build clouds
together with similar k-mers presenting 1-3 SNPs. It takes either full
genomes or known copies to build the seed.
• Pros: Fast, deep repeatome annotation
• Cons: Risk of high false positive rate, fragmented annotation
In-cloud k-mers
Duplication
K-mers - Clouds
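The cloud idea can be illustrated as follows: repetitive "core" k-mers are found by counting, then any observed k-mer within a few mismatches of a core k-mer joins the cloud. This is a deliberately simplified sketch under that assumption, not the P-clouds algorithm itself; the thresholds, names and toy sequence are made up.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def build_cloud(sequence: str, k: int, core_min_count: int, max_mismatches: int):
    """Return (core, cloud): core k-mers are repetitive; the cloud adds
    observed k-mers within max_mismatches substitutions of a core k-mer."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    core = {kmer for kmer, c in counts.items() if c >= core_min_count}
    cloud = set(core)
    for kmer in counts:
        if kmer not in core and any(hamming(kmer, c) <= max_mismatches for c in core):
            cloud.add(kmer)
    return core, cloud


# Toy sequence: two exact copies of ACGTAG and one mutated copy, ACGAAG
toy = "ACGTAGTTTTACGTAGCCCCACGAAG"
core, cloud = build_cloud(toy, k=6, core_min_count=2, max_mismatches=1)
print(sorted(core), sorted(cloud))
```

The mutated copy is missed by exact counting but recovered by the cloud; loosening `max_mismatches` recovers more, at the cost of the false positives listed under Cons.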

• Rationale: Different copies from a repeat family share sequence similarity
over long evolutionary timescales. A given copy should produce BLAST hits
against cognate copies.
• Popular tools: Piler, Recon, Silix
• How it works? The genome is compared to itself using BLASTn. High-scoring
pairs are grouped into clusters on the basis of similarity and coverage.
Sequences from each cluster are aligned together to derive a consensus
sequence.
• Pros: Provides a library of potentially full length repetitive elements that
accounts for more or less conserved families
• Cons: Clustering can be a very long process, sometimes endless
||||||||| || ||||||||||||||||||||| |||||| |||||||
High-scoring pairs (BLAST-based)
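The grouping step can be pictured as single-linkage clustering of the copies connected by retained HSPs. The sketch below assumes each HSP has already been reduced to a tuple (copy A, copy B, identity, coverage); Piler, Recon and the other tools apply their own, more elaborate rules, so the thresholds and names here are purely illustrative.

```python
def cluster_pairs(pairs):
    """Single-linkage clustering of linked items using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for item in list(parent):
        clusters.setdefault(find(item), set()).add(item)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical HSPs: (copy A, copy B, identity, coverage)
hsps = [
    ("copy1", "copy2", 0.92, 0.90),
    ("copy2", "copy3", 0.85, 0.80),
    ("copy4", "copy5", 0.95, 0.95),
    ("copy1", "copy4", 0.60, 0.30),  # too weak, filtered out below
]
kept = [(a, b) for a, b, identity, coverage in hsps
        if identity >= 0.8 and coverage >= 0.5]
print(cluster_pairs(kept))
```

Each resulting cluster would then be aligned to derive one consensus sequence. On a real genome the number of pairs grows quickly, which is one reason the clustering step can take so long.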

High-scoring pairs - pipelines
• Rationale: Combining different tools produces more exhaustive repeat libraries
• Popular tools:
• RepeatModeler: RepeatScout + Recon (+ LTR tool in v2)
• TEdenovo: Grouper + Piler + Recon + RepeatScout (+ LTR tool)
• How it works? Several detection tools are launched sequentially or in parallel.
Redundancy between consensus sequences is removed afterwards.
• Pros: Provides a library of potentially full length repetitive elements that
accounts for most repeat families
• Cons: Usually takes a genome subset as input
Maumus & Quesneville, 2014

Extensive combination : Pirate

Consensus sequences: useful
artefacts

Consensus sequences: useful
artefacts
• Simple clustering programs are based on sequence identity, not relationships
• COSEG is a program which automatically identifies repeat subfamilies using
significant co-segregating (2-3 bp) mutations.
• AnTE is a probabilistic approach that infers the likelihood that each copy
is an ancestral (master) copy
Wacholder et al. 2014
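One way to read "useful artefacts" is that a consensus is a summary that may not match any real copy. The sketch below builds a majority-rule consensus from three made-up, already aligned copies, each carrying one private mutation; it assumes gap-free alignment and is not the method used by any specific tool.

```python
from collections import Counter


def majority_consensus(aligned: list[str]) -> str:
    """Most frequent base per column of equal-length aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must all have the same length")
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*aligned))


# Three hypothetical copies, each with one private substitution
copies = ["ACGATGCA", "ACCTTGCA", "TCGTTGCA"]
consensus = majority_consensus(copies)
print(consensus, consensus in copies)
```

Here the consensus differs from every copy: it approximates the shared ancestor, which is useful for homology searches but should not be mistaken for an element that exists in the genome.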

• Find the right tool for each question you are addressing
• Find an acceptable trade-off between
• Sensitivity
• Specificity
• Speed
• Quality
• Few benchmarks are available, they are often not exhaustive (tools)
• Benchmark definition is also loose and variable (data, organisms)
• Compare and/or combine different tools
• Carefully check the settings
• Perform manual curation or at least take a look at the outputs
• Don’t expect to reach perfection with any tool(s)
Take home
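When a reference annotation is available, the sensitivity and specificity trade-off can be measured per base. The sketch below is one possible way to do it, assuming half-open (start, end) intervals on a single sequence; the intervals and genome length are made up and the function names are illustrative.

```python
def covered_positions(intervals):
    """Set of positions covered by half-open (start, end) intervals."""
    return {pos for start, end in intervals for pos in range(start, end)}


def base_level_metrics(predicted, reference, genome_length):
    """Return (sensitivity, specificity) computed per base."""
    pred, ref = covered_positions(predicted), covered_positions(reference)
    tp = len(pred & ref)           # repeat bases correctly annotated
    fp = len(pred - ref)           # bases wrongly annotated as repeats
    fn = len(ref - pred)           # repeat bases that were missed
    tn = genome_length - tp - fp - fn
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Hypothetical annotations on a 100 bp sequence
reference = [(10, 20), (50, 60)]
predicted = [(12, 22), (55, 60)]
print(base_level_metrics(predicted, reference, genome_length=100))
```

Storing positions in sets is fine for a toy example; for whole genomes an interval-based implementation would be needed.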

Hands on popular tools
Practical session
• RepeatModeler (tetools package)
• TEdenovo (REPET package)
Our beast today will be the diatom
Phaeodactylum tricornutum (stramenopile)
Genome size = 27 Mb

Hands on popular tools : RepeatModeler
How it works:
• It performs a first layer of detection with RepeatScout
• Then it performs successive runs of Recon using increasing size of masked
genomic subsets to identify more divergent repeats
• The new version also proposes to run an LTR prediction step
• It proposes a classification tool « RepeatClassifier »

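Process overview
[Figure: RepeatModeler process overview. Round #1: RepeatScout builds the RS library. Rounds #2 to #n: TRF mask, RepeatMasker and Recon run on samples of increasing size, each round producing a Recon library. The RS library and the Recon library are combined into the Rmodeler library.]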

Hands on popular tools : RepeatModeler
The magic numbers in the RepeatModeler perl script

• Two command lines:
BuildDatabase -name Pt Pt.fa
RepeatModeler -database Pt -pa 4 1>& rmod.out &

sort -k 2,2nr RM_21.FriSep180904562020/round-1/sampleDB-1.fa.lfreq
RepeatModeler – Round-1 (=RepeatScout)
Build k-mer frequency table

RepeatModeler – Round-1 (=RepeatScout)
Build consensus, retrieve copies, filter, and refine consensus for each family
Output library = consensi-refined.fa
Price et al. 2005

RepeatModeler – Round-2 (= BLAST + Recon)
Main steps are as follows:
• BLAST batches against each other
• build families of hits with Recon
• infer consensus
• refine consensus

more RM_21.FriSep180904562020/round-2/msps.out
See the BLAST output; it will be the input for Recon

After the last round of Recon, the two consensus libraries are combined
Output library = Pt-families.fa
With this example, we get:
• 143 consensus sequences from RepeatScout
• 4 consensus sequences from Recon
Finishing

RepeatClassifier
• Homology-based classification module
• Compares the consensus generated by the various tools to
• RepeatMasker Repeat Protein DB
• RepeatMasker libraries (e.g. Dfam and/or RepBase)
RepeatClassifier -consensi Pt-families.fa

Hands on popular tools : TEdenovo
How it works:
• It performs a single run with the same input based on all-by-all BLAST
• Several steps are parallelized (Slurm or SGE)
• It uses Recon, Piler and Grouper to build repeat families
• It proposes to run RepeatScout
• It also proposes to run an LTR prediction step
• It includes a classification tool « PASTEC »

+ RepeatScout (v2.2)
REPET Classification utility
Hands on popular tools : TEdenovo
Process overview

The magic numbers in the TEdenovo configuration file (TEdenovo.cfg)

Before starting, the script PreProcess.py can help prepare your sample

PreProcess.py -S 1 -i ABQD01.1.fsa_nt -v 3
PreProcess.py
In this case, we will just run step 1 to make sure that the headers won’t be a source
of errors
See the stats:
more ABQD01.1.fsa_nt_CheckFasta/ABQD01.1.fsa_nt.stats
See what happened to headers (quite ugly but let's move on):
grep '>' ABQD01.1.fsa_nt_CheckFasta/ABQD01.1.fsa_nt.formated

Before starting, edit the configuration file « TEdenovo.cfg »
mySQL database credentials
Job manager
Unique identifier (name of fasta file)
Working directory

Essential
banks

Ready to launch!
nohup launch_TEdenovo.py -P Pt -C TEdenovo.cfg -f MCL >& TEdenovo.log &
This command will launch the 8 steps of TEdenovo sequentially

Step 1: Genomic sequences are cut and grouped into batches
Cmd= TEdenovo.py -P Pt -C TEdenovo.cfg -S 1

Step 2: All-by-all BLAST of batches
Jobs are launched in parallel; check job status with squeue
Cmd= TEdenovo.py -P Pt -C TEdenovo.cfg -S 2 -s Blaster
Have a look at the output file:
less Pt_Blaster/Pt.align.not_over.filtered

Step 3: Clustering of HSPs (High-Scoring Pairs) with Grouper, Piler and Recon
Cmd=
TEdenovo.py -P Pt -C TEdenovo.cfg -S 3 -s Blaster -c Grouper
TEdenovo.py -P Pt -C TEdenovo.cfg -S 3 -s Blaster -c Piler
TEdenovo.py -P Pt -C TEdenovo.cfg -S 3 -s Blaster -c Recon
Have a look at the output files:
grep -c '>' Pt_Blaster_Grouper/Pt_Blaster_Grouper_10elem_20seq.fa
449
grep -c '>' Pt_Blaster_Piler/Pt_Blaster_Piler.map.filtered-10-20.flank_size0.fa
11
grep -c '>' Pt_Blaster_Recon/Pt_Blaster_Recon.map.filtered-10-20.flank_size0.fa
416
Piler is very stringent!

Step 4: Align HSPs for each cluster and build consensus
Cmd=
TEdenovo.py -P Pt -C TEdenovo.cfg -S 4 -s Blaster -c Grouper -m Map
TEdenovo.py -P Pt -C TEdenovo.cfg -S 4 -s Blaster -c Piler -m Map
TEdenovo.py -P Pt -C TEdenovo.cfg -S 4 -s Blaster -c Recon -m Map
grep -c '>' Pt_Blaster_Grouper_Map/Pt_Blaster_Grouper_Map_consensus.fa
27
grep -c '>' Pt_Blaster_Piler_Map/Pt_Blaster_Piler_Map_consensus.fa
1
grep -c '>' Pt_Blaster_Recon_Map/Pt_Blaster_Recon_Map_consensus.fa
26
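The same counts can be obtained from a script, which is convenient when comparing many output files. A minimal sketch, assuming plain uncompressed FASTA; the function name and the demo file are illustrative. Unlike `grep -c '>'`, it only counts lines that start with `>`.

```python
import os
import tempfile


def count_fasta_records(path: str) -> int:
    """Number of sequences in a FASTA file (lines starting with '>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Demo on a small made-up file; replace with e.g. a *_consensus.fa path
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo_consensus.fa")
    with open(demo, "w") as handle:
        handle.write(">consensus_1\nACGT\n>consensus_2\nGGCC\nTTAA\n")
    print(count_fasta_records(demo))
```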

PASTEC: the REPET Classification Utility
[Figure: PASTEC workflow. The consensus library goes through a TR search, Tandem Repeat Finder, BLASTx and tBLASTx against Repbase, and Pfam and GyDB HMM profiles; rDNA, tRNA and host genes are also considered.]
Summary of evidence => Proposed classification
Consensus 1: termLTRs; 0.12% TR; Bx: AtGypsy; Btx: none; profiles: IN, RT => LTR retro
Consensus 2: none; 0.32% TR; Bx: none; Btx: none; profiles: LRR => Host gene
Consensus 3: none; 0.23% TR; Bx: none; Btx: none; profiles: none => Unclassified

Step 5: Detect features in consensus sequences
• HMM profiles
• tBLASTx hits vs Repbase
TEdenovo.py -P Pt -C TEdenovo.cfg -S 5 -s Blaster -c GrpRecPil -m Map

Step 6: Classify, combine and remove redundancy
• Have a look at the output file « .classif »
GrpRecPil -m Map

• Have a look at the output file « .classif_stats.txt »
GrpRecPil -m Map

• Have a look at the output file « Pt_sim_denovoLibTEs.fa »
GrpRecPil -m Map
…

Step 7: Filter unwanted consensus
Step 8: Build groups of related consensus (families)
Have a look at the output file « ._statsPerCluster.tab »
=>3 clusters were found by MCL
TEdenovo done!
GrpRecPil -m Map
GrpRecPil -m Map -f MCL

TEdenovo directory
Step 1
Step 2
Step 3
Step 4
Step 5&6
Step 7
Step 8

• Rationale: Use sequence libraries to annotate all homologous portions in
genomes
• Popular tools:
• RepeatMasker, CENSOR, BLASTER
• TEannot: RepeatMasker + CENSOR + BLASTER
• How it works?
• BLAST-based homology search between library and genome with
different parameters and sensitivity
• RepeatMasker also offers an HMM profiles mode
• CENSOR proposes a tBLASTx mode
• Pros: Provides a whole genome annotation
• Cons: Risk of false positives (e.g. low complexity sequences)
Genome Annotation

Hands on popular tools : RepeatMasker
Parameters

Hands on popular tools : RepeatMasker
Run
RepeatMasker -nolow -no_is -pa 8 -dir . -gff -lib Pt_sim_denovoLibTEs.fa Pt.fa 1>& rm.out &
Main output files:
• Pt_rm.fa.tbl
• Pt_rm.fa.out.gff
• Pt_rm.fa.masked

Hands on popular tools : TEannot
Overview

Before starting: edit the configuration file and check the parameters
Pt.fa; Pt_refTEs.fa

Run
launch_TEannot.py -P Pt -C TEannot.cfg -S 1234578
Step 1: Prepare batches and database
Step 2: Align refTEs against each chunk and against randomized chunks
Step 3: Filter and combine hits (Pt_TEdetect_rnd/threshold.tmp)
Step 4: Search for simple sequence repeats (SSRs)
Step 5: Merge SSR annotations
Step 6: Run BLASTx and tBLASTx with complementary libraries
Step 7: remove spurious hits and perform long-join procedure
Step 8: Generate GFF3 output

Postprocess: PostAnalyzeTELib.py
export REPET_USER=orepet
export REPET_PW=repet_pw
export REPET_DB=repet
PostAnalyzeTELib.py -a 3 -p Pt_chr_allTEs_nr_noSSR_join_path -s Pt_refTEs_seq -g 24612623
Output file « .stats »
done !

TEannot directory
Step 1
Step 2, Step 3 filtering, Step 7
Step 4 & 5
Step 8
Step 2 random and Step 3 thresholds
PostAnalyzeTELib

More TE resources
Compilation by Tyler Elliott: 340 tools !!
