Acknowledgements
Thanks go to Elizabeth Smikle, Jireh Agda, Nolan Dickson, Dina Soliman and Megan Milton for programming,
image generation and layout support.
Next Steps
• What types of data need to be included
to make RepeatFUnL useful?
• What formats should be
supported/compatible?
• Please fill out our questionnaire at
www.repeatfunl.org
• Gain the cooperation and support of the
MGE and repeat community and of
private databases
• Please contact us if you are interested in
furthering this project
• telliott@boldsystems.org
• @TransposableMan
Future Goals
• Provide analysis tools to aid in data
curation and generation for users
• Serve as a platform to enforce
community-developed standards for
MGE and repeat annotation and
classification
• Develop teaching applications to
introduce students to genomic data and
curation
• Understand the impact of MGEs and
repeats on phenotypic variation and
disease across the Tree of Life
• Unravel the evolutionary diversity of
MGEs and other mobile DNA
Value Added by RepeatFUnL
• Aggregate data across sources in a single,
searchable format for easy download
• Build on the expertise and reputation of the
Centre for Biodiversity Genomics in
developing and maintaining mature
sequence databases and NGS analysis
resources (BOLD, mBRAVE)
• Make computationally intensive data
generated by experts more discoverable
and usable for the general scientific
community
• Universal data schema for repeat and
MGE transactions and storage of data
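One way to picture aggregation across sources is the minimal sketch below. It is not the RepeatFUnL implementation: the source names, field names and the matching rule are all hypothetical. It assumes that two entries describe the same element when they share a species and an element name, which is a simplification; real deduplication would likely need more than that. Each merged record keeps the identifier it had in every source, so the original databases stay linked rather than replaced.

```python
# Illustrative only: source names, fields and values are invented placeholders.
sources = {
    "source_db_1": [
        {"id": "A001", "species": "Species A", "name": "Element X"},
        {"id": "A002", "species": "Species A", "name": "Element Y"},
    ],
    "source_db_2": [
        {"id": "B417", "species": "Species A", "name": "Element X"},
    ],
}


def aggregate(sources):
    """Merge entries from several sources into one list of records.

    Entries with the same (species, name) pair are treated as duplicates
    (an assumption made for this sketch) and collapsed into one record
    that remembers each source's own identifier.
    """
    merged = {}
    for source_name, entries in sources.items():
        for entry in entries:
            key = (entry["species"], entry["name"])
            record = merged.setdefault(
                key,
                {"species": entry["species"], "name": entry["name"], "external_ids": {}},
            )
            record["external_ids"][source_name] = entry["id"]
    return list(merged.values())


aggregated = aggregate(sources)
for record in aggregated:
    print(record["name"], record["external_ids"])
```

Keeping the per-source identifiers inside each merged record is a design choice for this sketch: it lets a user trace any aggregated entry back to the database it came from.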
RepeatFUnL: Filterable Universal Library
• RepeatFUnL will aggregate MGE and
repeat information across databases,
and will support and enhance current
databases rather than replace them
• The central units of RepeatFUnL are
Repeat Records
• Data are stored in a NoSQL format to aid in
searching and filtering a large, distributed
dataset
• Will include data from databases and
primary literature, data uploaded by
users, and data generated de novo
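As a rough illustration of how a Repeat Record might look as a schema-free document, and how such documents can be filtered, consider the sketch below. It is an assumption-laden example, not the actual RepeatFUnL schema or query engine: the top-level sections loosely follow the labels on the poster's Repeat Record diagram (Taxonomy, References, Associated Data, External IDs), while every field name and value is a hypothetical placeholder. Plain Python dictionaries stand in for a NoSQL document store.

```python
# Hypothetical Repeat Records; all field names and values are placeholders.
records = [
    {
        "record_id": "RR-1",
        "taxonomy": {"group": "Eukaryote", "species": "Species A"},
        "classification": "LTR retrotransposon",
        "external_ids": {"source_db_1": "A001"},
        "references": ["Placeholder reference 1"],
        "associated_data": {"length_bp": 5200},
    },
    {
        "record_id": "RR-2",
        "taxonomy": {"group": "Eukaryote", "species": "Species B"},
        "classification": "DNA transposon",
        "external_ids": {"source_db_2": "B417"},
        "references": [],
        "associated_data": {"length_bp": 1300},
    },
    {
        "record_id": "RR-3",
        "taxonomy": {"group": "Archaea and Bacteria", "species": "Species C"},
        "classification": "Insertion sequence",
        "external_ids": {},
        "references": [],
        "associated_data": {"length_bp": 900},
    },
]

_MISSING = object()  # sentinel for "this path does not exist in the document"


def get_path(doc, path):
    """Follow a dotted path such as 'taxonomy.group' through nested dicts."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def filter_records(docs, criteria):
    """Return the documents whose values equal every path/value pair in criteria."""
    return [
        doc
        for doc in docs
        if all(get_path(doc, path) == value for path, value in criteria.items())
    ]


eukaryote_records = filter_records(records, {"taxonomy.group": "Eukaryote"})
dna_transposons = filter_records(
    records,
    {"taxonomy.group": "Eukaryote", "classification": "DNA transposon"},
)
print([r["record_id"] for r in eukaryote_records])
print([r["record_id"] for r in dna_transposons])
```

Because each record is a self-contained document, records from different sources can carry different optional fields without a schema migration; a record that lacks a queried field simply fails to match.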
Repeat Data Challenges
• Mobile genetic element (MGE) and repeat information is of value for a variety of disciplines
(evolution, ecology, agriculture, medicine, biotechnology)
• MGE and repeat data are difficult to generate and require curation, and there are few standards
for storage, classification and annotation
• Long-read and cheaper sequencing will enable large projects to generate millions of genomes
over the next decade, so managing repeat information will be crucial (Figure 1)
• Many databases exist (Table 1), but they can be hard to search and download from, and their
data are duplicated and fragmented across multiple databases
• Repeat information would greatly benefit from better connectivity and searchability
[Diagram labels: Analyze, Download, Upload, Curate, Collaborate, Search]
Tyler A. Elliott and Sujeevan Ratnasingham
Centre for Biodiversity Genomics, University of Guelph, Ontario, Canada
Developing a comprehensive, integrative repeat database for the broad scientific community
[Diagram labels: Genomes, Databases, MGE/Repeat, Community, Literature]
Figure 1. Projected growth in genomes sequenced over the next decade.
Table 1. Current repeat and MGE information. * indicates an underestimate.
MGE/Repeat Statistic               Number
MGE records in Databases           1.3 million
Accessions with MGEs in GenBank    6 million*
Repeat records in Databases        8 million
Species with MGE/repeat records    ~3000
[Diagram labels: Repeat Records, Taxonomy, References, Associated Data, External IDs]
[Figure 1 plot. Title: Number of Genomes Sequenced. x-axis: Year (2018 to 2028). y-axis: Genomes (millions), 0.0 to 10.0. Series: Archaea and Bacteria, Eukaryote, Plasmid, Virus.]