What should Bioinformatics do for EvoDevo?
1. Insights into the evolution and development of planarian regeneration from the genome of the flatworm Girardia tigrina
SUJAI KUMAR
2014-07-24 VIENNA EURO EVODEVO
WHAT SHOULD BIOINFORMATICS DO FOR EVODEVO?
3. SUJAI KUMAR
"Winkel triple projection SW" by Strebe - Own work
Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons
http://commons.wikimedia.org/wiki/File:Winkel_triple_projection_SW.jpg
Cartoonist and mathematics teacher in New Delhi
4. SUJAI KUMAR
Finding patterns in sequences: TIMSS 1999 video study
MS in Educational Psychology at the University of Illinois
8. Outline of this talk
1. Regeneration, planarian flatworms, and Girardia tigrina
2. Creating G tigrina genomic resources
3. Using these resources to understand regeneration
4. What should bioinformatics do for EvoDevo
10. 1. Regeneration, planarian flatworms, and Girardia tigrina
Kao, 2014. PhD Thesis “Transcriptome assembly and analysis of the freshwater planarian Schmidtea mediterranea”
[Figure: phylogeny of Platyhelminthes. Taxa shown: Cestoda, Monogenea, Trematoda, Rhabditophora, Turbellaria, Tricladida, Macrostomorpha, Lecithoepitheliata, Rhabdocoela, Polycladida. Six taxa are marked "T"; two species are marked "G".]
Girardia tigrina (G): aboobakerlab.com/genomes
Schmidtea mediterranea (G): smedgd.neuro.utah.edu
11. 1. Regeneration, planarian flatworms, and Girardia tigrina
• What we know already
• Some genes and pathways that are essential for whole-body regeneration (WBR)
• Some transcript expression profiles
• No transgenics in any planarian
17. 2. Creating G tigrina genomic resources
Sequencing > Assembly > Annotation > Delivery
• Quality Control
• Raw data QC: FastQC
• Preliminary assembly: Blobology
• Separate components: contaminants / endosymbionts / mitochondrial
• Assess insert sizes: bad mate-pair libraries confound scaffolding
• Generate many assemblies
• ABySS, CLC, MaSuRCA, SGA, SPAdes, ALLPATHS-LG
• Evaluate assemblies
• FRCbam, REAPR, CGAL
• CEGMA, alignments to known sequences
• Freeze and release
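The Blobology step above relies on plotting each contig's GC content against its read coverage, so that contaminant or endosymbiont contigs stand out as separate clusters. The Python sketch below illustrates only that underlying idea; it is not the Blobology code, and the function names, the `(gc, coverage)` input format, and the thresholds are assumptions chosen for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (contig_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(seq: str) -> float:
    """Fraction of G/C among A, C, G, T bases (N and other codes are ignored)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def flag_outliers(
    stats: Dict[str, Tuple[float, float]],
    gc_range: Tuple[float, float],
    min_cov: float,
) -> List[str]:
    """Return sorted contig ids whose GC or coverage falls outside the window.

    `stats` maps contig id -> (gc_fraction, mean_read_coverage). The window
    is a hypothetical, hand-picked cut-off, not a recommended value.
    """
    low, high = gc_range
    return sorted(
        name
        for name, (gc, cov) in stats.items()
        if not (low <= gc <= high) or cov < min_cov
    )
```

In practice the cut-offs would be chosen by eye from the GC-versus-coverage plot, and flagged contigs would be checked against sequence databases before being set aside.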
18. 2. Creating G tigrina genomic resources
Sequencing > Assembly > Annotation > Delivery
• NOT a great assembly
• But it was GoodEnough™
• Next version with long-insert mate pairs
• Diploid, but high heterozygosity
Assembly version             nGt.0.3          nGt.0.5
Raw read data                ~500M short read pairs, 160 Gbases
                             Consolidating near-identical contigs
Total span (Gbases)          1.898            1.500
Num contigs                  581,558          422,617
Span of contigs >10kb (bp)   541,653,308      536,575,093
Num contigs >10kb            29,050           27,495
N50 (bp)                     5,751            6,827
CEGMA                        45%              56%
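For readers less familiar with the metrics in this table: N50 is the contig length at which contigs of that length or longer contain at least half of the total assembly span. A minimal Python sketch of how N50 and the ">10kb" rows could be computed from a list of contig lengths is shown here; the function names are illustrative and this is not the script used for the talk.

```python
from typing import List, Tuple


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L hold at least half the span."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def span_over(lengths: List[int], min_len: int = 10_000) -> Tuple[int, int]:
    """Return (number of contigs, total bases) for contigs longer than min_len."""
    long_contigs = [n for n in lengths if n > min_len]
    return len(long_contigs), sum(long_contigs)
```

Because N50 depends on the total span, merging near-identical contigs can change it even when no contig is extended, so it is best read alongside the span and CEGMA rows.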
19. 2. Creating G tigrina genomic resources
Sequencing > Assembly > Annotation > Delivery
• Gene prediction
• RNA-seq
• Predictors: Augustus, SNAP, GeneMark
• Consolidators: MAKER, EVM, ENSEMBL genebuild
• Evaluate: use Annotation Edit Distance (AED) as a metric
• Functional annotation
• InterProScan, Trinotate, Blast2GO
• Community annotation
• WebApollo, Community Annotation Portal
Annotation version: nGt.0.5.1
Num of genes: 39,119
Num of genes with AED>0.5: 35,061
Mean aa length: 268
Num of genes with InterPro annotations: 22,747
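Annotation Edit Distance scores how well a gene model agrees with its supporting evidence: 0 means complete agreement and 1 means no overlap. As commonly defined, AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered by the prediction and SP is the fraction of predicted bases covered by the evidence. The Python sketch below works through that formula on exon intervals; the half-open `(start, end)` interval format and the function names are assumptions for illustration, not the implementation inside any annotation pipeline.

```python
from typing import Iterable, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end), an assumed convention


def covered_bases(exons: Iterable[Interval]) -> Set[int]:
    """Set of base positions covered by a list of exon intervals."""
    bases: Set[int] = set()
    for start, end in exons:
        bases.update(range(start, end))
    return bases


def aed(prediction: Iterable[Interval], evidence: Iterable[Interval]) -> float:
    """Annotation Edit Distance between a predicted model and its evidence."""
    pred = covered_bases(prediction)
    evid = covered_bases(evidence)
    if not pred or not evid:
        return 1.0  # nothing to compare: treat as unsupported
    shared = len(pred & evid)
    sensitivity = shared / len(evid)  # evidence bases recovered
    specificity = shared / len(pred)  # predicted bases supported
    return 1.0 - (sensitivity + specificity) / 2.0
```

For example, a prediction covering bases 0 to 100 compared with evidence covering bases 50 to 150 shares half of each, giving an AED of 0.5.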
21. 3. Using these resources to understand regeneration
• Individual genes and pathways
• Transgenics
• Protein ortholog analysis
• 4 triclads, 1 other platyhelminth, 2 ecdysozoa, 4 deuterostomes
• 14k out of 40k G tigrina proteins in strict ortholog clusters
• ~8000 triclad-specific clusters
• ~800 triclad-specific clusters with all 4 species represented
• Cis-regulatory analysis
• Neoblast specific regulatory regions
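The triclad-specific counts in the ortholog analysis above come down to asking which species are represented in each cluster. A minimal Python sketch of that classification follows. The species codes, the `species|protein` identifier format, and the category names are all hypothetical, since the slides do not say how the clusters were stored or which tool produced them.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Set

# Hypothetical species codes standing in for the four triclads in the analysis.
TRICLADS: FrozenSet[str] = frozenset({"Gtig", "Smed", "TricladC", "TricladD"})


def species_in(cluster: List[str]) -> Set[str]:
    """Species codes present in a cluster of 'species|protein' identifiers."""
    return {member.split("|", 1)[0] for member in cluster}


def classify(cluster: List[str], triclads: FrozenSet[str] = TRICLADS) -> str:
    """Label a non-empty cluster by its species composition."""
    species = species_in(cluster)
    if not species <= triclads:
        return "mixed"  # contains at least one non-triclad species
    if species == triclads:
        return "triclad_specific_all"  # every triclad represented
    return "triclad_specific_partial"


def summarise(clusters: Dict[str, List[str]]) -> Counter:
    """Count clusters in each category."""
    return Counter(classify(members) for members in clusters.values())
```

The same set comparison extends to other groupings, such as lophotrochozoan-only or protostome-only clusters, by swapping in a different species set.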
22. 4. What should bioinformatics do for EvoDevo
• What should I do for an experimental EvoDevo lab
• Visual > Text
• View additional information in place
• Plot everything vs everything
• Create gene models visually
• Routine analyses should not require a bioinformatician
• Clear explanations of how a resource was created
• Not too many versions
• Minimum standards
23. 4. What should bioinformatics do for EvoDevo
• What should the bioinformatics community do for me as an EvoDevo bioinformatician
• Best practice documentation for analyses
• Easy to install tools
• Minimum standards for assembly, metadata, annotation, and delivery
• Grants for coordination, tools, resources
24. Summary
• Please use the resources at aboobakerlab.com/genomes
• Tell us what other resources you’d like to see as standard
• Fund technology development and training
25. Acknowledgements
• AboobakerLab.com
• Aziz Aboobaker
• Natalia Pouchkina-Stantcheva
• Damian Kao
• Yuliana Mihaylova
• Aphrodite Zhao
• Blaxter Lab (nematodes.org)
• Ben Elsworth (Badger)
• Sequencing
• Edinburgh Genomics
• Funding
• BBSRC
• BSDB / Company of Biologists travel grant
Editor's Notes
Target audience
Biologists who want genomic resources for their favourite species but are not sure of what is possible
Bioinformaticians who are creating these resources
Pic of sequencing from Lex – highlight technologies, put costs/advantages on side
Drosophila contigs better than Sanger Sequencing
42 SMRT cells for 160 Mb; approx. cost for 60 = 18,000, so 40 should cost ~12,000
PacBio promising 4X improvement so even lower
Human 54X coverage ~ 40,000 so much less
1 in 200 bp
1 in 500 bp
Delivery: whatever you choose, be sure it can support a large number of draft contigs/scaffolds. Many tools that work well for a few chromosomes or a few hundred scaffolds don’t work so well for thousands (as we had)
Think of a good way to get: Triclad only, Lopho only, Protostomia only, All
(rename sp as cols, and check cols) using regexp (with counts)
Work in progress. We do some things ok.
1. Planmine and Badger vs FTP downloads
2. Additional info in one place (don’t make users click around too much). Gbrowse is fabulous - http://banana-genome.cirad.fr great example, inline help as well
3.
In the future – on the fly? (what takes time is the computation and recomputation for a whole genome, but individual genes/contigs should be doable on the fly)
3.
Say something provocative
Work in progress. We do some things ok.
1. Compare badger to – FTP download site (because it has descriptive information)