2014 villefranche
  1. C. Titus Brown, Assistant Professor (MMG, CSE, BEACON), Michigan State University; May 2014; ctb@msu.edu
     Applying mRNAseq to non-model organisms: challenges, opportunities, and solutions
  2. We practice open science! Everything discussed here:
     • Code: github.com/ged-lab/ ; BSD license
     • Blog: http://ivory.idyll.org/blog (‘titus brown blog’)
     • Twitter: @ctitusbrown
     • Grants on lab web site: http://ged.msu.edu/research.html
     • Preprints available.
  3. Sequencing has become very inexpensive.
  4. Sequencing costs
     • Approximately $1000 of mRNAseq will yield a decent transcriptome.
     • Multiple samples will allow you to generate gene inventories.
     • For the ascidian project I will show you:
        • 1 graduate student,
        • 2 transcriptomes,
        • 3 genomes…
  5. Mapping => quantitation. Reference transcriptome required.
  6. Interpreting RNAseq requires gene models: http://www.hitseq.com/images/RNA-seq_AS.jpg
  7. The challenges of non-model transcriptomics
     • Missing or low-quality genome reference.
     • Evolutionarily distant.
     • Most extant computational tools focus on model organisms:
        • Assume low polymorphism (internal variation)
        • Assume reference genome
        • Assume somewhat reliable functional annotation
        • More significant compute infrastructure
       …and cannot easily or directly be used on critters of interest.
  8. Outline
     1. Challenges of non-model transcriptomics.
     2. Lamprey: too much data, not enough genome
     3. Digital normalization as a coping mechanism
     4. …applied to Molgulid ascidians…
     5. …and back to lamprey.
     6. More transcriptome challenges
     7. What’s next?
     Note: I also work on metagenomics, which I will not discuss today.
  9. Sea lamprey in the Great Lakes
     • Non-native
     • Parasite of medium to large fishes
     • Caused populations of host fishes to crash
     (Li Lab / Y-W C-D)
  10. The problem of lamprey:
     • Diverged at base of vertebrates; evolutionarily distant from model organisms.
     • Large, complicated genome (~2 Gb)
     • Relatively little existing sequence.
     • We sequenced the liver genome…
  11. Lamprey has incomplete genomic sequence (J. Smith et al., PNAS 2009)
     • Evidence of somatic recombination; 100s of Mb of sequence eliminated from the genome during development.
     • More recent evidence (unpub., J. Smith et al.) suggests that this loss is developmentally regulated, results in changes in gene expression (due to loss of genes!), and is tissue specific.
     • Liver genome is not the entire genome.
  12. Lamprey tissues for which we have mRNAseq:
     embryo stages (late blastula, gastrula, neurula, 22b, neural-crest migration, 24c1, 24c2); metamorphosis 3 (intestine, kidney); ovulatory female head skin; adult intestine; metamorphosis 4 (intestine, kidney); preovulatory female eye; adult kidney; metamorphosis 5 (liver, intestine, kidney); preovulatory female tail skin; brain paired; metamorphosis 6 (intestine, kidney); prespermiating male gill; freshwater (gill, intestine, kidney); metamorphosis 7 (intestine, kidney); mature adult male rope tissue; larval (gill, kidney, liver, intestine); monocytes; spermiating male gill; juvenile (intestine, liver, kidney); brain (0, 3, 21 dpi); spermiating male head skin; lips; spinal cord (0, 3, 21 dpi); supraneural tissue; metamorphosis 1 (intestine, kidney); spermiating male muscle; small parasite (distal intestine, kidney, proximal intestine); metamorphosis 2 (liver, intestine, …); salt water (gill, intestine)
  13. Assembly
     Overlapping fragments:
        It was the best of times, it was the wor
        , it was the worst of times, it was the
        isdom, it was the age of foolishness
        mes, it was the age of wisdom, it was th
     Assembled:
        It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
     …but for lots and lots of fragments!
  14. Shared low-level transcripts may not reach the threshold for assembly.
  15. Main problem (4 years ago): We have a massive amount of data that challenges existing computers when we try to assemble it all together.
  16. Solution: Digital normalization (a computational version of library normalization)
     Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory. We can discard it for you…
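The arithmetic on slide 16 can be written out as a small sketch. The function names and the 10x coverage cap are illustrative assumptions, not values taken from the talk beyond the 10:1 example itself.

```python
def required_coverage(abundance_ratio, target_coverage):
    """Coverage the abundant transcript reaches once the rare one hits its target.

    abundance_ratio: how many times more abundant A is than B (10 on the slide).
    target_coverage: coverage wanted for the rare transcript B (10x on the slide).
    """
    return abundance_ratio * target_coverage


def redundant_fraction(coverage, cap):
    """Fraction of a transcript's reads that lie beyond a chosen coverage cap.

    The cap is a hypothetical parameter used only for illustration.
    """
    return max(0.0, 1.0 - cap / coverage)


a_coverage = required_coverage(abundance_ratio=10, target_coverage=10)
print(a_coverage)                              # coverage of A when B reaches 10x
print(redundant_fraction(a_coverage, cap=10))  # share of A's reads above the cap
```

Under these assumptions, most of the reads from the abundant transcript add no new information, which is the redundancy that digital normalization aims to discard.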
  17-22. Digital normalization (six figure-only slides)
  23. Digital normalization approach
     A digital analog to cDNA library normalization, diginorm:
     • Is single pass: looks at each read only once;
     • Does not “collect” the majority of errors;
     • Keeps all low-coverage reads;
     • Smooths out coverage of sequencing.
     => Enables analyses that are otherwise completely impossible.
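The properties listed on slide 23 can be illustrated with a toy, single-pass sketch. It assumes that a read's coverage is estimated as the median abundance of its k-mers among the reads kept so far, and it uses an exact in-memory counter. The function names, `k`, and `cutoff` are hypothetical choices for illustration; this is not the khmer implementation.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=5, cutoff=3):
    """Keep a read only if its estimated coverage is still below the cutoff.

    Each read is examined exactly once (single pass). Only kept reads
    contribute their k-mers to the counts.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: no coverage estimate possible
        estimated_coverage = median(counts[km] for km in read_kmers)
        if estimated_coverage < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Ten copies of an abundant read plus one rare read.
reads = ["ACGTTGCAAG"] * 10 + ["TTTTGGGGCC"]
kept = digital_normalization(reads, k=5, cutoff=3)
print(len(kept), kept[-1])
```

In this toy run the abundant read is retained only until its estimated coverage reaches the cutoff, while the rare read is always kept, which mirrors the "keeps all low-coverage reads" and "smooths out coverage" points above.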
  24. Evaluating diginorm: how?
     • Can’t assemble lamprey w/o diginorm; are results any good & how would we know?
     • Need comparative data set
     • …ascidians!
  25. Looking at the Molgula… (Putnam et al., 2008, Nature. Modified from Swalla 2001)
  26. Sea squirts! [Images: Molgula oculata; Molgula occulta; Molgula oculata; Ciona intestinalis] (Elijah Lowe; collaboration w/Billie Swalla)
  27. Challenging organisms to work on:
     • Only spawn ~1 month out of the year
     • Located off the northern coast of France (Roscoff)
     • Hybrids not found outside of lab conditions
     • Species cannot be cultured (yet)
     • Wet lab techniques are not fully developed for species
  28. Tail loss and notochord genes
     a) M. oculata b) hybrid (occulta egg x oculata sperm) c) M. occulta. Notochord cells in orange.
     Swalla, B. et al. Science, Vol 274, Issue 5290, 1205-1208, 15 November 1996
  29. Diginorm applied to Molgula embryonic mRNAseq
  30. Substantial time savings (3-5x) and far less RAM. (Elijah Lowe)
  31. Question: does it matter what assembly pipeline you use? (No)
     [Venn diagram comparing four assemblies: Diginorm V/O, Raw V/O, Diginorm Trinity, Raw Trinity]
     Numbers are putative orthologs (reciprocal best hits) w/Ciona intestinalis, calculated for each assembly. (Elijah Lowe)
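Slide 31 counts putative orthologs as reciprocal best hits. A minimal sketch of that idea follows, assuming each search result is a `(query, subject, score)` tuple with higher scores being better. The identifiers and scores are made up for illustration and do not come from the Molgula data.

```python
def best_hits(hits):
    """Map each query to its single highest-scoring subject."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical search results in both directions.
molgula_vs_ciona = [
    ("molg1", "ciona1", 200.0),
    ("molg1", "ciona2", 90.0),
    ("molg2", "ciona1", 150.0),
]
ciona_vs_molgula = [
    ("ciona1", "molg1", 210.0),
    ("ciona2", "molg1", 80.0),
]
print(reciprocal_best_hits(molgula_vs_ciona, ciona_vs_molgula))
```

Here `molg2` also matches `ciona1`, but because `ciona1` points back to `molg1`, only the `molg1`/`ciona1` pair counts as a putative ortholog.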
  32. How complete are these transcriptomes? (Elijah Lowe)
  33. Shift in differentially expressed genes from gastrulation to neurulation
     [Figure panels: M. ocu vs. M. occ gastrula; M. ocu vs. M. occ neurula; differentially expressed during neurulation in M. ocu vs. M. occ]
  34. Notochord gene expression similar to tailed species
     [Scatter plot: expression difference, hybrid vs. parent species; axes log2(hybrid) - log2(oculata) and log2(hybrid) - log2(occulta)]
  35. M. occulta transgenic NoTrlc (Alberto Stolfi & Lionel Christiaen)
  36. Lionel Christiaen (NYU); Claudia Racioppi (Stazione Zoologica Napoli)
  37. Enabling Molgula research…
     • Develop candidate genes to generate hypotheses about gene network evolution;
     • Rapid development of genomic resources => reporter constructs.
     Doesn’t answer any biological questions directly, but enables us to go looking for things much faster!
  38. Transcriptome assembly thoughts
     • We can (now) assemble really big data sets, and get pretty good results.
     • We have lots of evidence (some presented here :) that some assemblies are not strongly affected by digital normalization.
     (Note: normalization algorithm is now standard part of Trinity mRNAseq pipeline.)
  39. Transcriptome results - lamprey
     • Started with 5.1 billion reads from 50 different tissues.
     (4 years of computational research, and about 1 month of compute time, GO HERE)
     Ended with:
  40. Lamprey transcriptome basic stats
     • 616,000 transcripts (!)
     • 263,000 transcript families (!)
     (This seems like a lot.)
  41. Lamprey transcriptome basic stats
     • 616,000 transcripts
     • 263,000 transcript families
     • Only 20,436 transcript families have transcripts > 1 kb (compare with mouse: 17,331 of 29,769 genes are > 1 kb)
     So a rule-of-thumb estimate is not far off for long transcripts.
  42. Common vs rare genes [Plot: # transcripts vs. # samples] (Camille Scott)
  43. Can look at transcripts by tissue (Camille Scott)
  44. Too… many… samples… Presence/absence clustering (Camille Scott)
  45. Expression-based clustering. Some known biology recapitulated; and… ??? (Camille Scott)
  46. Next steps with lamprey
     • Far more complete transcriptome than the one generated from the genome!
     • (…but suffering from contamination, oversensitivity to unprocessed transcripts, …?)
     • Enabling studies in:
        • Basal vertebrate phylogeny
        • Biliary atresia
        • Evolutionary origin of brown fat (previously thought to be mammalian only!)
        • Pheromonal response in adults
        • Spinal cord regeneration
  47. Next challenges
     OK, we can deal with volume of data, make pretty pictures, and ... Now what?
  48. Contamination! Both experimental and “real” contaminants are big problems. (Camille Scott)
  49. Pathway predictions vary dramatically depending on data set, annotation
     KEGG pathway comparison across several different gene annotation sets for chicken (Likit Preeyanon)
  50. The problem of lopsided gene characterization is pervasive: e.g., the brain "ignorome"
     "...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery being associated with greater research momentum -- a genomic bandwagon effect."
     Ref.: Pandey et al. (2014), PLoS One 11, e88889. Slide courtesy Erich Schwarz.
  51. Practical implications of diginorm
     • Data is (essentially) free;
     • For some problems, analysis is now cheaper than data gathering (i.e. essentially free);
     • …plus, we can run most of our approaches in the cloud (per-hour rental compute resources, e.g. Amazon Web Services).
  52. 1. khmer-protocols
     • Effort to provide standard “cheap” assembly protocols for the cloud.
     • Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis.
     • Open, versioned, forkable, citable.
     (“Don’t bother me unless it doesn’t work.”)
     Pipeline stages: read cleaning; diginorm; assembly; annotation; RSEM differential expression.
  53. CC0; BSD; on github; in reStructuredText.
  54. A few thoughts on our approach…
     • Explicitly a “protocol”: explicit steps, copy-paste, customizable.
     • No requirement for computational expertise or significant computational hardware.
     • ~1-5 days to teach a bench biologist to use.
     • $100-150 of rental compute (“cloud computing”)…
     • …for $1000 data set.
     • Adding in quality control and internal validation steps.
  55. 2. Data availability is important for annotating distant sequences
     [Chart categories: anything else; mollusc; cephalopod; no similarity]
  56. Can we incentivize data sharing?
     • ~$100-$150/transcriptome in the cloud
     • Offer to analyze people’s existing data for free, IFF they open it up within a year.
     See:
     • CephSeq white paper.
     • “Dead Sea Scrolls & Open Marine Transcriptome Project” blog post.
     Note: data sets can now be cited.
  57. First results: Loligo genomic/transcriptome resources
     Putting other people’s sequences where my mouth is: w/Josh Rosenthal and Benton Grav…
  58. Acknowledgements
     Lab members involved:
     • Adina Howe (w/Tiedje)
     • Jason Pell
     • Arend Hintze
     • Qingpeng Zhang
     • Elijah Lowe
     • Likit Preeyanon
     • Jiarong Guo
     • Tim Brom
     • Kanchan Pavangadkar
     • Eric McDonald
     • Camille Scott
     • Jordan Fish
     • Michael Crusoe
     • Leigh Sheneman
     Collaborators:
     • Billie Swalla (UW)
     • Josh Rosenthal (UPR)
     • Weiming Li (MSU)
     • Ona Bloom (Feinstein), Jen Morgan (MBL), Joe Buxbaum (MSSM)
     Funding: USDA NIFA; NSF IOS; NIH; BEACON.
  59. Elijah Lowe (MSU)
  60. C. Titus Brown (MSU); Billie J. Swalla (UW)
  61. Thanks!
