2014 whitney-research
Upcoming SlideShare
Loading in...5
×
 

Like this? Share it with your network

Share

2014 whitney-research

on

  • 638 views

 

Statistics

Views

Total Views
638
Views on SlideShare
636
Embed Views
2

Actions

Likes
0
Downloads
3
Comments
0

2 Embeds 2

http://www.slideee.com 1
https://demo.plu.mx 1

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

CC Attribution License

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment
  • Transcripts are then mapped back to the chicken genome. Because the transcripts are mature mRNA, only exons will map to the genome.The solid boxes represent exons.As shown in this figure, different isoforms are detected using different parameter settings. The explanation for this phenomenon is unknown. The goal of this step is to define all exons in each gene.
  • Larvae/stream bottoms 3-6 years; parasitic adult -> great lakes, 12-20 months feeding. 5-8 years. 40 lbs of fish per life as parasite. 98% of fish in great lakes went away!
  • XXX
  • Marine invertebrates, chordata phylum, notochord, hollow dorsal nerve cord, pharyngeal slits and a post anal tail at some point in life. colonial, filter feeders
  • Notochord cells present, do not intercalate or extend

2014 whitney-research Presentation Transcript

  • 1. Like the dog that caught the bus: now what? Sequencing, Big Data, and Biology C. Titus Brown Assistant Professor MMG, CSE, BEACON Michigan State University Feb 2014 ctb@msu.edu
  • 2. The challenges of non-model transcriptomics  Missing or low quality genome reference.  Evolutionarily distant.  Most extant computational tools focus on model organisms –  Assume low polymorphism (internal variation)  Assume reference genome  Assume somewhat reliable functional annotation  More significant compute infrastructure …and cannot easily or directly be used on critters of interest.
  • 3. Isoform analysis – some easy…
  • 4. Isoform analysis – some hard Counting methods mostly rely on presence of unique sequence to which to map.
  • 5. Types of Alternative Splicing 40% 25% <5%, more in plants, fungi, protozoa Karen H, Lev-Maor G & Ast G Nat Genet 2010
  • 6. locate, given genomic sequence
  • 7. Genome-reference-free assembly leads to many isoforms. Massive redundancy!
  • 8. Gene models can be “collapsed” given genomic sequence.
  • 9. In sum,  mRNAseq is pretty easy to deal with if you have a good genomic sequence.  We don‟t have a good genomic sequence for many organisms, including lamprey.  We need to do de novo assembly to construct a transcriptome from short reads.  We also have lots and lots of mRNAseq sequence:
  • 10. The problem of lamprey…  Diverged at base of vertebrates; evolutionarily distant from model organisms.  Large, complicated genome (~2 GB)  Relatively little existing sequence.  We sequenced the liver genome…
  • 11. Sea lamprey in the Great Lakes  Non-native  Parasite of medium to large fishes  Caused populations of host fishes to crash Li Lab / Y-W C-D
  • 12. The problem of lamprey…  Diverged at base of vertebrates; evolutionarily distant from model organisms.  Large, complicated genome (~2 GB)  Relatively little existing sequence.  We sequenced the liver genome…
  • 13. Lamprey has incomplete genomic sequence Evidence of somatic recombination; 100s of mb of sequence eliminated from genome during development. More recent evidence (unpub, J. Smith et al.) suggests that this loss is developmentally regulated, results in changes in gene expression (due to loss of genes!), and is tissue specific. Liver genome is not the entire genome. J. Smith et al., PNAS 2009
  • 14. Lamprey tissues for which we have mRNAseq embryo stages (late blastula, gastrula, neurula, 22b, n eural-crest migration, 24c1,24c2) metamorphosis 3 (intestine, kidney) ovulatory female head skin preovulatory female eye adult intestine metamorphosis 4 (intestine, kidney) preovulatory female tail skin adult kidney metamorphosis 5 (liver, intestine, kidney) brain paired metamorphosis 6 (intestine, kidney) prespermiating male gill freshwater (gill, intestine, kidney) metamorphosis 7 (intestine, kidney) mature adult male rope tissue larval (gill, kidney, liver, intestine) monocytes juvenile (intestine, liver, kidney) brain (0,3,21 dpi) lips spinal cord (0.3.21 dpi) metamorphosis 1 (intestine, kidney) metamorphosis 2 (liver, intestine, spermiating male muscle spermiating male gill spermiating male head skin supraneural tissue small parasite distal intestine, kidney, proximal intestine salt water (gill, intestine)
  • 15. Assembly It was the best of times, it was the wor , it was the worst of times, it was the isdom, it was the age of foolishness mes, it was the age of wisdom, it was th It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness …but for lots and lots of fragments!
  • 16. Shared low-level transcripts may not reach the threshold for assembly.
  • 17. Two problems:  We have a massive amount of data that challenges existing computers, and we want to assemble it all together.  We need to construct transcript families (to collapse isoforms) without having a solid reference genome.
  • 18. Solution 1: Digital normalization (a computational version of library normalization) Suppose you have a dilution factor of A (10) to B(1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory. We can discard it for you…
  • 19. Digital normalization
  • 20. Digital normalization
  • 21. Digital normalization
  • 22. Digital normalization
  • 23. Digital normalization
  • 24. Digital normalization
  • 25. Digital normalization approach A digital analog to cDNA library normalization, diginorm:  Is single pass: looks at each read only once;  Does not “collect” the majority of errors;  Keeps all low-coverage reads;  Smooths out coverage of regions. => Enables analyses that are otherwise completely impossible.
  • 26. Solution 2: Partitioning transcripts into “transcript families” Transcript family Pell et al., 2012, PNAS
  • 27. Transcriptome results - lamprey  Started with 5.1 billion reads from 50 different tissues. (4 years of computational research, and about 1 month of compute time, GO HERE) Ended with:
  • 28. Lamprey transcriptome basic stats  616,000 transcripts (!)  263,000 transcript families (!) (This seems like a lot.)
  • 29. Lamprey transcriptome basic stats  616,000 transcripts  263,000 transcript families  Only 20436 transcript families have transcripts > 1kb (compare with mouse: 17331 of 29769 genes are > 1kb) So, estimation by thumb ~ not that off, for long transcripts.
  • 30. Validation -Assume computers lie. How do we judge precision & recall? 1) Homology! Do we see sequence similarity to e.g. mouse sequences? 1) Orthogonal data sets and analyses For example, look at sperm genome, or independently cloned CDS.
  • 31. Evolution: mouse  58,000 lamprey transcript families have some matches to mouse.  10,000 putative orthologs (reciprocal best hits) So that‟s a pretty good sign. (expecting about ~30k total genes) Conclusion: These numbers “feel” good to me; hard to know what to expect after ~350-500 mya.
  • 32. Orthogonal data set: pm2 (liver genome)  64% of our new transcript families have a match in pm2.  71% of conserved transcript families have a match in pm2.  83% of long transcripts have a match in pm2. Good – we don‟t expect 100%, because we know pm2 is probably missing stuff. So that means: Conclusion: At least 64% of transcript families are “really lamprey” (and > 83% of the long transcripts!)
  • 33. Orthogonal data set: sperm genome  94.2% of ref-based transcripts have a match in sperm genome.  98.2% of full-length cDNAs have a match in sperm genome. So sperm genome is “pretty good” for cross validation. But only  71% of our new transcript families have a match in sperm genome. ??
  • 34. Orthogonal data set: sperm genome  94.2% of ref-based transcripts have a match in sperm genome.  98.2% of full-length cDNAs have a match in sperm genome. New transcriptome:  71% of transcript families have a match in sperm genome.  92% (!!) of long transcript families have a match in sperm genome. (Since the sperm genome is low coverage, this length dependence makes sense – the longer the
  • 35. Orthogonal data set: sperm genome  94.2% of ref-based transcripts have a match in sperm genome.  98.2% of full-length cDNA have a match in sperm genome. New transcriptome:  71% of new transcriptome families have a match in sperm genome.  92% (!!) of long transcript families have a match in sperm genome. Conclusion: Our is poorer than but comparable
  • 36. Orthogonal data set: full-length cDNAs  We can look at both precision and recall by asking  Are known sequences represented completely by a single transcript? (“best match”)  Are known sequences covered by one or more transcripts? (“total matches”) 70% 90%
  • 37. Best matches – not great.
  • 38. Total matches – better!
  • 39. Ref-based (lamp0) ”best” are better than new assembly (lamp3)
  • 40. lamp3 “total” is better than lamp0
  • 41. Conclusions from full-length cDNA  Ref-based data set has longer “best matches” (better precision; less fragmented)  De novo assembly is more sensitive overall (better recall; contains more real sequences)
  • 42. Mapping percentages (with orthogonal data) Ona Bloom generated more data; how much maps? Ref-based New/all New/long BR SC 29.20% 42.94% 100.00%100.00% 45.99% 46.89% Conclusion: Ref-based is considerably less “complete” than new, de novo transcriptome assembly.
  • 43. Lamprey transcriptome conclusions  A substantial portion of the new transcriptome seems “good”:  58k transcript families with mouse homology, 10k     orthologs; 20k transcript families with transcripts > 1kb. Good matches to liver genome & sperm genome. Reasonable numbers ~mouse. Much (!) better than ref-based for mapping. (2x as good) But!  Poor recall of known full-length cDNA !?  240k partitions with only small sequences !? => microbial contamination?
  • 44. Separate question: how much of the pm2 genome is missing??  64% of lamp3 transcript families match to pm2.  82.5% of long transcript families match to pm2.  71% of lamp3 transcript families conserved with mouse match in pm2. Conclusion I: Probably about 30% of genic sequence is missing.
  • 45. Separate question: how much of the pm2 genome is missing??  64% of lamp3 transcript families match to pm2.  82.5% of long transcript families match to pm2.  71% of lamp3 transcript families conserved with mouse match in pm2.  22.5% of sperm genome contigs have no hits in pm2. Conclusion II (firmer): About 30% of single-copy sequence is missing.
  • 46. CEGMA based completeness estimates (Core eukaryotic genes) Number seqs Completeness / 100% matches Completeness / partial matches lamp3 entire 620k lamp3 all ORFs > 80aa 269k lamp3 longest ORF in tr 80k 70.6 96.4 46.4 89 41.1 77.8 lamp0 44.7 62.5 11k Camille Scott
  • 47. Looking at the Molgula… Putnam et Modified al., 2008, Nature. from Swalla 2001
  • 48. What do these animals look like? Molgula oculata Molgula oculata Molgula occulta Ciona intestinalis
  • 49. Tail loss and notochord genes a) M. oculata b) hybrid (occulta egg x oculata sperm) c) M. occulta Notochord cells in orange Swalla, B. et al. Science, Vol 274, Issue 5290, 1205-1208 , 15 November 1996
  • 50. Diginorm applied to Molgula embryonic mRNAseq No.$ reads Reads$ of$ kept M.# occulta$ F+3 M.# occulta$ F+3 M.# occulta$ F+4 M.# occulta$ F+5 M.# occulta$ F+6 M.# occulta!Total M.# oculata$ F+3 M.# oculata$ F+4 M.# oculata$ F+6 M.# oculata!Total 42,174,510 50,018,302 44,948,983 53,692,296 45,782,981 236,617,072 47,045,433 52,890,938 50,156,895 150,093,266 15,642,268 6,012,894 3,499,935 2,993,715 2,774,342 30,923,154 10,754,899 3,949,489 2,874,196 17,578,584 Percentage$ kept ? ? ? ? ? 13% ? ? ? 11.70%
  • 51. Question: does normalization “lose” transcript information? M. occulta Diginorm Raw 37 C. intestinalis 13623 M. oculata Diginorm Raw 17 missing 2446 64 C. intestinalis 13646 15 missing 2398 Reciprocal best hit vs. Ciona Blast e-value cutoff: 1e-6 Elijah Lowe
  • 52. Transcriptome assembly thoughts  We can (now) assemble really big data sets, and get pretty good results.  We have lots of evidence (some presented here :) that some assemblies are not strongly affected by digital normalization.
  • 53. Practical implications of diginorm  Data is (essentially) free;  For some problems, analysis is now cheaper than data gathering (i.e. essentially free);  …plus, we can run most of our approaches in the cloud.
  • 54. 1. khmer-protocols Read cleaning  Effort to provide standard “cheap” assembly protocols for the cloud. Diginorm  Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set.  Open, versioned, forkable, citable. Assembly Annotation RSEM differential expression
  • 55. CC0; BSD; on github; in reStructuredText.
  • 56. 2. Data availability is important for annotating distant sequences no similarity Anything else Mollusc Cephalopod
  • 57. Can we incentivize data sharing?  ~$100-$150/transcriptome in the cloud  Offer to analyze people‟s existing data for free, IFF they open it up within a year. See: • CephSeq white paper. • “Dead Sea Scrolls & Open Marine Transcriptome Project” blog post;
  • 58. First results: Loligo genomic/transcriptome resources Putting other people‟s sequences where my mouth is:
  • 59. Tools to routinely update metazoan orthology/homology relationships  > 100 mRNAseq data sets already;  Build interconnections between them via homology;  Build tools to update interconnections as new data sets arrive.  Provide raw data, processed data, underlying tools, simple Web interface, all CC0/in da cloud/open/reproducible. (Question: what biology problems could we tackle?)
  • 60. “Research singularity” The data a researchers generates in their lab constitutes an increasingly small component of the data used to reach a conclusion. Corollary: The true value of the data an individual investigator generates should be considered in the context of aggregate data. Even if we overcome the social barriers and incentivize sharing, we are, needless to say, not remotely prepared for sharing all the data.
  • 61. We practice open science! Everything discussed here:  Code: github.com/ged-lab/ ; BSD license  Blog: http://ivory.idyll.org/blog („titus brown blog‟)  Twitter: @ctitusbrown  Grants on Lab Web site: http://ged.msu.edu/research.html  Preprints: on arXiv, q-bio: „diginorm arxiv‟
  • 62. Acknowledgements Lab members involved               Adina Howe (w/Tiedje) Jason Pell Arend Hintze Qingpeng Zhang Elijah Lowe Likit Preeyanon Jiarong Guo Tim Brom Kanchan Pavangadkar Eric McDonald Camille Scott Jordan Fish Michael Crusoe Leigh Sheneman Collaborators  Josh Rosenthal (UPR)  Weiming Li, MSU  Ona Bloom (Feinstein), Jen Morgan (MBL), Joe Funding Buxbaum (MSSM) USDA NIFA; NSF IOS; NIH; BEACON.