2014 davis-talk

  1. 1. Genomics and bioinformatics in non-model organisms: where is the data tidal wave taking us? C. Titus Brown Assistant Professor Microbiology; Computer Science; BEACON Michigan State University Feb 2014 ctb@msu.edu
  2. 2. Practical implications of sequencing: One graduate student; Two transcriptomes; Three draft genomes; In four years. [Images: Molgula oculata, Molgula occulta, Ciona intestinalis] (Elijah Lowe)
  3. 3. Research: Agricultural genomics & transcriptomics; Metagenomics (environmental & host-associated); Evo-devo genomics & transcriptomics; Novel computational approaches. Computing + Biology: Education and training; Good software development; Capacity building; Open science/source/data/access
  4. 4. Research: Agricultural genomics & transcriptomics; Metagenomics (environmental & host-associated); Evo-devo genomics & transcriptomics; Novel computational approaches. Computing + Biology: Education and training; Good software development; Capacity building; Open science/source/data/access
  5. 5. Our research philosophy: • Enable good biology by generating hypotheses worth testing. • Try to maximize sensitivity of analyses, in light of fairly high specificity in sequencing-based approaches. • Collaborate intensively on research projects. • Typically, share graduate students with "wet" labs. • Goal is to cross-train everyone involved.
  6. 6. Three mini-stories: 1. Building better gene models for chicken 2. Dealing with an endless stream of data 3. Evaluating the effect of gene model completeness on pathway prediction.
  7. 7. 1. Building a better chicken (gene model) • Most extant computational tools focus on model organisms: • Assume low polymorphism (internal variation) • Assume quality reference genome or transcriptome • Assume somewhat reliable functional annotation • More significant compute infrastructure requirements • How can we best use mRNAseq for chicken? (Likit Preeyanon)
  8. 8. Interpreting RNAseq requires gene models: http://www.hitseq.com/images/RNA-seq_AS.jp
  9. 9. Marek's Disease project: • To identify alternative splicing that contributes to disease resistance. w/Hans Cheng, USDA ADOL [Images: Inbred line 6, Inbred line 7]
  10. 10. Types of Alternative Splicing [Figure: splicing types with frequency labels 40%; 25%; <5%, more in plants, fungi, protozoa] Keren H, Lev-Maor G & Ast G, Nat Rev Genet 2010
  11. 11. Data • RNA-Seq from chicken line 6 (resistant) and 7 (susceptible) • Pre and post infection • Single-end reads for assembly (~30 million reads x 4) • Paired-end reads for validation (~40 million reads x 4) • Chicken genome: galGal3 • ESTs from UCSC genome website • mRNA from Genbank w/Hans Cheng, USDA ADOL; Jerry Dodgson, M
  12. 12. Pipeline: Global Assembly (k=21-31; Velvet 1.2.03, Oases 0.2.06) and Local Assembly (k=21-31) -> Trimming and cleaning (Seqclean) -> Mapping to a genome (BLAT) -> Build all putative isoforms, together with other gene models (Gimme 0.9.0) -> Predict coding regions (ESTScan 2.1)
  13. 13. Local Assembly – early attempt to scale Tophat 2.0 Velvet/Oases Assembler
  14. 14. Predicting putative isoforms w/Gimme: Source code is publicly available at https://github.com/ged-lab/gimme.git
  15. 15. Exon Graph approach ("Gimme") [Figure: exon graph; labels exon1, exon2, exon3, exon3.a, exon3.b, intron1, intron2] https://github.com/ged-lab/gimme.git (Likit Preeyanon)
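A minimal sketch of the exon-graph idea, assuming exons are nodes and each observed splice junction is a directed edge, so that every path through the graph is a putative isoform. This is an illustration only, not Gimme's actual implementation; the function names and the toy gene are hypothetical, and the graph is assumed to be acyclic.

```python
from collections import defaultdict


def build_exon_graph(junctions):
    """Map each upstream exon to the set of exons it is spliced to."""
    graph = defaultdict(set)
    for upstream, downstream in junctions:
        graph[upstream].add(downstream)
    return graph


def enumerate_isoforms(graph, start, path=None):
    """Depth-first walk yielding every path from `start` to a terminal exon."""
    path = (path or []) + [start]
    successors = sorted(graph.get(start, ()))
    if not successors:
        # No outgoing junction: this exon ends a putative isoform.
        yield path
        return
    for nxt in successors:
        yield from enumerate_isoforms(graph, nxt, path)


# Hypothetical toy gene with two alternative third exons.
junctions = [("exon1", "exon2"), ("exon2", "exon3.a"), ("exon2", "exon3.b")]
graph = build_exon_graph(junctions)
isoforms = list(enumerate_isoforms(graph, "exon1"))
```

Each new data set only adds edges, which is one way to read the later point about combining data sets incrementally.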
  16. 16. Predicting putative isoforms w/Gimme: Source code is publicly available at https://github.com/ged-lab/gimme.git
  17. 17. We recover annotated isoforms… USP15 Both annotated isoforms are detected by our pipeline.
  18. 18. …and we detect unknown isoforms. TOM1 Local assembly increases sensitivity of isoform detection.
  19. 19. Example of extended 3' UTR: SLC25A3
  20. 20. Gene Model Summary

      Method                    Genes    Transcripts
      Global Assembly           14,832   32,311
      Local Assembly            15,297   23,028
      Global + Local Assembly   15,934   46,797

      *Number of genes and transcripts might be overestimated due to incomplete assembly and spurious splice junctions.
  21. 21. Cross-validation with technical replicates. Does independent sequencing data confirm? Later, better data => confirms.

      Dataset             Single-end mapped     Single-end unmapped   Paired-end mapped     Paired-end unmapped
      Line 6 uninfected   18,375,966 (77.93%)   5,203,586 (22.07%)    21,598,218 (64.16%)   12,065,659 (35.84%)
      Line 6 infected     17,160,695 (73.18%)   6,288,286 (26.82%)    15,274,638 (63.89%)   8,633,855 (36.11%)
      Line 7 uninfected   18,130,072 (75.77%)   5,795,737 (24.22%)    20,961,033 (63.67%)   11,960,299 (36.33%)
      Line 7 infected     19,912,046 (78.51%)   5,450,521 (21.49%)    22,485,833 (65.22%)   11,992,002 (34.78%)
  22. 22. Cross-validation w/read splicing 95% of splice junctions have more than three spliced reads
  23. 23. Splice junction comparison [Venn diagram: Assembled transcripts (104,366), Genbank mRNA (74,065), Expressed Sequence Tags (209,134); region counts in extraction order: 7,756; 2,412; 21,128; 46,132; 17,765; 34,694; 110,543] 95% of splice junctions supported by > 4 reads.
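A three-way comparison like this can be sketched with plain set operations, assuming each splice junction is keyed by (chromosome, donor position, acceptor position). The key format, the function name, and the toy coordinates below are assumptions for illustration; the slide does not say how junctions were matched.

```python
def venn_regions(named_sets):
    """Count items in each exclusive Venn region.

    `named_sets` maps a source name to a set of junction keys. The result
    maps a sorted tuple of source names to the number of junctions found
    in exactly those sources.
    """
    regions = {}
    universe = set().union(*named_sets.values())
    for item in universe:
        key = tuple(sorted(name for name, members in named_sets.items()
                           if item in members))
        regions[key] = regions.get(key, 0) + 1
    return regions


# Hypothetical toy junctions: (chromosome, donor, acceptor).
junction_sets = {
    "assembled": {("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 90)},
    "mrna": {("chr1", 100, 200), ("chr2", 50, 90)},
    "est": {("chr1", 100, 200), ("chr1", 500, 600)},
}
regions = venn_regions(junction_sets)
```

Junctions that appear only under the "assembled" key are the candidates for novel splice sites, which is where read-level support matters most.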
  24. 24. Gimme pipeline • Our pipeline can detect many isoforms • Local assembly enhances isoform detection • Cufflinks (mapping-based gene models) is not superior to de novo transcriptome assembly in chicken… (Was Cufflinks trained on mouse/human?) • The pipeline can be used to build gene models for other organisms • Pipeline can do incremental combining of new data sets
  25. 25. How to detect differential splicing [Figure: spliced-read counts and read coverage per exon, compared between samples]
  26. 26. Exon Region Comparison [Figure: spliced-read counts and read coverage per exon region, compared between samples]
  27. 27. Skipped Exon DEXseq
  28. 28. Skipped Exon sulfatase
  29. 29. BRCA1 domain Alternative 3' UTR DNA repair, apoptosis, DNA replication, genome stability
  30. 30. Differential Exon Usage Summary (Chromosome 1; total 3,728 genes)

      Number of exons called differentially used:

      Adjusted p-value   False    True
      0.1                18,631   66
      0.01               18,656   41
      0.001              18,663   34

      Next steps: scaling analysis to entire genome. And… interpretation (??)
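For readers unfamiliar with adjusted p-values, here is a small sketch of how per-exon p-values could be corrected and then counted at the thresholds in the table. It assumes the Benjamini-Hochberg procedure; the slide does not state which correction was applied, and the function names and toy p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so monotonicity can be enforced.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_significant(adjusted, thresholds=(0.1, 0.01, 0.001)):
    """Count exons whose adjusted p-value falls below each threshold."""
    return {t: sum(p < t for p in adjusted) for t in thresholds}


# Hypothetical per-exon p-values.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
summary = count_significant(adjusted)
```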
  31. 31. Gene model thoughts • Can build gene models that represent the data we have fairly well; • Robust exon-exon splice site reporting; • Planning ahead for multiple iterations of new data; • …interpretation of results? See story 3.
  32. 32. 2. Endless data! • It is now under $1000 to generate a new mRNAseq data set. • Collaborators routinely generate new data sets every 3-6 months… (note: each of them, x 5-10…) • How can we make use of this data iteratively!?
  33. 33. Making iterative use of new data. [Diagram labels: Data!; Existing gene models; Refined gene models; Differential expression ??] Some data will yield new gene models, but much will be redundant (e.g. "housekeeping" genes)
  34. 34. Digital normalization
  35. 35. Digital normalization
  36. 36. Digital normalization
  37. 37. Digital normalization
  38. 38. Digital normalization
  39. 39. Digital normalization
  40. 40. Digital normalization approach. A digital analog to cDNA library normalization, diginorm: • Is single pass: looks at each read only once; • Does not "collect" the majority of sequencing errors; • Keeps all low-coverage reads; • Enables analyses that are otherwise completely impossible; • Integrated into several assemblers (Trinity and
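The single-pass behaviour described above can be sketched as follows, assuming the usual formulation in which a read is kept only if the median count of its k-mers, among reads kept so far, is below a coverage cutoff. This sketch uses an exact in-memory `Counter`; production tools such as khmer use a fixed-memory probabilistic counting structure instead. The function names and default values are illustrative.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of `seq`."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Yield reads whose estimated coverage is still below `cutoff`.

    Each read is examined exactly once. Coverage is estimated as the
    median count of the read's k-mers among previously kept reads.
    """
    counts = Counter()
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read is shorter than k
        if median(counts[w] for w in words) < cutoff:
            counts.update(words)  # only kept reads contribute to counts
            yield read


# Toy example: one highly redundant read plus one rare read.
reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
kept = list(digital_normalization(reads, k=4, cutoff=3))
```

Because the function is a generator over an iterable, it can consume reads as they stream from a file without holding the whole data set in memory; only the k-mer counts are retained.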
  41. 41. Evaluating on ascidians (sea squirts): Molgula oculata, Molgula occulta, Ciona intestinalis
  42. 42. Diginorm applied to Molgula embryonic mRNAseq: set aside ~90% of data

      Sample             No. of reads   Reads kept   Percentage kept
      M. occulta F+3     42,174,510     15,642,268   ?
      M. occulta F+3     50,018,302     6,012,894    ?
      M. occulta F+4     44,948,983     3,499,935    ?
      M. occulta F+5     53,692,296     2,993,715    ?
      M. occulta F+6     45,782,981     2,774,342    ?
      M. occulta Total   236,617,072    30,923,154   13%
      M. oculata F+3     47,045,433     10,754,899   ?
      M. oculata F+4     52,890,938     3,949,489    ?
      M. oculata F+6     50,156,895     2,874,196    ?
      M. oculata Total   150,093,266    17,578,584   11.70%
  43. 43. But: does diginorm "lose" transcript information? No. [Figure: Venn diagrams comparing Diginorm and Raw assemblies for M. occulta and M. oculata against C. intestinalis; values in extraction order: 37; 13623; 17 missing; 2446; 64; 13646; 15 missing; 2398] Reciprocal best hit vs. Ciona; BLAST e-value cutoff: 1e-6 (Elijah Lowe)
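A reciprocal-best-hit comparison with an e-value cutoff can be sketched as below. It assumes BLAST results have already been parsed into (query, subject, e-value, bit score) tuples and that the best hit is the one with the highest bit score; the slide specifies only the 1e-6 cutoff, so the tie-breaking rule, function names, and toy identifiers are assumptions.

```python
def best_hits(hits, evalue_cutoff=1e-6):
    """Return {query: best subject} using the highest bit score per query."""
    best = {}
    for query, subject, evalue, bitscore in hits:
        if evalue > evalue_cutoff:
            continue  # hit is too weak to consider
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, evalue_cutoff=1e-6):
    """Return the set of (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b, evalue_cutoff)
    reverse = best_hits(b_vs_a, evalue_cutoff)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}


# Hypothetical parsed BLAST hits: (query, subject, e-value, bit score).
molgula_vs_ciona = [
    ("molg1", "ciona1", 1e-30, 200.0),
    ("molg1", "ciona2", 1e-10, 90.0),
    ("molg2", "ciona2", 1e-3, 40.0),   # fails the e-value cutoff
    ("molg3", "ciona1", 1e-20, 150.0),
]
ciona_vs_molgula = [
    ("ciona1", "molg1", 1e-28, 190.0),
    ("ciona2", "molg1", 1e-9, 85.0),
]
pairs = reciprocal_best_hits(molgula_vs_ciona, ciona_vs_molgula)
```

Running the same comparison on the raw and the normalized assembly, then intersecting the matched Ciona genes, gives the kind of overlap counts shown in the figure.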
  44. 44. Where are we taking diginorm? • Streaming online algorithms only look at data ~once. • Diginorm is streaming, online… • Conceptually, can move many aspects of sequence analysis into streaming mode. => Extraordinary potential for computational efficiency.
  45. 45. => Streaming, online variant calling. Single pass, reference free, tunable, streaming online variant calling. Potentially quite clinically useful. See NIH BIG DATA grant, http://ged.msu.edu/
  46. 46. Prospective: sequencing tumor cells • Goal: phylogenetically reconstruct causal "driver mutations" in face of passenger mutations. • 1000 cells x 3 Gbp x 20 coverage: 60 Tbp of sequence. • Most of this data will be redundant and not useful. • Developing diginorm-based algorithms to eliminate data while retaining variant information. See NIH BIG DATA grant, http://ged.msu.edu/
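The data-volume estimate on this slide is simple multiplication, shown here so the inputs can be varied. The function name is hypothetical; the three inputs are the figures quoted on the slide.

```python
def sequencing_volume_tbp(cells, genome_size_gbp, coverage):
    """Total sequence in terabase pairs for per-cell whole-genome sequencing."""
    total_bp = cells * genome_size_gbp * 10**9 * coverage
    return total_bp / 10**12


# Figures from the slide: 1000 cells, 3 Gbp genome, 20x coverage.
total_tbp = sequencing_volume_tbp(cells=1000, genome_size_gbp=3, coverage=20)
```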
  47. 47. 3. Evaluating effects of gene models on pathway prediction Vertically integrated comparison. Likit Preeyanon
  48. 48. KEGG Pathway
  49. 49. Ensembl: Enriched KEGG Pathway

      Term                                           Count   Benjamini
      Cytokine-cytokine receptor interaction         36      6.2E-02
      Lysosome                                       25      1.2E-01
      Apoptosis                                      19      3.5E-01
      Arginine and proline metabolism                12      3.1E-01
      Starch and sucrose metabolism                  9       3.4E-01
      Toll-like receptor signaling pathway           19      3.7E-01
      Natural killer cell mediated cytotoxicity      17      3.4E-01
      Cytosolic DNA-sensing pathway                  9       4.2E-01
      Valine, leucine and isoleucine degradation     11      4.1E-01
      Glutathione metabolism                         10      4.3E-01
      NOD-like receptor signaling pathway            11      4.6E-01
      Intestinal immune network for IgA production   9       5.6E-01
      VEGF signaling pathway                         14      5.6E-01
      PPAR signaling pathway                         13      6E-01
  50. 50. Gimme: Enriched KEGG Pathway

      Term                                           Count   Benjamini
      Cytokine-cytokine receptor interaction         34      3.7E-02
      Toll-like receptor signaling pathway           22      2.7E-02
      Jak-STAT signaling pathway                     28      3.4E-02
      Arginine and proline metabolism                13      4.5E-02
      Lysosome                                       22      1.3E-01
      Natural killer cell mediated cytotoxicity      17      1.6E-01
      Alanine, aspartate and glutamate metabolism    9       1.8E-01
      Amino sugar and nucleotide sugar metabolism    10      3.6E-01
      Cysteine and methionine metabolism             9       4E-01
      ECM-receptor interaction                       16      3.7E-01
      Apoptosis                                      16      3.7E-01
      Glycolysis / Gluconeogenesis                   11      4E-01
      DNA replication                                8       3.8E-01
      Cell adhesion molecules (CAMs)                 19      4.6E-01
      PPAR signaling pathway                         12      6E-01
      Intestinal immune network for IgA production   8       6.1E-01
  51. 51. Compared Enriched KEGG Pathway Terms

      Common: Cytokine-cytokine receptor interaction; Toll-like receptor signaling pathway; Lysosome; Apoptosis; Arginine and proline metabolism; Natural killer cells; Intestinal immune network for IgA production; PPAR signaling pathway
      Ensembl only: Starch and sucrose; Valine, leucine and isoleucine degradation; Glutathione metabolism; NOD-like receptor signaling pathway; VEGF signaling pathway
      Gimme only: Jak-STAT signaling pathway; Alanine, aspartate and glutamate metabolism; Amino sugar and nucleotide sugar metabolism; ECM-receptor interaction; Cell adhesion molecules (CAMs); DNA replication
  52. 52. Ensembl Common Gimme
  53. 53. INFB – we annotate UTR not present in other gene models.
  54. 54. INFB – 3' bias + missing UTR => insensitive
  55. 55. Ensembl Common Gimme
  56. 56. So, where does this leave us? • Our methods for generating hypotheses from mRNAseq data are sensitive to references & technical details of the approaches. (This is expected but Bad.) • We can build (and have built!) approaches that we believe to be more accurate for non- or semi-model organisms. (They're also open; try 'em out.) => Standards for execution, evaluation, comparison, and education.
  57. 57. khmer-protocols: • Effort to provide standard "cheap" assembly protocols for the cloud. • Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 per data set (on Amazon rental computers) • Open, versioned, forkable, citable. (Announced at Davis in December '13!) [Diagram: Read cleaning -> Diginorm -> Assembly -> Annotation -> RSEM differential expression]
  58. 58. CC0; BSD; on github; in reStructuredText.
  59. 59. Summer NGS workshop (2010-2017)
  60. 60. A few thoughts on our approach… • Explicitly a "protocol": explicit steps, copy-paste, customizable. • No requirement for computational expertise or significant computational hardware. • ~1-5 days to teach a bench biologist to use. • $100-150 of rental compute ("cloud computing")… • …for $1000 data set. • Adding in quality control and internal validation steps.
  61. 61. Can we crowdsource bioinformatics? We already are! Bioinformatics is already a tremendously open and collaborative endeavor. (Let's take advantage of it!) "It's as if somewhere, out there, is a collection of totally free software that can do a far better job than ours can, with open, published methods, great support networks and fantastic tutorials. But that's madness – who on Earth would create such an amazing resource?" http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/
  62. 62. Where is the data tidal wave taking biology!? • A world with a lot more data, and, eventually, a lot more information. • A more integrative world: genomics, molecular function, evolution, population genetics, monitoring, ??, and models that feed back into experimental design. "Data-Intensive Biology"
  63. 63. Data intensive biology & hypothesis generation • My interest in biological data is to enable better hypothesis generation.
  64. 64. Additional projects • Bacterial symbionts of bone-eating worms – w/Shana Goffredi. (ISME, 2013) • Genome of Haemonchus contortus, a parasitic nematode (with Erich Schwarz and Robin Gasser). (Genome Biology, 2013) • Soil metagenome analysis (with Jim Tiedje, Susannah Tringe, and Janet Jansson). (In review, PNAS.) • Lamprey transcriptome (with Weiming Li). (in preparation) • Ascidian genomes and transcriptomes (with Billie Swalla). (in preparation) • Loligo pealeii (the giant axon squid) – 5 transcriptomes and skim genome posted publicly (Feb 2014).
  65. 65. In progress • Cattle paratuberculosis analysis (w/Paul Coussens). • Improving the chick genome using nth-generation sequencing technology (PacBio, Moleculo), and building software and protocols to make it easy for the next 1000 genomes.
  66. 66. Moleculo data vs chick genome. [Figure: % of reads aligning vs. read length] (Luiz Irber)
  67. 67. What are the challenges ahead? • Obviously: Genotype/phenotype mapping. • But also: Conserved unknown/unannotated genes. • Data sharing, and more generally open access/data/source/science. • Data integration!
  68. 68. The problem of lopsided gene characterization is pervasive: e.g., the brain "ignorome" "...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery being associated with greater research momentum: a genomic bandwagon effect." Slide courtesy Erich Schwarz. Ref.: Pandey et al. (2014), PLoS One 9, e88889.
  69. 69. Thanks!
  70. 70. Thanks! • References and grants at http://ged.msu.edu/research.html • Software at http://github.com/ged-lab/ • Blog at http://ivory.idyll.org/blog/ • Twitter: @ctitusbrown • E-mail me: ctb@msu.edu
