1. Genomics and bioinformatics in non-model
organisms: where is the
data tidal wave taking us?
C. Titus Brown
Assistant Professor
Microbiology; Computer Science; BEACON
Michigan State University
Feb 2014
ctb@msu.edu
2. Practical implications of sequencing -Molgula oculata
One graduate student;
Two transcriptomes;
Three draft genomes;
In four years.
Molgula oculata
Molgula occulta
Elijah Lowe
Ciona intestinalis
5. Our research philosophy:
Enable good biology by generating hypotheses
worth testing.
Try to maximize sensitivity of analyses, in light of
fairly high specificity in sequencing based
approaches.
Collaborate intensively on research projects.
Typically, share graduate students with "wet" labs.
Goal is to cross-train everyone involved.
6. Three mini-stories:
1. Building better gene models for chicken
2. Dealing with an endless stream of data
3. Evaluating the effect of gene model
completeness on pathway prediction.
7. 1. Building a better chicken (gene
model)
Most extant computational tools focus on model
organisms.
Assume low polymorphism (internal variation)
Assume quality reference genome or transcriptome
Assume somewhat reliable functional annotation
More significant compute infrastructure
requirements
Likit Preeyanon
How can we best use mRNAseq for chicken?
9. Marek's Disease project:
To identify alternative splicing that contributes to
disease resistance.
w/Hans Cheng, USDA ADOL
Inbred line 6
Inbred line 7
10. Types of Alternative Splicing
40%
25%
<5%, more in plants, fungi, protozoa
Keren H, Lev-Maor G & Ast G, Nat Rev Genet 2010
11. Data
RNA-Seq from chicken line 6 (resistant) and 7
(susceptible)
Pre and post infection
Single-end reads for assembly (~30 million reads x 4)
Paired-end reads for validation (~40 million reads x 4)
Chicken genome: galGal3
ESTs from UCSC genome website
mRNA from Genbank
w/Hans Cheng, USDA ADOL; Jerry Dodgson, M
20. Gene Model Summary
Method                     Genes     Transcripts
Global Assembly            14,832    32,311
Local Assembly             15,297    23,028
Global + Local Assembly    15,934    46,797
*Numbers of genes and transcripts might be overestimated due to incomplete assembly
and spurious splice junctions.
21. Cross-validation with technical
replicates
Later,
Does independent sequencing data confirm? better data => confirms
Dataset              Single-end mapped      Single-end unmapped    Paired-end mapped      Paired-end unmapped
Line 6 uninfected    18,375,966 (77.93%)    5,203,586 (22.07%)     21,598,218 (64.16%)    12,065,659 (35.84%)
Line 6 infected      17,160,695 (73.18%)    6,288,286 (26.82%)     15,274,638 (63.89%)    8,633,855 (36.11%)
Line 7 uninfected    18,130,072 (75.77%)    5,795,737 (24.22%)     20,961,033 (63.67%)    11,960,299 (36.33%)
Line 7 infected      19,912,046 (78.51%)    5,450,521 (21.49%)     22,485,833 (65.22%)    11,992,002 (34.78%)
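The mapped percentages above can be recomputed from the raw read counts. A minimal Python sketch, using the Line 6 uninfected counts from the slide; the function name is illustrative and not part of the pipeline:

```python
def mapping_rate(mapped: int, unmapped: int) -> float:
    """Return the percentage of reads that mapped to the gene models."""
    total = mapped + unmapped
    if total == 0:
        raise ValueError("no reads to summarize")
    return 100.0 * mapped / total


# Line 6 uninfected read counts, taken from the slide.
single_end_rate = mapping_rate(18_375_966, 5_203_586)
paired_end_rate = mapping_rate(21_598_218, 12_065_659)

print(f"single-end: {single_end_rate:.2f}%")
print(f"paired-end: {paired_end_rate:.2f}%")
```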
24. Gimme pipeline
Our pipeline can detect many isoforms
Local assembly enhances isoform detection
Cufflinks (mapping-based gene models) is not
superior to de novo transcriptome assembly in
chicken…
(Was Cufflinks trained on mouse/human?)
The pipeline can be used to build gene models
for other organisms
Pipeline can do incremental combining of new
data sets
30. Differential Exon Usage
Summary
Number of exons, by adjusted p-value threshold:
Adjusted p-value    False     True
0.1                 18,631    66
0.01                18,656    41
0.001               18,663    34
Chromosome 1
Total 3,728 genes
Next steps: scaling analysis to entire genome.
And… interpretation (??)
31. Gene model thoughts
Can build gene models that represent the data
we have fairly well;
Robust exon-exon splice site reporting;
Planning ahead for multiple iterations of new
data;
…interpretation of results? See story 3.
32. 2. Endless data!
It is now under $1000 to generate a new
mRNAseq data set.
Collaborators routinely generate new data sets
every 3-6 months… (note: each of them, x 5-10…)
How can we make use of this data iteratively!?
33. Making iterative use of new data.
[Diagram labels: Data!; Existing gene models; Refined gene models; Differential expression; ??]
Some data will yield
new gene models, but
much will be redundant
(e.g. "housekeeping"
genes)
40. Digital normalization approach
A digital analog to cDNA library normalization,
diginorm:
Is single pass: looks at each read only once;
Does not "collect" the majority of sequencing
errors;
Keeps all low-coverage reads;
Enables analyses that are otherwise completely
impossible;
Integrated into several assemblers (Trinity and
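The single-pass behavior listed on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the khmer implementation. It assumes the median k-mer count rule from the published diginorm description, uses an exact dictionary where khmer uses a memory-efficient probabilistic counter, and ignores reverse complements. The function names and the k and cutoff values are illustrative.

```python
from statistics import median


def kmers(seq, k):
    """Return every overlapping k-mer in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Yield reads whose estimated coverage is still below cutoff.

    Each read is examined exactly once. Coverage is estimated as the
    median count of the read's k-mers among the reads kept so far.
    """
    counts = {}  # assumption: exact counts; khmer uses a probabilistic table
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        if median(counts.get(km, 0) for km in read_kmers) < cutoff:
            # Low coverage so far: keep the read and record its k-mers.
            for km in read_kmers:
                counts[km] = counts.get(km, 0) + 1
            yield read
        # Otherwise the read is redundant and is dropped without counting.


# Toy data: ten copies of one read plus a single low-coverage read.
reads = ["ACGTTGCAAG"] * 10 + ["GGATCCTTAA"]
kept = list(digital_normalization(reads, k=4, cutoff=3))
print(len(kept), "of", len(reads), "reads kept")
```

In this toy run the repeated read is kept only until its median k-mer count reaches the cutoff, while the single low-coverage read is always kept.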
43. But: does diginorm “lose” transcript
information? No.
[Venn diagrams: diginorm vs. raw assemblies, each compared against C. intestinalis]
M. occulta (diginorm/raw regions): 37; 13,623; 17; missing 2,446
M. oculata (diginorm/raw regions): 64; 13,646; 15; missing 2,398
Reciprocal best hit vs. Ciona
BLAST e-value cutoff: 1e-6
Elijah Lowe
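The reciprocal best hit comparison can be sketched as follows. This is a hedged illustration, not the analysis code behind the slide. It assumes BLAST hits have already been parsed into (query, subject, e-value, bit score) tuples and that "best" means the highest bit score among hits passing the 1e-6 e-value cutoff. The sequence IDs are made up.

```python
def best_hits(hits, max_evalue=1e-6):
    """Map each query to its highest-scoring subject passing the cutoff."""
    best = {}
    for query, subject, evalue, bitscore in hits:
        if evalue > max_evalue:
            continue  # hit is too weak to consider
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse, max_evalue=1e-6):
    """Return (query, subject) pairs that are each other's best hit."""
    fwd = best_hits(forward, max_evalue)
    rev = best_hits(reverse, max_evalue)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


# Hypothetical hits: assembled transcripts vs. Ciona, and Ciona vs. transcripts.
forward = [
    ("molg1", "ciona1", 1e-50, 200.0),
    ("molg1", "ciona2", 1e-10, 60.0),
    ("molg2", "ciona2", 1e-30, 150.0),
    ("molg3", "ciona3", 1e-3, 30.0),   # fails the e-value cutoff
]
reverse = [
    ("ciona1", "molg1", 1e-48, 195.0),
    ("ciona2", "molg1", 1e-9, 58.0),
    ("ciona3", "molg3", 1e-4, 31.0),   # fails the e-value cutoff
]

rbh = reciprocal_best_hits(forward, reverse)
print(rbh)
```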
44. Where are we taking diginorm?
Streaming online algorithms only look at data
~once.
Diginorm is streaming, online…
Conceptually, can move many aspects of
sequence analysis into streaming mode.
=> Extraordinary potential for computational
efficiency.
45. => Streaming, online variant
calling.
Single pass, reference free, tunable, streaming online variant calling.
Potentially quite clinically useful.
See NIH BIG DATA grant, http://ged.msu.edu/
46. Prospective: sequencing tumor cells
Goal: phylogenetically reconstruct causal "driver
mutations" in face of passenger mutations.
1000 cells x 3 Gbp x 20 coverage: 60 Tbp of
sequence.
Most of this data will be redundant and not useful.
Developing diginorm-based algorithms to
eliminate data while retaining variant information.
See NIH BIG DATA grant, http://ged.msu.edu/
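The 60 Tbp figure on this slide follows directly from the three numbers given. A short Python check, with variable names chosen for illustration:

```python
cells = 1000                 # tumor cells to sequence
genome_bp = 3_000_000_000    # ~3 Gbp per genome
coverage = 20                # sequencing depth per cell

total_bp = cells * genome_bp * coverage
total_tbp = total_bp / 1e12  # 1 Tbp = 10^12 bp

print(f"{total_tbp:.0f} Tbp of sequence")
```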
47. 3. Evaluating effects of gene models
on pathway prediction
Vertically integrated comparison.
Likit Preeyanon
56. So, where does this leave us?
Our methods for generating hypotheses from
mRNAseq data are sensitive to references &
technical details of the approaches.
(This is expected but Bad.)
We can build (and have built!) approaches that
we believe to be more accurate for non- or semi-model organisms.
(They're also open; try 'em out.)
=> Standards for execution, evaluation,
comparison, and education.
57. khmer-protocols:
Effort to provide standard "cheap"
assembly protocols for the cloud.
Entirely copy/paste; ~2-6 days from
raw reads to assembly,
annotations, and differential
expression analysis. ~$150 per
data set (on Amazon rental
computers).
Open, versioned, forkable, citable.
(Announced at Davis in December '13!)
[Diagram labels: Read cleaning; Diginorm; Assembly; Annotation; RSEM differential expression]
60. A few thoughts on our
approach…
Explicitly a "protocol" – explicit steps, copy-paste,
customizable.
No requirement for computational expertise or
significant computational hardware.
~1-5 days to teach a bench biologist to use.
$100-150 of rental compute ("cloud computing")…
…for $1000 data set.
Adding in quality control and internal validation
steps.
61. Can we crowdsource bioinformatics?
We already are! Bioinformatics is already a
tremendously open and collaborative endeavor. (Let's
take advantage of it!)
"It's as if somewhere, out there, is a collection of totally
free software that can do a far better job than ours can,
with open, published methods, great support networks
and fantastic tutorials. But that's madness – who on
Earth would create such an amazing resource?"
http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/
62. Where is the data tidal wave taking
biology!?
A world with a lot more data, and, eventually, a lot
more information.
A more integrative world: genomics, molecular
function, evolution, population genetics,
monitoring, ??, and models that feed back into
experimental design.
"Data-Intensive Biology"
63. Data intensive biology & hypothesis
generation
My interest in biological data is to enable better
hypothesis generation.
64. Additional projects
Bacterial symbionts of bone-eating worms – w/Shana Goffredi.
(ISME, 2013)
Genome of Haemonchus contortus, a parasitic nematode (with
Erich Schwarz and Robin Gasser). (Genome Biology, 2013)
Soil metagenome analysis (with Jim Tiedje, Susannah Tringe,
and Janet Jansson). (In review, PNAS.)
Lamprey transcriptome (with Weiming Li). (in preparation).
Ascidian genomes and transcriptomes (with Billie Swalla). (in
preparation)
Loligo pealeii (the giant axon squid) – 5 transcriptomes and skim
genome posted publicly (Feb 2014).
65. In progress
Cattle paratuberculosis analysis (w/Paul
Coussens).
Improving the chick genome using nth-generation
sequencing technology (PacBio, Moleculo).
and building software and protocols to make it
easy for the next 1000 genomes.
66. Moleculo data vs chick genome.
[Plot: % of reads aligning vs. read length]
Luiz Irber
67. What are the challenges ahead?
Obviously: Genotype/phenotype mapping.
But also: Conserved unknown/unannotated
genes.
Data sharing, and more generally open
access/data/source/science.
Data integration!
68. The problem of lopsided gene characterization is
pervasive: e.g., the brain "ignorome"
"...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression
networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains.
The major distinguishing characteristic between these sets of genes is date of discovery, early
discovery being associated with greater research momentum—a genomic bandwagon effect."
Slide courtesy Erich Schwarz
Ref.: Pandey et al. (2014), PLoS One 9, e88889.
70. Thanks!
References and grants at
http://ged.msu.edu/research.html
Software at http://github.com/ged-lab/
Blog at http://ivory.idyll.org/blog/
Twitter: @ctitusbrown
E-mail me: ctb@msu.edu
Editor's Notes
For the first project, we are interested in finding alternative isoforms that are differentially expressed in chicken lines 6 and 7, which are resistant and susceptible to Marek's disease, respectively. Both line 6 and line 7 can be infected by Marek's disease virus, but only line 7 develops T-cell lymphoma. Studies have shown that alternative splicing can increase susceptibility to some diseases in humans, so we hypothesize that it might play the same role in Marek's disease.
In this study we used single-end reads from line 6 and line 7, before and after infection, to build gene models, and used paired-end reads from the same samples for validation. We also used ESTs and mRNA from GenBank to validate the gene models.
Then we assembled the short reads to obtain longer contigs. We used two assembly methods, global and local assembly, to increase the sensitivity of isoform detection. We also assembled with multiple k-mer (hash) lengths to obtain transcripts with different expression levels. We then removed low-complexity sequences and trimmed off poly-A tails. Then we mapped all contigs to the genome using BLAT. The BLAT alignments were then used to predict all putative isoforms, which is done by a program called Gimme that I developed. Finally, the coding region of each isoform is predicted by ESTScan.
In the pipeline we used two assembly methods, global and local assembly. In local assembly, only reads that map to the genome are assembled; in global assembly, all reads are assembled. For local assembly, we used a program that can map both spliced and unspliced reads to the genome, for example TopHat. Then we extracted the reads mapped to each chromosome and assembled each set separately using Velvet and Oases.
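The read-extraction step for local assembly can be sketched as below. This is an illustration under stated assumptions, not the pipeline's own code. It assumes the alignments are available as SAM-format text lines and groups mapped reads by reference name so that each group could be handed to an assembler separately. The function name and the toy records are hypothetical.

```python
from collections import defaultdict


def reads_by_chromosome(sam_lines):
    """Group mapped reads from SAM text lines by reference sequence name."""
    groups = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        name, flag, chrom, seq = fields[0], int(fields[1]), fields[2], fields[9]
        if flag & 0x4 or chrom == "*":
            continue  # unmapped read: used only in global assembly
        groups[chrom].append((name, seq))
    return dict(groups)


# Toy SAM records (made up): two reads on chr1, one spliced read on chr2,
# and one unmapped read.
sam_lines = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t16\tchr2\t200\t50\t4M100N6M\t*\t0\t0\tTTGCAAGGCC\tIIIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGCCCCAA\tIIIIIIIIII",
    "r4\t0\tchr1\t300\t50\t10M\t*\t0\t0\tCCATGGATCC\tIIIIIIIIII",
]

groups = reads_by_chromosome(sam_lines)
for chrom, reads in sorted(groups.items()):
    print(chrom, len(reads), "reads")
```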
This figure shows sequences from the assembly aligned to the chicken genome. Oftentimes we do not get a complete transcript from assembly, so I developed Gimme, a program that assembles transcripts based on sequence alignment. It merges all incomplete transcripts from the assembly and predicts the structure of the gene model with all possible isoforms. The program works with all kinds of sequences, including expressed sequence tags and mRNAs. Therefore, we can also incorporate data from other sources to build gene models.
This is an example of completely annotated gene models compared with gene models from our pipeline. Our gene models include both isoforms as well as the correct coding region.
And this figure shows extra isoforms that are detected only by local assembly. The highlighted exon is not found in the global assembly but is annotated in the reference sequence. This means that global assembly is missing a real exon, which is found only by local assembly.
Gene models from RNA-Seq can be used to improve existing gene models. For example, we can extend untranslated regions, which are not well annotated and are difficult to predict from a genome sequence.
From our gene models, the total number of genes is about 15,000, with 47,000 transcripts; however, these numbers are overestimated due to incomplete assembly.
The easiest way to validate gene models is to map the same set of reads back to the gene models. We found that up to 78% of single-end reads map to the gene models. This number is high for RNA-Seq data and indicates that the gene models are high-quality. Also, up to 65% of paired-end reads from the same samples map to the gene models. Paired-end mapping is more stringent, so this number helps confirm the good quality of the gene models.
To validate splice junctions, we compared splice junctions found in our models to ESTs and mRNA. ~80% of splice junctions are supported by GenBank mRNA or ESTs or both, which indicates that these splice junctions are real. The 21,000 splice junctions that are not supported by mRNA or ESTs may include some novel splice junctions.
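The junction comparison can be sketched with set operations. This is a simplified illustration, not the validation code used for the talk. It assumes each junction is represented as a (chromosome, donor position, acceptor position) tuple and that support requires an exact coordinate match. The coordinates are invented.

```python
def junction_support(model_junctions, est_junctions, mrna_junctions):
    """Split model junctions into supported and unsupported sets.

    A junction counts as supported if it appears in the EST set,
    the mRNA set, or both.
    """
    model = set(model_junctions)
    supported = model & (set(est_junctions) | set(mrna_junctions))
    unsupported = model - supported
    fraction = len(supported) / len(model) if model else 0.0
    return supported, unsupported, fraction


# Toy junctions: (chromosome, donor position, acceptor position).
model = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 90),
         ("chr2", 120, 180), ("chr3", 10, 60)]
est = [("chr1", 100, 200), ("chr2", 50, 90)]
mrna = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 120, 180)]

supported, unsupported, fraction = junction_support(model, est, mrna)
print(f"{fraction:.0%} of junctions supported; "
      f"{len(unsupported)} candidate novel junction(s)")
```

Unsupported junctions are only candidates: as the notes say, they may include novel junctions, but they can also be assembly artifacts.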
To summarize, our method can detect many known and unknown isoforms from RNA-Seq data, and the local assembly technique increases the sensitivity of isoform detection. Cufflinks is not better than de novo assembly in chicken. And the pipeline should work with RNA-Seq data from other organisms.
The green model is from single-end reads. The skipped exon is not included in the gene models but is detected by DEXSeq.
6x more. What do we do?
Since I work with multiple people, I really notice.
Note general problem with bioinfo.
Translation initiation factor
Lure them in with bioinformatics and then show them that Michigan, in the summertime, is quite nice!