2014 davis-talk

Genomics and bioinformatics in non-model organisms: where is the data tidal wave taking us?
C. Titus Brown
Assistant Professor
Microbiology; Computer Science; BEACON
Michigan State University
Feb 2014
ctb@msu.edu

Practical implications of sequencing

One graduate student;
Two transcriptomes;
Three draft genomes;
In four years.
Molgula oculata

Molgula occulta

Elijah Lowe

Ciona intestinalis

Research
Agricultural
genomics &
transcriptomics

Metagenomics
(Environmental &
host-associated)

Novel
computational
approaches

Computing
+
Biology
Education and
training

Good software
development

Capacity building

Evo-devo
genomics &
transcriptomics

Open science/
source/data/
access

Our research philosophy:
- Enable good biology by generating hypotheses worth testing.
- Try to maximize sensitivity of analyses, in light of fairly high specificity in sequencing-based approaches.
- Collaborate intensively on research projects.
- Typically, share graduate students with “wet” labs.
- Goal is to cross-train everyone involved.

Three mini-stories:
1. Building better gene models for chicken
2. Dealing with an endless stream of data
3. Evaluating the effect of gene model completeness on pathway prediction

1. Building a better chicken (gene model)
- Most extant computational tools focus on model organisms…
- Assume low polymorphism (internal variation)
- Assume quality reference genome or transcriptome
- Assume somewhat reliable functional annotation
- More significant compute infrastructure requirements
- How can we best use mRNAseq for chicken?
Likit Preeyanon

Interpreting RNAseq requires gene models:

http://www.hitseq.com/images/RNA-seq_AS.jp

Marek’s Disease project:
- To identify alternative splicing that contributes to disease resistance.
w/Hans Cheng, USDA ADOL

Inbred line 6

Inbred line 7

Types of Alternative Splicing
40%

25%

<5%, more in plants, fungi, protozoa

Keren H, Lev-Maor G & Ast G, Nat Rev Genet 2010

Data
- RNA-Seq from chicken line 6 (resistant) and line 7 (susceptible)
- Pre- and post-infection
- Single-end reads for assembly (~30 million reads x 4)
- Paired-end reads for validation (~40 million reads x 4)
- Chicken genome: galGal3
- ESTs from UCSC genome website
- mRNA from GenBank

w/Hans Cheng, USDA ADOL; Jerry Dodgson, M

Pipeline
- Global assembly, k=21-31 (Velvet 1.2.03, Oases 0.2.06)
- Local assembly, k=21-31
- Trimming and cleaning (Seqclean)
- Mapping to a genome (BLAT)
- Build all putative isoforms, together with other gene models (Gimme 0.9.0)
- Predict coding regions (ESTScan 2.1)

Local Assembly – early attempt to scale
Tophat 2.0

Velvet/Oases
Assembler

Predicting putative isoforms w/Gimme:

Source code is publicly available at https://github.com/ged-lab/gimme.git

Exon Graph approach (“Gimme”)
[Figure: exon graph built from aligned transcripts; nodes are exons (exon1, exon2, exon3, with alternative forms Exon3.a and Exon3.b) separated by introns (intron1, intron2).]

Likit Preeyanon
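
The exon-graph idea can be sketched in a few lines. This is a minimal illustration, not Gimme's actual code: it assumes (hypothetically) that each assembled transcript has already been reduced to an ordered list of exon names, that the resulting graph has no cycles, and that the function names and toy transcripts below are invented for the example. Exons become nodes, observed exon-exon junctions become edges, and every path from a first exon to a last exon is a putative isoform.

```python
from collections import defaultdict


def build_exon_graph(transcripts):
    """Build a directed graph with one edge per observed exon-exon junction."""
    graph = defaultdict(set)
    for exons in transcripts:
        for left, right in zip(exons, exons[1:]):
            graph[left].add(right)
    return graph


def enumerate_isoforms(graph):
    """List every path from an exon with no incoming edge to one with no outgoing edge.

    Assumes the graph is acyclic, which holds when exons are ordered
    by genomic position.
    """
    has_incoming = {node for targets in graph.values() for node in targets}
    sources = sorted(set(graph) - has_incoming)
    isoforms = []

    def walk(path):
        successors = sorted(graph.get(path[-1], ()))
        if not successors:
            isoforms.append(tuple(path))
        for nxt in successors:
            walk(path + [nxt])

    for source in sources:
        walk([source])
    return isoforms


# Toy input: two full-length transcripts plus one fragment (hypothetical data).
transcripts = [
    ["exon1", "exon2", "exon4"],
    ["exon1", "exon3", "exon4"],
    ["exon2", "exon3"],
]
isoforms = enumerate_isoforms(build_exon_graph(transcripts))
for isoform in isoforms:
    print(" -> ".join(isoform))
```

In this toy case the fragment contributes a junction (exon2 to exon3) that yields a putative isoform seen in no single input transcript, which is the kind of candidate that then needs validation against spliced reads.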

We recover annotated isoforms…

USP15

Both annotated isoforms are detected by our pipeline.

…and we detect unknown isoforms.

TOM1

Local assembly increases sensitivity of isoform detection.

Example of extended 3’ UTR

SLC25A3

Gene Model Summary
Method                     Genes     Transcripts
Global Assembly            14,832    32,311
Local Assembly             15,297    23,028
Global + Local Assembly    15,934    46,797
*Number of genes and transcripts might be overestimated due to incomplete assembly and spurious splice junctions.

Cross-validation with technical replicates
Does independent sequencing data confirm? Better data => confirms.
Dataset              Single-end mapped      Single-end unmapped    Paired-end mapped      Paired-end unmapped
Line 6 uninfected    18,375,966 (77.93%)    5,203,586 (22.07%)     21,598,218 (64.16%)    12,065,659 (35.84%)
Line 6 infected      17,160,695 (73.18%)    6,288,286 (26.82%)     15,274,638 (63.89%)    8,633,855 (36.11%)
Line 7 uninfected    18,130,072 (75.77%)    5,795,737 (24.22%)     20,961,033 (63.67%)    11,960,299 (36.33%)
Line 7 infected      19,912,046 (78.51%)    5,450,521 (21.49%)     22,485,833 (65.22%)    11,992,002 (34.78%)

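Cross-validation w/read splicing

95% of splice junctions have more than three spliced reads

Splice junction comparison
Total splice junctions per source:
  Assembled transcripts      104,366
  Genbank mRNA               74,065
  Expressed Sequence Tags    209,134
Venn regions (the only assignment consistent with the totals above):
  Assembled transcripts only         21,128
  Genbank mRNA only                  7,756
  Expressed Sequence Tags only       110,543
  Assembled + mRNA only              2,412
  Assembled + ESTs only              34,694
  mRNA + ESTs only                   17,765
  All three sources                  46,132
95% of splice junctions supported by > 4 reads.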
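
The seven region counts of the splice junction comparison can be matched to Venn regions by requiring that they add up to the three per-source totals. The sketch below is a brute-force check written for this transcript, not part of the original analysis; the variable and function names are illustrative. It tries every assignment of counts to regions and keeps those whose sums reproduce the totals.

```python
from itertools import permutations

# Per-source totals and the seven region counts, as given on the slide.
totals = {"assembled": 104_366, "mrna": 74_065, "est": 209_134}
region_counts = [7_756, 2_412, 21_128, 46_132, 17_765, 34_694, 110_543]

# Venn regions, named by the sources each region belongs to.
regions = [
    ("assembled",), ("mrna",), ("est",),
    ("assembled", "mrna"), ("assembled", "est"), ("mrna", "est"),
    ("assembled", "mrna", "est"),
]


def consistent_assignments(counts, totals):
    """Return every count-to-region assignment whose sums match the totals."""
    solutions = []
    for perm in permutations(counts):
        sums = dict.fromkeys(totals, 0)
        for region, count in zip(regions, perm):
            for source in region:
                sums[source] += count
        if sums == totals:
            solutions.append(dict(zip(regions, perm)))
    return solutions


solutions = consistent_assignments(region_counts, totals)
for region, count in solutions[0].items():
    print(" + ".join(region), count)
```

Exactly one assignment satisfies all three totals, so the region labels are recoverable from the numbers alone.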

Gimme pipeline
- Our pipeline can detect many isoforms
- Local assembly enhances isoform detection
- Cufflinks (mapping-based gene models) is not superior to de novo transcriptome assembly in chicken… (Was Cufflinks trained on mouse/human?)
- The pipeline can be used to build gene models for other organisms
- Pipeline can do incremental combining of new data sets

How to detect differential splicing
[Figure: per-exon counts for two samples; panel labels: Spliced reads, Read coverage.]

Exon Region Comparison
[Figure: per-exon-region counts for two samples; panel label: Read coverage.]

BRCA1 domain

Alternative 3’ UTR

DNA repair, apoptosis, DNA replication, genome stability

Differential Exon Usage Summary
Number of exons called differentially used (True) or not (False):
Adjusted p-value    False     True
0.1                 18,631    66
0.01                18,656    41
0.001               18,663    34

Chromosome 1
Total 3,728 genes

Next steps: scaling analysis to entire genome.
And… interpretation (??)

Gene model thoughts - Can build gene models that represent the data

we have fairly well;
 Robust exon-exon splice site reporting;

 Planning ahead for multiple iterations of new

data;
 …interpretation of results? See story 3.

2. Endless data!
- It is now under $1000 to generate a new mRNAseq data set.
- Collaborators routinely generate new data sets every 3-6 months… (note: each of them, x 5-10…)
- How can we make use of this data iteratively!?

Making iterative use of new data.

Data!

Refined gene models

Existing gene models

Differential expression

??

Some data will yield new gene models, but much will be redundant (e.g. “housekeeping” genes)

Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
- Is single pass: looks at each read only once;
- Does not “collect” the majority of sequencing errors;
- Keeps all low-coverage reads;
- Enables analyses that are otherwise completely impossible;
Integrated into several assemblers (Trinity and
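
The properties listed above can be illustrated with a short sketch. This is a simplified illustration, not the khmer implementation: it stores exact k-mer counts in a Python `Counter` (an assumption made for readability; a memory-efficient approximate counting structure would be needed at real data sizes), and the function names and the default `k` and `cutoff` values are illustrative. Each read is examined once; its coverage is estimated as the median count of its k-mers seen so far, and the read is kept only if that estimate is still below the cutoff.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Yield reads whose estimated coverage is still below the cutoff.

    Single pass: each read is looked at once, and only kept reads
    contribute their k-mers to the running counts.
    """
    counts = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        estimated_coverage = median(counts[km] for km in read_kmers)
        if estimated_coverage < cutoff:
            counts.update(read_kmers)
            yield read


# Toy data: one highly redundant read and one low-coverage read.
reads = ["ACGTTGCAAG"] * 10 + ["TTTTGGGGCC"]
kept = list(digital_normalization(reads, k=4, cutoff=3))
print(len(reads), "reads in,", len(kept), "reads kept")
```

Because discarded reads never update the counts, k-mers that occur only in erroneous reads from already-saturated regions are not accumulated, while any read from a low-coverage region is always retained.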

Evaluating on ascidians (sea squirts):
Molgula oculata

Molgula oculata

Molgula occulta

Ciona intestinalis

Diginorm applied to Molgula embryonic mRNAseq – set aside ~90% of data
Sample              No. of reads    Reads kept    Percentage kept
M. occulta F+3      42,174,510      15,642,268    ?
M. occulta F+3      50,018,302      6,012,894     ?
M. occulta F+4      44,948,983      3,499,935     ?
M. occulta F+5      53,692,296      2,993,715     ?
M. occulta F+6      45,782,981      2,774,342     ?
M. occulta Total    236,617,072     30,923,154    13%
M. oculata F+3      47,045,433      10,754,899    ?
M. oculata F+4      52,890,938      3,949,489     ?
M. oculata F+6      50,156,895      2,874,196     ?
M. oculata Total    150,093,266     17,578,584    11.70%
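
The percentages left as “?” on the slide can be recomputed from the two count columns. The snippet below is a small worked calculation using only the read counts given above; the sample labels are copied as they appear on the slide (including the repeated F+3 label), and the variable names are illustrative.

```python
# (species, stage label as on the slide, number of reads, reads kept)
rows = [
    ("M. occulta", "F+3", 42_174_510, 15_642_268),
    ("M. occulta", "F+3", 50_018_302, 6_012_894),
    ("M. occulta", "F+4", 44_948_983, 3_499_935),
    ("M. occulta", "F+5", 53_692_296, 2_993_715),
    ("M. occulta", "F+6", 45_782_981, 2_774_342),
    ("M. oculata", "F+3", 47_045_433, 10_754_899),
    ("M. oculata", "F+4", 52_890_938, 3_949_489),
    ("M. oculata", "F+6", 50_156_895, 2_874_196),
]


def percent_kept(n_reads, n_kept):
    """Percentage of reads retained after digital normalization."""
    return 100.0 * n_kept / n_reads


# Per-sample percentages.
for species, stage, n_reads, n_kept in rows:
    print(f"{species} {stage}: {percent_kept(n_reads, n_kept):.2f}% kept")

# Per-species totals, accumulated from the individual samples.
totals = {}
for species, _, n_reads, n_kept in rows:
    reads_so_far, kept_so_far = totals.get(species, (0, 0))
    totals[species] = (reads_so_far + n_reads, kept_so_far + n_kept)

total_percent = {sp: percent_kept(n, k) for sp, (n, k) in totals.items()}
for species, pct in total_percent.items():
    print(f"{species} total: {pct:.2f}% kept")
```

The per-species sums reproduce the totals shown in the table.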

But: does diginorm “lose” transcript information? No.
[Figure: diginorm vs. raw assemblies, each compared against C. intestinalis.]
M. occulta (Diginorm, Raw): region counts 37, 13,623 and 17; missing 2,446
M. oculata (Diginorm, Raw): region counts 64, 13,646 and 15; missing 2,398

Reciprocal best hit vs. Ciona
BLAST e-value cutoff: 1e-6
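
A reciprocal-best-hit comparison of this kind can be sketched as follows. This is a generic illustration, not the script used for the slide: it assumes the BLAST results have already been parsed into `(query, subject, evalue, bitscore)` tuples, picks the best hit by bit score (one common convention), and uses invented identifiers for the toy data.

```python
def best_hits(hits, evalue_cutoff=1e-6):
    """Map each query to its highest-scoring subject among hits passing the cutoff."""
    best = {}
    for query, subject, evalue, bitscore in hits:
        if evalue > evalue_cutoff:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, evalue_cutoff=1e-6):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b, evalue_cutoff)
    best_ba = best_hits(b_vs_a, evalue_cutoff)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy hits (hypothetical identifiers): transcripts vs. Ciona genes and back.
transcripts_vs_ciona = [
    ("t1", "ci1", 1e-50, 200.0),
    ("t1", "ci2", 1e-10, 60.0),
    ("t2", "ci2", 1e-30, 120.0),
    ("t3", "ci3", 1e-3, 20.0),   # fails the e-value cutoff
]
ciona_vs_transcripts = [
    ("ci1", "t1", 1e-48, 195.0),
    ("ci2", "t1", 1e-12, 65.0),  # ci2's best hit is t1, not t2
    ("ci3", "t3", 1e-3, 20.0),
]
pairs = reciprocal_best_hits(transcripts_vs_ciona, ciona_vs_transcripts)
print(pairs)
```

Counting how many reference genes end up in a reciprocal pair for each assembly gives the kind of overlap and “missing” numbers shown in the comparison above.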

Elijah Lowe

Where are we taking diginorm?
- Streaming online algorithms only look at data ~once.
- Diginorm is streaming, online…
- Conceptually, can move many aspects of sequence analysis into streaming mode.
=> Extraordinary potential for computational efficiency.

=> Streaming, online variant calling.

Single pass, reference free, tunable, streaming online variant calling.
Potentially quite clinically useful.

See NIH BIG DATA grant, http://ged.msu.edu/

Prospective: sequencing tumor cells
- Goal: phylogenetically reconstruct causal “driver mutations” in face of passenger mutations.
- 1000 cells x 3 Gbp x 20x coverage: 60 Tbp of sequence.
- Most of this data will be redundant and not useful.
- Developing diginorm-based algorithms to eliminate data while retaining variant information.

See NIH BIG DATA grant, http://ged.msu.edu/

3. Evaluating effects of gene models
on pathway prediction

Vertically integrated comparison.

Likit Preeyanon

Ensembl Enriched KEGG Pathway
Term                                            Count    Benjamini
Cytokine-cytokine receptor interaction          36       6.2E-02
Lysosome                                        25       1.2E-01
Apoptosis                                       19       3.5E-01
Arginine and proline metabolism                 12       3.1E-01
Starch and sucrose metabolism                   9        3.4E-01
Toll-like receptor signaling pathway            19       3.7E-01
Natural killer cell mediated cytotoxicity       17       3.4E-01
Cytosolic DNA-sensing pathway                   9        4.2E-01
Valine, leucine and isoleucine degradation      11       4.1E-01
Glutathione metabolism                          10       4.3E-01
NOD-like receptor signaling pathway             11       4.6E-01
Intestinal immune network for IgA production    9        5.6E-01
VEGF signaling pathway                          14       5.6E-01
PPAR signaling pathway                          13       6E-01
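
The “Benjamini” column presumably reports p-values adjusted for multiple testing in the Benjamini-Hochberg style. As a hedged illustration of what such an adjustment does, the sketch below implements the standard Benjamini-Hochberg procedure on a list of raw p-values; it is not the enrichment tool's own code, the example p-values are made up, and the tool's exact calculation may differ in detail.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for four pathways.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
for p, q in zip(raw, adjusted):
    print(f"raw={p:.3f} adjusted={q:.3f}")
```

Each raw p-value is scaled by the number of tests divided by its rank, so with only a handful of genes per pathway and many pathways tested, adjusted values well above 0.05 (as in most rows of the table) are unsurprising.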

Gimme Enriched KEGG Pathway
Term                                           Count    Benjamini
[term missing]                                 34       3.7E-02
[term missing]                                 22       2.7E-02
Jak-STAT signaling pathway                     28       3.4E-02
[term missing]                                 13       4.5E-02
Lysosome                                       22       1.3E-01
Natural killer cell mediated cytotoxicity      17       1.6E-01
Alanine, aspartate and glutamate metabolism    9        1.8E-01
Amino sugar and nucleotide sugar metabolism    10       3.6E-01
Cysteine and methionine metabolism             9        4E-01
ECM-receptor interaction                       16       3.7E-01
Apoptosis                                      16       3.7E-01
Glycolysis / Gluconeogenesis                   11       4E-01
DNA replication                                8        3.8E-01
Cell adhesion molecules (CAMs)                 19       4.6E-01
[term missing]                                 12       6E-01
[term missing]                                 8        6.1E-01

Compared Enriched KEGG Pathway
Gene models    Term
Common         Lysosome
               Apoptosis
               Natural killer cells
Ensembl        Starch and sucrose
               Valine, leucine and isoleucine degradation
               Glutathione metabolism
               NOD-like receptor signaling pathway
               VEGF signaling pathway
Gimme          Jak-STAT signaling pathway
               Alanine, aspartate and glutamate metabolism
               Amino sugar and nucleotide sugar metabolism
               ECM-receptor interaction
               Cell adhesion molecules (CAMs)
               DNA replication

INFB – we annotate UTR not present in other gene models.

INFB – 3’ bias + missing UTR => insensitive

So, where does this leave us?
- Our methods for generating hypotheses from mRNAseq data are sensitive to references & technical details of the approaches. (This is expected but Bad.)
- We can build (and have built!) approaches that we believe to be more accurate for non- or semi-model organisms. (They’re also open; try ’em out.)
=> Standards for execution, evaluation, comparison, and education.

khmer-protocols:
- Effort to provide standard “cheap” assembly protocols for the cloud.
- Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 per data set (on Amazon rental computers).
- Open, versioned, forkable, citable.
(Announced at Davis in December ’13!)

Protocol stages: Read cleaning -> Diginorm -> Assembly -> Annotation -> RSEM differential expression

CC0; BSD; on github; in reStructuredText.

Summer NGS workshop (2010-2017)

A few thoughts on our approach…
- Explicitly a “protocol” – explicit steps, copy-paste, customizable.
- No requirement for computational expertise or significant computational hardware.
- ~1-5 days to teach a bench biologist to use.
- $100-150 of rental compute (“cloud computing”)…
- …for $1000 data set.
- Adding in quality control and internal validation steps.

Can we crowdsource bioinformatics?
We already are! Bioinformatics is already a tremendously open and collaborative endeavor. (Let’s take advantage of it!)
“It’s as if somewhere, out there, is a collection of totally free software that can do a far better job than ours can, with open, published methods, great support networks and fantastic tutorials. But that’s madness – who on Earth would create such an amazing resource?”

http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/

Where is the data tidal wave taking biology!?
- A world with a lot more data, and, eventually, a lot more information.
- A more integrative world: genomics, molecular function, evolution, population genetics, monitoring, ??, and models that feed back into experimental design.
“Data-Intensive Biology”

Data intensive biology & hypothesis generation
- My interest in biological data is to enable better hypothesis generation.

Additional projects - Bacterial symbionts of bone eating worms – w/Shana Goffredi.

(ISME, 2013)
 Genome of Haemonchus contortus, a parasitic nematode (with

Erich Schwarz and Robin Gasser). (Genome Biology, 2013)
 Soil metagenome analysis (with Jim Tiedje, Susannah Tringe,

and Janet Jansson). (In review, PNAS.)
 Lamprey transcriptome (with Weiming Li). (in preparation).
 Ascidian genomes and transcriptomes (with Billie Swalla). (in

preparation)
 Loligo pealeii (the giant axon squid) – 5 transcriptomes and skim

genome posted publicly (Feb 2014).

In progress
- Cattle paratuberculosis analysis (w/Paul Coussens).
- Improving the chick genome using nth-generation sequencing technology (PacBio, Moleculo).
and building software and protocols to make it easy for the next 1000 genomes.

% of reads aligning

Moleculo data vs chick genome.

Luiz Irber

Read length

What are the challenges ahead?
- Obviously: Genotype/phenotype mapping.
- But also: Conserved unknown/unannotated genes.
- Data sharing, and more generally open access/data/source/science.
- Data integration!

The problem of lopsided gene characterization is
pervasive: e.g., the brain "ignorome"

"...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression
networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains.
The major distinguishing characteristic between these sets of genes is date of discovery, early
discovery being associated with greater research momentum—a genomic bandwagon effect."

Slide courtesy Erich Schwarz

Ref.: Pandey et al. (2014), PLoS One 9, e88889.

Thanks!
- References and grants at http://ged.msu.edu/research.html
- Software at http://github.com/ged-lab/
- Blog at http://ivory.idyll.org/blog/
- Twitter: @ctitusbrown

E-mail me: ctb@msu.edu
