Genomics and bioinformatics in non-model
organisms: where is the
data tidal wave taking us?
C. Titus Brown
Assistant Professor
Microbiology; Computer Science; BEACON
Michigan State University
Feb 2014
ctb@msu.edu
Practical implications of sequencing
One graduate student;
Two transcriptomes;
Three draft genomes;
In four years.
[Images: Molgula oculata; Molgula occulta; Ciona intestinalis]
Elijah Lowe
Research
Agricultural genomics & transcriptomics
Metagenomics (environmental & host-associated)
Novel computational approaches
Computing + Biology
Education and training
Good software development
Capacity building
Evo-devo genomics & transcriptomics
Open science/source/data/access
Our research philosophy:
- Enable good biology by generating hypotheses worth testing.
- Try to maximize sensitivity of analyses, in light of fairly high specificity in sequencing-based approaches.
- Collaborate intensively on research projects.
- Typically, share graduate students with "wet" labs.
- Goal is to cross-train everyone involved.
Three mini-stories:
1. Building better gene models for chicken
2. Dealing with an endless stream of data
3. Evaluating the effect of gene model completeness on pathway prediction.
1. Building a better chicken (gene model)
- Most extant computational tools focus on model organisms:
  - Assume low polymorphism (internal variation)
  - Assume quality reference genome or transcriptome
  - Assume somewhat reliable functional annotation
  - More significant compute infrastructure requirements
- How can we best use mRNAseq for chicken?
Likit Preeyanon
Interpreting RNAseq requires gene models:

http://www.hitseq.com/images/RNA-seq_AS.jp
Marek's Disease project:
- To identify alternative splicing that contributes to disease resistance.
w/Hans Cheng, USDA ADOL
[Images: Inbred line 6; Inbred line 7]
Types of Alternative Splicing
[Figure: types of alternative splicing with approximate frequencies: 40%; 25%; <5% (more in plants, fungi, protozoa)]
Keren H, Lev-Maor G & Ast G, Nat Rev Genet 2010
Data
- RNA-Seq from chicken line 6 (resistant) and 7 (susceptible)
  - Pre and post infection
  - Single-end reads for assembly (~30 million reads x 4)
  - Paired-end reads for validation (~40 million reads x 4)
- Chicken genome: galGal3
- ESTs from UCSC genome website
- mRNA from Genbank

w/Hans Cheng, USDA ADOL; Jerry Dodgson, M
Pipeline
1. Global assembly (k=21-31) and local assembly (k=21-31): Velvet 1.2.03, Oases 0.2.06
2. Trimming and cleaning: Seqclean
3. Mapping to a genome: BLAT
4. Build all putative isoforms (optionally with other gene models): Gimme 0.9.0
5. Predict coding regions: ESTScan 2.1
Local Assembly – early attempt to scale
Tophat 2.0 -> Velvet/Oases assembler
Predicting putative isoforms w/Gimme:

Source code is publicly available at https://github.com/ged-lab/gimme.git
Exon Graph approach ("Gimme")
[Figure: exon graph built from aligned transcripts; labels: exon1, exon2, exon3, Exon3.a, Exon3.b, intron1, intron2]
https://github.com/ged-lab/gimme.git
Likit Preeyanon
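As a rough illustration of the exon graph idea (not Gimme's actual implementation; see the repository for that), the sketch below treats exons as nodes and observed exon-to-exon adjacencies as directed edges, then enumerates every path from a start exon to an end exon as a putative isoform. The exon names, the partial transcripts, and both function names are hypothetical.

```python
from collections import defaultdict


def build_exon_graph(transcripts):
    """Merge partial transcripts (ordered exon lists) into a directed graph."""
    graph = defaultdict(set)
    for exons in transcripts:
        for upstream, downstream in zip(exons, exons[1:]):
            graph[upstream].add(downstream)
    return graph


def enumerate_isoforms(graph, starts, ends):
    """Return every start-to-end path; each path is one putative isoform."""
    isoforms = []

    def walk(path):
        node = path[-1]
        if node in ends:
            isoforms.append(tuple(path))
        for nxt in sorted(graph.get(node, ())):
            if nxt not in path:  # guard against cycles from bad alignments
                walk(path + [nxt])

    for start in sorted(starts):
        walk([start])
    return isoforms


# Hypothetical partial assemblies of one gene (names are illustrative only).
partial_transcripts = [
    ["exon1", "exon2", "exon3.a"],
    ["exon2", "exon3.b"],
    ["exon1", "exon3.a"],  # exon2 skipped
]
graph = build_exon_graph(partial_transcripts)
isoforms = enumerate_isoforms(graph, starts={"exon1"}, ends={"exon3.a", "exon3.b"})
for iso in isoforms:
    print(" -> ".join(iso))
```

Merging incomplete transcripts first and enumerating paths afterwards is what lets fragments from different assemblies (or ESTs and mRNAs) contribute to the same gene model.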
We recover annotated isoforms…

USP15

Both annotated isoforms are detected by our pipeline.
…and we detect unknown isoforms.

TOM1

Local assembly increases sensitivity of isoform detection.
Example of extended 3' UTR

SLC25A3
Gene Model Summary
Method                    Genes     Transcripts
Global Assembly           14,832    32,311
Local Assembly            15,297    23,028
Global + Local Assembly   15,934    46,797
*Number of genes and transcripts might be overestimated due to incomplete assembly and spurious splice junctions.
Cross-validation with technical
replicates
Later,
Does independent sequencing data confirm? better data => confirms
Dataset             Single-end mapped      Single-end unmapped    Paired-end mapped      Paired-end unmapped
Line 6 uninfected   18,375,966 (77.93%)    5,203,586 (22.07%)     21,598,218 (64.16%)    12,065,659 (35.84%)
Line 6 infected     17,160,695 (73.18%)    6,288,286 (26.82%)     15,274,638 (63.89%)    8,633,855 (36.11%)
Line 7 uninfected   18,130,072 (75.77%)    5,795,737 (24.22%)     20,961,033 (63.67%)    11,960,299 (36.33%)
Line 7 infected     19,912,046 (78.51%)    5,450,521 (21.49%)     22,485,833 (65.22%)    11,992,002 (34.78%)
Cross-validation w/read splicing

95% of splice junctions have more than three spliced reads
Splice junction comparison
[Venn diagram of splice junctions in three sets]
Set totals: Assembled transcripts 104,366; Genbank mRNA 74,065; Expressed Sequence Tags 209,134
Region counts: 7,756; 2,412; 21,128; 46,132; 17,765; 34,694; 110,543
95% of splice junctions supported by > 4 reads.
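The seven region counts from the diagram can be matched to regions by checking them against the three set totals. The assignment below is inferred from that arithmetic and is not stated on the slide; under it, roughly 80% of assembled splice junctions are supported by mRNA, ESTs, or both. The short set names are labels chosen for this sketch.

```python
# Set totals from the slide.
totals = {"assembled": 104_366, "mrna": 74_065, "est": 209_134}

# Inferred assignment of the seven Venn regions (an assumption, checked below).
regions = {
    frozenset({"assembled"}): 21_128,
    frozenset({"mrna"}): 7_756,
    frozenset({"est"}): 110_543,
    frozenset({"assembled", "mrna"}): 2_412,
    frozenset({"assembled", "est"}): 34_694,
    frozenset({"mrna", "est"}): 17_765,
    frozenset({"assembled", "mrna", "est"}): 46_132,
}


def set_total(name):
    """Sum every region that belongs to the named set."""
    return sum(n for members, n in regions.items() if name in members)


# Each set total must equal the sum of its four regions.
for name, expected in totals.items():
    assert set_total(name) == expected, name

supported = totals["assembled"] - regions[frozenset({"assembled"})]
supported_fraction = supported / totals["assembled"]
print(f"{supported:,} of {totals['assembled']:,} assembled junctions "
      f"({supported_fraction:.1%}) are supported by mRNA and/or ESTs")
```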
Gimme pipeline
- Our pipeline can detect many isoforms
- Local assembly enhances isoform detection
- Cufflinks (mapping-based gene models) is not superior to de novo transcriptome assembly in chicken…
(Was Cufflinks trained on mouse/human?)
- The pipeline can be used to build gene models for other organisms
- Pipeline can do incremental combining of new data sets
How to detect differential splicing
[Figure: spliced-read counts and read coverage along a gene model]
Exon Region Comparison
[Figure: read coverage per exon region]
Skipped Exon
DEXSeq
Skipped Exon
sulfatase
BRCA1 domain
Alternative 3' UTR
DNA repair, apoptosis, DNA replication, genome stability
Differential Exon Usage Summary
Number of exons
Adjusted p-value    False     True
0.1                 18,631    66
0.01                18,656    41
0.001               18,663    34
Chromosome 1
Total 3,728 genes

Next steps: scaling analysis to entire genome.
And… interpretation (??)
Gene model thoughts
- Can build gene models that represent the data we have fairly well;
- Robust exon-exon splice site reporting;
- Planning ahead for multiple iterations of new data;
- …interpretation of results? See story 3.
2. Endless data!
- It is now under $1000 to generate a new mRNAseq data set.
- Collaborators routinely generate new data sets every 3-6 months… (note: each of them, x 5-10…)
- How can we make use of this data iteratively!?
Making iterative use of new data.
[Diagram labels: Data!; Refined gene models; Existing gene models; Differential expression; ??]
Some data will yield new gene models, but much will be redundant (e.g. "housekeeping" genes)
Digital normalization
Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
- Is single pass: looks at each read only once;
- Does not "collect" the majority of sequencing errors;
- Keeps all low-coverage reads;
Enables analyses that are otherwise completely impossible;
Integrated into several assemblers (Trinity and
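A minimal sketch of the single-pass idea, assuming the median k-mer abundance of a read serves as its coverage estimate and a read is kept only while that estimate is below a cutoff. The function names, the exact in-memory counter, and the parameter values are illustrative assumptions; the khmer software uses a memory-efficient probabilistic counting structure rather than an exact dictionary.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Single pass over reads; keep a read only if its estimated coverage is low."""
    counts = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        estimated_coverage = median(counts[km] for km in read_kmers)
        if estimated_coverage < cutoff:
            counts.update(read_kmers)  # only kept reads add to the counts
            yield read


# Toy data: one highly redundant sequence plus one rare sequence.
abundant = "ACGTACGGTCAGTCAGGCTAACGTTAGC"
rare = "TTGACCGTAAGGCTTAACCGGATATCGA"
reads = [abundant] * 50 + [rare]
kept = list(digital_normalization(reads, k=8, cutoff=5))
print(len(reads), "reads in,", len(kept), "reads kept")
```

Because the decision for each read depends only on counts accumulated so far, the filter never needs to revisit earlier reads, and low-coverage reads always pass.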
Evaluating on ascidians (sea squirts):
Molgula oculata

Molgula oculata

Molgula occulta

Ciona intestinalis
Diginorm applied to Molgula embryonic mRNAseq – set aside ~90% of data
Sample              No. of reads    Reads kept    Percentage kept
M. occulta F+3      42,174,510      15,642,268    ?
M. occulta F+3      50,018,302      6,012,894     ?
M. occulta F+4      44,948,983      3,499,935     ?
M. occulta F+5      53,692,296      2,993,715     ?
M. occulta F+6      45,782,981      2,774,342     ?
M. occulta Total    236,617,072     30,923,154    13%
M. oculata F+3      47,045,433      10,754,899    ?
M. oculata F+4      52,890,938      3,949,489     ?
M. oculata F+6      50,156,895      2,874,196     ?
M. oculata Total    150,093,266     17,578,584    11.70%
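The per-sample percentages left as "?" follow directly from the two count columns. The sketch below recomputes them and checks that the per-sample counts add up to the stated totals; sample labels are copied as they appear on the slide, and the helper name is made up for this example.

```python
# (sample, reads in, reads kept) as listed on the slide.
samples = [
    ("M. occulta F+3", 42_174_510, 15_642_268),
    ("M. occulta F+3", 50_018_302, 6_012_894),
    ("M. occulta F+4", 44_948_983, 3_499_935),
    ("M. occulta F+5", 53_692_296, 2_993_715),
    ("M. occulta F+6", 45_782_981, 2_774_342),
    ("M. oculata F+3", 47_045_433, 10_754_899),
    ("M. oculata F+4", 52_890_938, 3_949_489),
    ("M. oculata F+6", 50_156_895, 2_874_196),
]


def percent_kept(total, kept):
    return 100.0 * kept / total


for name, total, kept in samples:
    print(f"{name:<16} {percent_kept(total, kept):5.1f}%")

# Aggregate per species by dropping the trailing stage label.
species_totals = {}
for name, total, kept in samples:
    species = name.rsplit(" ", 1)[0]
    t, k = species_totals.get(species, (0, 0))
    species_totals[species] = (t + total, k + kept)

for species, (total, kept) in species_totals.items():
    print(f"{species} Total  {percent_kept(total, kept):5.1f}%")
```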
But: does diginorm "lose" transcript information? No.
[Venn diagrams: Diginorm vs. Raw assemblies, compared against C. intestinalis]
M. occulta: 37; 13623; 17; missing 2446
M. oculata: 64; 13646; 15; missing 2398
Reciprocal best hit vs. Ciona
BLAST e-value cutoff: 1e-6

Elijah Lowe
Where are we taking diginorm?
- Streaming online algorithms only look at data ~once.
- Diginorm is streaming, online…
- Conceptually, can move many aspects of sequence analysis into streaming mode.
=> Extraordinary potential for computational efficiency.
=> Streaming, online variant calling.

Single pass, reference free, tunable, streaming online varian
Potentially quite clinically useful.

See NIH BIG DATA grant, http://ged.msu.edu/
Prospective: sequencing tumor cells
- Goal: phylogenetically reconstruct causal "driver mutations" in face of passenger mutations.
- 1000 cells x 3 Gbp x 20 coverage: 60 Tbp of sequence.
- Most of this data will be redundant and not useful.
- Developing diginorm-based algorithms to eliminate data while retaining variant information.
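The 60 Tbp figure is a straightforward product of the three numbers given. A small sketch of the arithmetic, with a helper name invented for this example:

```python
def raw_sequence_tbp(cells, genome_size_gbp, coverage):
    """Total sequence in Tbp for a given number of cells, genome size, and depth."""
    total_bp = cells * genome_size_gbp * 1_000_000_000 * coverage
    return total_bp / 1e12


total_tbp = raw_sequence_tbp(cells=1000, genome_size_gbp=3, coverage=20)
print(f"{total_tbp:.0f} Tbp")
```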

See NIH BIG DATA grant, http://ged.msu.edu/
3. Evaluating effects of gene models on pathway prediction

Vertically integrated comparison.

Likit Preeyanon
KEGG Pathway
Ensembl Enriched KEGG Pathway
Term                                            Count    Benjamini
Cytokine-cytokine receptor interaction          36       6.2E-02
Lysosome                                        25       1.2E-01
Apoptosis                                       19       3.5E-01
Arginine and proline metabolism                 12       3.1E-01
Starch and sucrose metabolism                   9        3.4E-01
Toll-like receptor signaling pathway            19       3.7E-01
Natural killer cell mediated cytotoxicity       17       3.4E-01
Cytosolic DNA-sensing pathway                   9        4.2E-01
Valine, leucine and isoleucine degradation      11       4.1E-01
Glutathione metabolism                          10       4.3E-01
NOD-like receptor signaling pathway             11       4.6E-01
Intestinal immune network for IgA production    9        5.6E-01
VEGF signaling pathway                          14       5.6E-01
PPAR signaling pathway                          13       6E-01
Gimme Enriched KEGG Pathway
Term                                            Count    Benjamini
Cytokine-cytokine receptor interaction          34       3.7E-02
Toll-like receptor signaling pathway            22       2.7E-02
Jak-STAT signaling pathway                      28       3.4E-02
Arginine and proline metabolism                 13       4.5E-02
Lysosome                                        22       1.3E-01
Natural killer cell mediated cytotoxicity       17       1.6E-01
Alanine, aspartate and glutamate metabolism     9        1.8E-01
Amino sugar and nucleotide sugar metabolism     10       3.6E-01
Cysteine and methionine metabolism              9        4E-01
ECM-receptor interaction                        16       3.7E-01
Apoptosis                                       16       3.7E-01
Glycolysis / Gluconeogenesis                    11       4E-01
DNA replication                                 8        3.8E-01
Cell adhesion molecules (CAMs)                  19       4.6E-01
PPAR signaling pathway                          12       6E-01
Intestinal immune network for IgA production    8        6.1E-01
Compared Enriched KEGG Pathway
Common:
- Cytokine-cytokine receptor interaction
- Toll-like receptor signaling pathway
- Lysosome
- Apoptosis
- Arginine and proline metabolism
- Natural killer cells
- Intestinal immune network for IgA production
- PPAR signaling pathway
Ensembl:
- Starch and sucrose
- Valine, leucine and isoleucine degradation
- Glutathione metabolism
- NOD-like receptor signaling pathway
- VEGF signaling pathway
Gimme:
- Jak-STAT signaling pathway
- Alanine, aspartate and glutamate metabolism
- Amino sugar and nucleotide sugar metabolism
- ECM-receptor interaction
- Cell adhesion molecules (CAMs)
- DNA replication
[Venn diagram labels: Ensembl; Common; Gimme]
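The comparison amounts to set operations on the term columns of the Ensembl and Gimme enrichment tables. The sketch below recomputes it directly from those two tables, so it can include terms that the comparison slide does not show; term spellings are normalized here (for example "NOD-like" and "Glycolysis"), and the variable names are chosen for this example.

```python
ensembl = {
    "Cytokine-cytokine receptor interaction",
    "Lysosome",
    "Apoptosis",
    "Arginine and proline metabolism",
    "Starch and sucrose metabolism",
    "Toll-like receptor signaling pathway",
    "Natural killer cell mediated cytotoxicity",
    "Cytosolic DNA-sensing pathway",
    "Valine, leucine and isoleucine degradation",
    "Glutathione metabolism",
    "NOD-like receptor signaling pathway",
    "Intestinal immune network for IgA production",
    "VEGF signaling pathway",
    "PPAR signaling pathway",
}

gimme = {
    "Cytokine-cytokine receptor interaction",
    "Toll-like receptor signaling pathway",
    "Jak-STAT signaling pathway",
    "Arginine and proline metabolism",
    "Lysosome",
    "Natural killer cell mediated cytotoxicity",
    "Alanine, aspartate and glutamate metabolism",
    "Amino sugar and nucleotide sugar metabolism",
    "Cysteine and methionine metabolism",
    "ECM-receptor interaction",
    "Apoptosis",
    "Glycolysis / Gluconeogenesis",
    "DNA replication",
    "Cell adhesion molecules (CAMs)",
    "PPAR signaling pathway",
    "Intestinal immune network for IgA production",
}

common = ensembl & gimme
ensembl_only = ensembl - gimme
gimme_only = gimme - ensembl

for label, terms in [("Common", common), ("Ensembl", ensembl_only), ("Gimme", gimme_only)]:
    print(f"{label} ({len(terms)}):")
    for term in sorted(terms):
        print(f"  {term}")
```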
INFB – we annotate UTR not present in other gene models.
INFB – 3' bias + missing UTR => insensitive
[Venn diagram labels: Ensembl; Common; Gimme]
So, where does this leave us?
- Our methods for generating hypotheses from mRNAseq data are sensitive to references & technical details of the approaches.
(This is expected but Bad.)
- We can build (and have built!) approaches that we believe to be more accurate for non- or semi-model organisms.
(They're also open; try 'em out.)
=> Standards for execution, evaluation, comparison, and education.
khmer-protocols:
- Effort to provide standard "cheap" assembly protocols for the cloud.
- Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 per data set (on Amazon rental computers)
- Open, versioned, forkable, citable.
(Announced at Davis in December '13!)
Steps: Read cleaning -> Diginorm -> Assembly -> Annotation -> RSEM differential expression
CC0; BSD; on github; in reStructuredText.
Summer NGS workshop (2010-2017)
A few thoughts on our approach…
- Explicitly a "protocol" – explicit steps, copy-paste, customizable.
- No requirement for computational expertise or significant computational hardware.
- ~1-5 days to teach a bench biologist to use.
- $100-150 of rental compute ("cloud computing")…
- …for $1000 data set.
- Adding in quality control and internal validation steps.
Can we crowdsource bioinformatics?
We already are! Bioinformatics is already a tremendously open and collaborative endeavor. (Let's take advantage of it!)
"It's as if somewhere, out there, is a collection of totally free software that can do a far better job than ours can, with open, published methods, great support networks and fantastic tutorials. But that's madness – who on Earth would create such an amazing resource?"

http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/
Where is the data tidal wave taking biology!?
- A world with a lot more data, and, eventually, a lot more information.
- A more integrative world: genomics, molecular function, evolution, population genetics, monitoring, ??, and models that feed back into experimental design.
"Data-Intensive Biology"
Data intensive biology & hypothesis generation
- My interest in biological data is to enable better hypothesis generation.
Additional projects
- Bacterial symbionts of bone-eating worms – w/Shana Goffredi. (ISME, 2013)
- Genome of Haemonchus contortus, a parasitic nematode (with Erich Schwarz and Robin Gasser). (Genome Biology, 2013)
- Soil metagenome analysis (with Jim Tiedje, Susannah Tringe, and Janet Jansson). (In review, PNAS.)
- Lamprey transcriptome (with Weiming Li). (In preparation.)
- Ascidian genomes and transcriptomes (with Billie Swalla). (In preparation.)
- Loligo pealeii (the giant axon squid) – 5 transcriptomes and skim genome posted publicly (Feb 2014).
In progress
- Cattle paratuberculosis analysis (w/Paul Coussens).
- Improving the chick genome using nth-generation sequencing technology (PacBio, Moleculo).
and building software and protocols to make it easy for the next 1000 genomes.
[Figure: Moleculo data vs chick genome; axes: % of reads aligning, read length]
Luiz Irber
What are the challenges ahead?
- Obviously: Genotype/phenotype mapping.
- But also: Conserved unknown/unannotated genes.
- Data sharing, and more generally open access/data/source/science.
- Data integration!
The problem of lopsided gene characterization is
pervasive: e.g., the brain "ignorome"

"...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression
networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains.
The major distinguishing characteristic between these sets of genes is date of discovery, early
discovery being associated with greater research momentum—a genomic bandwagon effect."

Slide courtesy Erich Schwarz

Ref.: Pandey et al. (2014), PLoS One 11, e88889.
Thanks!
- References and grants at http://ged.msu.edu/research.html
- Software at http://github.com/ged-lab/
- Blog at http://ivory.idyll.org/blog/
- Twitter: @ctitusbrown
E-mail me: ctb@msu.edu

2014 davis-talk


Editor's Notes

  • #10 For the first project, we are interested in finding alternative isoforms that are differentially expressed in chicken lines 6 and 7, which are resistant and susceptible to Marek’s disease, respectively. Both line 6 and line 7 can be infected by Marek’s disease virus, but only line 7 develops T-cell lymphoma. Studies have shown that alternative splicing can increase susceptibility to some diseases in humans, so we hypothesize that it might play the same role in Marek’s disease.
  • #12 In this study we used single-end reads from line 6 and line 7, before and after infection, to build gene models, and used paired-end reads from the same samples for validation. We also used ESTs and mRNA from Genbank to validate the gene models.
  • #13 Then we assemble short reads to obtain longer contigs. We used two assembly methods, called global and local assembly, to increase the sensitivity of isoform detection. We also assemble with multiple k-mer (hash) lengths to obtain transcripts with different expression levels. We then removed low-complexity sequences and trimmed off poly-A tails. Then we mapped all contigs to the genome using BLAT. The alignments from BLAT were then used to predict all putative isoforms, which is done by a program called Gimme that I developed. Then the coding region of each isoform is predicted by ESTScan.
  • #14 In the pipeline we used two assembly methods, called global and local assembly. In local assembly, only reads mapped to a genome are assembled; in global assembly, all reads are assembled. Basically, we used a program that can map both spliced and unspliced reads to the genome, for example Tophat. Then we extracted the reads mapped to each chromosome and assembled those reads separately using Velvet and Oases.
  • #15 This figure shows sequences from assembly aligned to the chicken genome. Oftentimes we do not get a complete transcript from assembly, so I developed Gimme, a program that assembles transcripts based on sequence alignment. It basically merges all incomplete transcripts from assembly together and predicts the structure of the gene model with all possible isoforms. The program works with all kinds of sequences, including expressed sequence tags and mRNAs. Therefore, we can also incorporate data from other sources to build gene models.
  • #18 This is an example of complete annotated gene models compared with gene models from our pipeline. Our gene models include both isoforms as well as the correct coding region.
  • #19 And this figure shows extra isoforms that are only detected by local assembly. The highlighted exon is not found in global assembly but is annotated in the reference sequence; this means that global assembly is missing a real exon, which can only be found by local assembly.
  • #20 Gene models from RNA-Seq can be used to improve existing gene models; for example, we can extend untranslated regions, which are not well annotated and are difficult to predict from a genome sequence.
  • #21 From our gene models, the total number of genes is about 15,000, with 47,000 transcripts; however, this number is overestimated due to incomplete assembly.
  • #22 The easiest way to validate gene models is to map the same set of reads back to the gene models. We found that up to 78% of single-end reads are mapped to the gene models. This number is high for RNA-Seq data and really indicates that the gene models are high-quality. Also, up to 65% of paired-end reads from the same samples are mapped to the gene models. The paired-end mapping is more stringent, so the number helps confirm the good quality of the gene models.
  • #24 To validate splice junctions, we compared splice junctions found in our models to ESTs and mRNA. ~80% of splice junctions are supported by Genbank mRNA or ESTs or both, which indicates that these splice junctions are real. The 21,000 splice junctions that are not supported by mRNA and ESTs may include some novel splice junctions.
  • #25 To summarize, our method can detect many known and unknown isoforms from RNA-Seq data, and the local assembly technique increases the sensitivity of isoform detection. Cufflinks is not better than de novo assembly in chicken. And the pipeline should work with RNA-Seq data from other organisms.
  • #29 The green model is from single-end reads. The skipped exon is not included in the gene models but is detected by DEXSeq.
  • #31 6x more. What do we do?
  • #33 Since I work with multiple people, I really notice.
  • #44 Note general problem with bioinfo.
  • #54 Translation initiation factor
  • #60 Lure them in with bioinformatics and then show them that Michigan, in the summertime, is quite nice!
  • #61 Think lab protocol.
  • #62 More generally….