2014 ucl

C. Titus Brown
Assistant Professor
MMG, CSE, BEACON
Michigan State University
May 2014
ctb@msu.edu
Large-scale transcriptome sequencing of non-model
organisms: coping mechanisms

We practice open science!
Everything discussed here:
• Code: github.com/ged-lab/ ; BSD license
• Blog: http://ivory.idyll.org/blog (‘titus brown blog’)
• Twitter: @ctitusbrown
• Grants on Lab Web site: http://ged.msu.edu/research.html
• Preprints available.
Everything is > 80% reproducible by you.

The challenges of non-model transcriptomics
• Missing or low quality genome reference.
• Evolutionarily distant.
• Most extant computational tools focus on model organisms:
  • Assume low polymorphism (internal variation)
  • Assume reference genome
  • Assume somewhat reliable functional annotation
  • More significant compute infrastructure
…and cannot easily or directly be used on critters of interest.

Outline
1. Challenges of non-model transcriptomics.
2. Lamprey: too much data, not enough genome
3. Digital normalization as a coping mechanism
4. …applied to Molgulid ascidians…
5. …and back to lamprey.
6. More transcriptome challenges
7. What’s next? (Implications of free data + free
data analysis.)

Sea lamprey in the Great Lakes
• Non-native
• Parasite of medium to large fishes
• Caused populations of host fishes to crash
Li Lab / Y-W C-D

The problem of lamprey:
• Diverged at base of vertebrates; evolutionarily distant from model organisms.
• Large, complicated genome (~2 Gb)
• Relatively little existing sequence.
• We sequenced the liver genome…

Lamprey has incomplete genomic sequence
J. Smith et al., PNAS 2009
Evidence of somatic recombination; 100s of Mb of sequence eliminated from genome during development.
More recent evidence (unpub., J. Smith et al.) suggests that this loss is developmentally regulated, results in changes in gene expression (due to loss of genes!), and is tissue specific.
Liver genome is not the entire genome.

Lamprey tissues for which we have mRNAseq
• embryo stages (late blastula, gastrula, neurula, 22b, neural-crest migration, 24c1, 24c2)
• metamorphosis 2 (liver, intestine, kidney)
• metamorphosis 3 (intestine, kidney)
• metamorphosis 5 (liver, intestine, kidney)
• larval (gill, kidney, liver, intestine)
• juvenile (intestine, liver, kidney)
• small parasite distal intestine, kidney, proximal intestine
• freshwater (gill, intestine, kidney)
• salt water (gill, intestine)
• adult intestine
• adult kidney
• brain paired
• brain (0, 3, 21 dpi)
• spinal cord (0, 3, 21 dpi)
• lips
• monocytes
• supraneural tissue
• ovulatory female head skin
• preovulatory female eye
• preovulatory female tail skin
• prespermiating male gill
• spermiating male gill
• spermiating male head skin
• spermiating male muscle
• mature adult male rope tissue

Assembly
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th
It was the best of times, it was the worst of times, it was the age of
wisdom, it was the age of foolishness
…but for lots and lots of fragments!
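
A toy sketch of the idea on this slide, using the four Dickens fragments as "reads". It greedily merges the pair with the longest suffix/prefix overlap. This is an illustration only: the function names are made up here, and real transcriptome assemblers work on De Bruijn graphs and must cope with sequencing errors and repeats, which this sketch ignores.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Repeatedly merge the two fragments with the largest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i == j:
                    continue
                k = overlap(a, b)
                if k > best_k:
                    best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; leave separate contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


READS = [
    "It was the best of times, it was the wor",
    ", it was the worst of times, it was the",
    "isdom, it was the age of foolishness",
    "mes, it was the age of wisdom, it was th",
]
contigs = greedy_assemble(READS)
```

With only four error-free fragments the greedy choice is safe; with billions of reads and repeated phrases ("it was the" occurs four times here) it is not, which is part of why assembly is hard.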

Shared low-level transcripts may not reach the threshold for assembly.

Main problem (4 years ago):
We have a massive amount of data that
challenges existing computers when we try to
assemble it all together.

Solution: Digital normalization
(a computational version of library normalization)
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!!
This 100x will consume disk space and, because of errors, memory.
We can discard it for you…

Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
• Is single pass: looks at each read only once;
• Does not “collect” the majority of errors;
• Keeps all low-coverage reads;
• Smooths out coverage of sequencing.
=> Enables analyses that are otherwise completely impossible.
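
A minimal sketch of the single-pass idea, assuming the usual median k-mer abundance rule: estimate each read's coverage from the k-mers seen so far, keep the read only if that estimate is below a cutoff. The names and the values of `k` and `cutoff` are illustrative, and an exact in-memory `Counter` stands in for the memory-efficient probabilistic counting that khmer uses. Reverse complements are ignored here.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diginorm(reads, k=20, cutoff=20):
    """Yield reads whose estimated coverage is still below cutoff.

    Single pass: each read is examined once, and only kept reads
    contribute their k-mers to the running counts.
    """
    counts = Counter()
    for read in reads:
        kms = kmers(read, k)
        if not kms:
            continue  # read shorter than k
        if median(counts[km] for km in kms) < cutoff:
            counts.update(kms)
            yield read


# Toy data: one abundant "transcript" read and one rare one.
abundant = "ACGTTGCAAG"
rare = "GGATCCTTAC"
reads = [abundant] * 100 + [rare] * 2
kept = list(diginorm(reads, k=4, cutoff=5))
```

In this toy run the abundant read is cut off once its k-mers reach the cutoff, while both copies of the rare read survive. That is the "keeps all low-coverage reads" and "smooths out coverage" behavior from the slide.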

Evaluating diginorm: how?
• Can’t assemble lamprey w/o diginorm; are results any good & how would we know?
• Need comparative data set
• …ascidians!

Looking at the Molgula…
Putnam et al., 2008, Nature. Modified from Swalla 2001

Sea squirts!
Molgula oculata
Molgula occulta
Molgula oculata Ciona intestinalis
Elijah Lowe; collaboration w/Billie Swalla

Tail loss and notochord genes
a) M. oculata b) hybrid (occulta egg x oculata sperm) c) M. occulta
Notochord cells in orange
Swalla, B. et al. Science, Vol 274, Issue 5290, 1205-1208, 15 November 1996

Diginorm applied to Molgula embryonic
mRNAseq

Substantial time savings (3-5x) and far less RAM
Elijah Lowe

Question: does it matter what assembly pipeline you use? (No)
[Venn diagram: overlap among four assemblies: Diginorm V/O, Raw V/O, Diginorm Trinity, Raw Trinity]
Numbers are putative orthologs (reciprocal best hits) w/ Ciona intestinalis, calculated for each assembly.
Elijah Lowe
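
For readers unfamiliar with reciprocal best hits, a minimal sketch: a pair counts only if each sequence is the other's top-scoring hit. The input is assumed to be `(query, subject, score)` tuples, for example parsed from a similarity search. The identifiers and scores below are hypothetical, not data from these slides.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return {(a, b) for a, b in a_best.items() if b_best.get(b) == a}


# Hypothetical hits: assembly transcripts vs. reference proteins.
a_vs_b = [
    ("molg1", "ciona1", 200.0),
    ("molg1", "ciona2", 50.0),
    ("molg2", "ciona2", 80.0),
]
b_vs_a = [
    ("ciona1", "molg1", 190.0),
    ("ciona2", "molg1", 60.0),
    ("ciona2", "molg2", 40.0),
]
orthologs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here `molg2` is dropped: its best hit is `ciona2`, but `ciona2` prefers `molg1`.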

Why Trinity vs Oases?
Trinity is slightly better at picking out isoforms.
Elijah Lowe

How complete are these
transcriptomes?
Elijah Lowe

Transcriptome assembly thoughts
• We can (now) assemble really big data sets, and get pretty good results.
• We have lots of evidence (some presented here :) that some assemblies are not strongly affected by digital normalization.
(Note: normalization algorithm is now a standard part of the Trinity mRNAseq pipeline.)

Transcriptome results - lamprey
• Started with 5.1 billion reads from 50 different tissues.
(4 years of computational research, and about 1 month of compute time, GO HERE)
Ended with:

Lamprey transcriptome basic stats
• 616,000 transcripts (!)
• 263,000 transcript families (!)
(This seems like a lot.)

Lamprey transcriptome basic stats
• 616,000 transcripts
• 263,000 transcript families
• Only 20,436 transcript families have transcripts > 1 kb
(compare with mouse: 17,331 of 29,769 genes are > 1 kb)
So, by rule of thumb, the estimate is not that far off for long transcripts.

Common vs rare genes
[Plot: # transcripts vs. # samples]
Camille Scott
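
One way a "# transcripts vs. # samples" plot could be tabulated, as a sketch: count, for each transcript, how many samples it appears in, then histogram those counts. The sample names and transcript ids are hypothetical, and the slides do not say how presence was called.

```python
from collections import Counter


def sample_count_histogram(presence):
    """presence maps sample name -> set of transcript ids present.

    Returns a Counter {number of samples: number of transcripts}.
    """
    samples_per_transcript = Counter()
    for transcripts in presence.values():
        # Each set contributes each transcript at most once.
        samples_per_transcript.update(transcripts)
    return Counter(samples_per_transcript.values())


# Hypothetical presence calls for three tissues.
presence = {
    "liver": {"t1", "t2", "t3"},
    "kidney": {"t2", "t3", "t4"},
    "brain": {"t5"},
}
histogram = sample_count_histogram(presence)
```

Transcripts at the low end of the histogram are the "rare" ones (seen in one tissue); those at the high end are shared across tissues.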

Can look at transcripts by tissue --
Camille Scott

Too… many… samples…
Camille Scott
Presence/absence clustering
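
A sketch of the distance step behind a presence/absence clustering, assuming Jaccard distance between each pair of samples. The slides do not state which distance or clustering method was used, so both the metric and the data below are assumptions. The resulting distances could then be fed to any hierarchical clustering routine.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; defined as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(presence):
    """Distance for every unordered pair of samples, keyed by sorted names."""
    names = sorted(presence)
    return {
        (x, y): jaccard_distance(presence[x], presence[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }


# Hypothetical presence calls: sample -> transcripts detected.
tissue_transcripts = {
    "liver": {"t1", "t2", "t3"},
    "kidney": {"t2", "t3", "t4"},
    "brain": {"t5"},
}
distances = pairwise_distances(tissue_transcripts)
```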

Expression-based clustering
Some known biology recapitulated; and… ???
Camille Scott

Next challenges
OK, we can deal with volume of data, make pretty
pictures, and ... Now what?

Contamination!
Both experimental and “real” contaminants are big problems.
Camille Scott

Pathway predictions vary dramatically depending on data set, annotation
KEGG pathway comparison across several different gene annotation sets for chicken
Likit Preeyanon

The problem of lopsided gene characterization is
pervasive: e.g., the brain "ignorome"
"...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression
networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains.
The major distinguishing characteristic between these sets of genes is date of discovery, early
discovery being associated with greater research momentum—a genomic bandwagon effect."
Ref.: Pandey et al. (2014), PLoS One 9, e88889.
Slide courtesy Erich Schwarz

Practical implications of diginorm
• Data is (essentially) free;
• For some problems, analysis is now cheaper than data gathering (i.e. essentially free);
• …plus, we can run most of our approaches in the cloud (per-hour rental compute resources).

1. khmer-protocols
• Effort to provide standard “cheap” assembly protocols for the cloud.
• Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis.
• Open, versioned, forkable, citable.
(“Don’t bother me unless it doesn’t work.”)
Pipeline: Read cleaning → Diginorm → Assembly → Annotation → RSEM differential expression

CC0; BSD; on github; in reStructuredText.

A few thoughts on our approach…
• Explicitly a “protocol”: explicit steps, copy-paste, customizable.
• No requirement for computational expertise or significant computational hardware.
• ~1-5 days to teach a bench biologist to use.
• $100-150 of rental compute (“cloud computing”)…
• …for a $1000 data set.
• Adding in quality control and internal validation steps.

Can we crowdsource bioinformatics?
We already are! Bioinformatics is already a tremendously open and
collaborative endeavor. (Let’s take advantage of it!)
“It’s as if somewhere, out there, is a collection of totally free software
that can do a far better job than ours can, with open, published
methods, great support networks and fantastic tutorials. But that’s
madness – who on Earth would create such an amazing resource?”
- http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/

2. Data availability is important for annotating distant sequences
[Chart legend: no similarity; Cephalopod; Mollusc; Anything else]

Can we incentivize data sharing?
• ~$100-$150/transcriptome in the cloud
• Offer to analyze people’s existing data for free, IFF they open it up within a year.
See:
• CephSeq white paper.
• “Dead Sea Scrolls & Open Marine Transcriptome Project” blog post.

First results: Loligo genomic/transcriptome resources
Putting other people’s sequences where my mouth is:
w/Josh Rosenthal and Brenton Graveley

“Research singularity”
The data a researcher generates in their lab constitutes an increasingly small component of the data used to reach a conclusion.
Corollary: The true value of the data an individual investigator generates should be considered in the context of aggregate data.
Even if we overcome the social barriers and incentivize sharing, we are, needless to say, not remotely prepared for sharing all the data.

Acknowledgements
Lab members involved
• Adina Howe (w/Tiedje)
• Jason Pell
• Arend Hintze
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald
• Camille Scott
• Jordan Fish
• Michael Crusoe
• Leigh Sheneman
Collaborators
• Billie Swalla (UW)
• Josh Rosenthal (UPR)
• Weiming Li (MSU)
• Ona Bloom (Feinstein), Jen Morgan (MBL), Joe Buxbaum (MSSM)
Funding
USDA NIFA; NSF IOS; NIH; BEACON.

Current research (khmer software)
• Efficient online counting of k-mers
• Trimming reads on abundance
• Efficient De Bruijn graph representations
• Read abundance normalization
• Streaming algorithms for assembly, variant calling, and error correction
• Cloud assembly protocols
• Efficient graph labeling & exploration
• Data set partitioning approaches
• Assembly-free comparison of data sets
• HMM-guided assembly
• Efficient search for target genes
