Like the dog that caught the bus:
now what?
Sequencing, Big Data, and Biology
C. Titus Brown
Assistant Professor
MMG, CSE, BEACON
Michigan State University
Feb 2014
ctb@msu.edu
The challenges of non-model
transcriptomics
• Missing or low quality genome reference.
• Evolutionarily distant.
• Most extant computational tools focus on model organisms –
  • Assume low polymorphism (internal variation)
  • Assume reference genome
  • Assume somewhat reliable functional annotation
  • More significant compute infrastructure

…and cannot easily or directly be used on critters of
interest.
Isoform analysis – some easy…
Isoform analysis – some hard

Counting methods mostly rely on presence of unique
sequence to which to map.
Types of Alternative Splicing
40%

25%

<5%, more in plants, fungi, protozoa

Keren H, Lev-Maor G & Ast G, Nat Rev Genet 2010
locate, given genomic
sequence
Genome-reference-free assembly
leads to many isoforms.

Massive redundancy!
Gene models can be “collapsed”
given genomic sequence.
In sum,
• mRNAseq is pretty easy to deal with if you have a good genomic sequence.
• We don't have a good genomic sequence for many organisms, including lamprey.
• We need to do de novo assembly to construct a transcriptome from short reads.
• We also have lots and lots of mRNAseq sequence:
The problem of lamprey…
• Diverged at base of vertebrates; evolutionarily distant from model organisms.
• Large, complicated genome (~2 Gb)
• Relatively little existing sequence.
• We sequenced the liver genome…
Sea lamprey in the Great Lakes
• Non-native
• Parasite of medium to large fishes
• Caused populations of host fishes to crash

Li Lab / Y-W C-D
Lamprey has incomplete genomic sequence

Evidence of somatic recombination;
100s of Mb of sequence eliminated
from genome during development.
More recent evidence (unpub, J.
Smith et al.) suggests that this loss
is developmentally
regulated, results in changes in
gene expression (due to loss of
genes!), and is tissue specific.
Liver genome is not the entire
genome.

J. Smith et al., PNAS 2009
Lamprey tissues for which we have
mRNAseq
embryo stages (late blastula, gastrula, neurula, 22b,
neural-crest migration, 24c1, 24c2)

metamorphosis 3 (intestine,
kidney)

ovulatory female head skin
preovulatory female eye

adult intestine

metamorphosis 4 (intestine,
kidney)

preovulatory female tail skin

adult kidney

metamorphosis 5 (liver, intestine,
kidney)

brain paired

metamorphosis 6 (intestine,
kidney)

prespermiating male gill

freshwater (gill, intestine, kidney)

metamorphosis 7 (intestine,
kidney)

mature adult male rope tissue

larval (gill, kidney, liver, intestine)

monocytes

juvenile (intestine, liver, kidney)

brain (0,3,21 dpi)

lips

spinal cord (0,3,21 dpi)

metamorphosis 1 (intestine,
kidney)

metamorphosis 2 (liver, intestine,

spermiating male muscle

spermiating male gill
spermiating male head skin
supraneural tissue
small parasite distal intestine,
kidney, proximal intestine
salt water (gill, intestine)
Assembly
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th

It was the best of times, it was the worst of times, it was
the age of wisdom, it was the age of foolishness

…but for lots and lots of fragments!
Shared low-level
transcripts may
not reach the
threshold for
assembly.
Two problems:
• We have a massive amount of data that challenges existing computers, and we want to assemble it all together.
• We need to construct transcript families (to collapse isoforms) without having a solid reference genome.
Solution 1: Digital normalization
(a computational version of library normalization)

Suppose you have a
dilution factor of A (10) to
B(1). To get 10x of B you
need to get 100x of A!
Overkill!!
This 100x will consume
disk space and, because
of errors, memory.
We can discard it for
you…
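The arithmetic behind this slide, as a minimal sketch. The function name and the abundance values are illustrative assumptions, not code from the talk: if sequencing depth scales with relative abundance, getting the rarest transcript to a target coverage drags every other transcript up in proportion.

```python
def required_coverage(abundances, target):
    """Coverage each transcript reaches once the rarest one hits `target`.

    `abundances` maps a transcript name to its relative abundance
    (hypothetical values; only the ratios matter).
    """
    rarest = min(abundances.values())
    # Depth is proportional to abundance, so each transcript ends up at
    # target * (its abundance / the rarest abundance).
    return {name: target * ab / rarest for name, ab in abundances.items()}


coverage = required_coverage({"A": 10, "B": 1}, target=10)
print(coverage)  # {'A': 100.0, 'B': 10.0}
```

The 90x of A beyond what assembly needs is the redundancy that normalization aims to discard.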
Digital normalization
Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
• Is single pass: looks at each read only once;
• Does not “collect” the majority of errors;
• Keeps all low-coverage reads;
• Smooths out coverage of regions.

=> Enables analyses that are otherwise completely
impossible.
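A minimal sketch of the idea, not the khmer implementation: estimate each read's coverage as the median count of its k-mers among the reads kept so far, and keep the read only if that estimate is below a cutoff. The function names, the defaults for `k` and `cutoff`, and the exact in-memory `Counter` are assumptions for illustration; a production tool would use a fixed-memory counting structure instead.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diginorm(reads, k=20, cutoff=20):
    """Single-pass normalization: keep a read only while its estimated
    coverage (median k-mer count so far) is below `cutoff`."""
    counts = Counter()  # exact counts; assumption made for simplicity
    kept = []
    for read in reads:  # each read is examined exactly once
        ks = kmers(read, k)
        if not ks:
            kept.append(read)  # too short to estimate coverage; keep it
            continue
        if median(counts[km] for km in ks) < cutoff:
            kept.append(read)
            counts.update(ks)  # only kept reads contribute to the counts
    return kept


common, rare = "ACGTTGCAAG", "TTTTGGGGCC"
kept = diginorm([common] * 50 + [rare], k=4, cutoff=5)
print(len(kept), kept.count(common), kept.count(rare))  # 6 5 1
```

In the toy run the 50 copies of the abundant read are cut down to the cutoff, while the single low-coverage read survives.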
Solution 2: Partitioning transcripts
into “transcript families”

Transcript family

Pell et al., 2012, PNAS
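One way to picture partitioning, as a simplified sketch: treat two transcripts as belonging to the same family if they share at least one k-mer, and take connected components. This is an assumption-laden stand-in for the graph partitioning described in Pell et al.; it ignores reverse complements and sequencing errors, and the function name and `k` default are hypothetical.

```python
def partition(transcripts, k=20):
    """Group transcripts into families: connected components of the
    'shares at least one k-mer' relation, via union-find."""
    parent = list(range(len(transcripts)))

    def find(i):
        # Find the component root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first transcript containing it
    for idx, seq in enumerate(transcripts):
        for i in range(len(seq) - k + 1):
            km = seq[i:i + k]
            if km in first_seen:
                parent[find(idx)] = find(first_seen[km])  # merge families
            else:
                first_seen[km] = idx

    families = {}
    for idx in range(len(transcripts)):
        families.setdefault(find(idx), []).append(idx)
    return sorted(families.values())


toy = ["AAACCCGGGTTT", "CCCGGGTTTAAA", "TGCATGCATGCA"]
families = partition(toy, k=6)
print(families)  # [[0, 1], [2]]
```

Isoforms of one gene share exons, so they share k-mers and fall into the same family without any reference genome.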
Transcriptome results - lamprey
• Started with 5.1 billion reads from 50 different tissues.
(4 years of computational research, and about 1
month of compute time, GO HERE)

Ended with:
Lamprey transcriptome basic
stats
• 616,000 transcripts (!)
• 263,000 transcript families (!)

(This seems like a lot.)
Lamprey transcriptome basic
stats
• 616,000 transcripts
• 263,000 transcript families
• Only 20,436 transcript families have transcripts > 1 kb
(compare with mouse: 17,331 of 29,769 genes
are > 1 kb)
So, by rule of thumb, the estimate is not that far off for long
transcripts.
Validation
Assume computers lie. How do we judge precision
& recall?
1) Homology!

Do we see sequence similarity to e.g. mouse
sequences?
2) Orthogonal data sets and analyses

For example, look at sperm genome, or
independently cloned CDS.
Evolution: mouse
• 58,000 lamprey transcript families have some matches to mouse.
• 10,000 putative orthologs (reciprocal best hits)
So that's a pretty good sign.
(expecting ~30k total genes)
Conclusion:
These numbers “feel” good to me; hard to know
what to expect after ~350-500 million years of divergence.
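A minimal sketch of the reciprocal-best-hit criterion used for the putative orthologs. It assumes the best hit per query has already been extracted from a similarity search in each direction; the function name and the identifiers in the example are hypothetical.

```python
def reciprocal_best_hits(a_to_b, b_to_a):
    """Pairs (a, b) where b is a's best hit AND a is b's best hit.

    Each argument maps a query ID to the ID of its single best hit
    in the other sequence set.
    """
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)


# Hypothetical best-hit tables (lamprey family -> mouse gene and back).
lamprey_to_mouse = {"fam1": "geneA", "fam2": "geneA", "fam3": "geneC"}
mouse_to_lamprey = {"geneA": "fam1", "geneC": "fam3", "geneD": "fam2"}

rbh = reciprocal_best_hits(lamprey_to_mouse, mouse_to_lamprey)
print(rbh)  # [('fam1', 'geneA'), ('fam3', 'geneC')]
```

Here `fam2` matches mouse but is not reciprocal, which is why the count of putative orthologs is much smaller than the count of families with some match.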
Orthogonal data set: pm2 (liver
genome)
• 64% of our new transcript families have a match in pm2.
• 71% of conserved transcript families have a match in pm2.
• 83% of long transcripts have a match in pm2.
Good – we don't expect 100%, because we know pm2
is probably missing stuff. So that means:

Conclusion:
At least 64% of transcript families are “really lamprey”
(and > 83% of the long transcripts!)
Orthogonal data set: sperm genome
• 94.2% of ref-based transcripts have a match in sperm genome.
• 98.2% of full-length cDNAs have a match in sperm genome.

So sperm genome is “pretty good” for cross
validation.
But only
• 71% of our new transcript families have a match in sperm genome. ??
Orthogonal data set: sperm genome
• 94.2% of ref-based transcripts have a match in sperm genome.
• 98.2% of full-length cDNAs have a match in sperm genome.
New transcriptome:
• 71% of transcript families have a match in sperm genome.
• 92% (!!) of long transcript families have a match in sperm genome.
(Since the sperm genome is low coverage, this length
dependence makes sense – the longer the
Conclusion:
Our is poorer than but comparable
Orthogonal data set: full-length
cDNAs
• We can look at both precision and recall by asking
  • Are known sequences represented completely by a single transcript? (“best match”)
  • Are known sequences covered by one or more transcripts? (“total matches”)
70%
90%
Best matches – not great.
Total matches – better!
Ref-based (lamp0) “best” are better
than new assembly (lamp3)
lamp3 “total” is better than lamp0
Conclusions from full-length
cDNA
• Ref-based data set has longer “best matches” (better precision; less fragmented)
• De novo assembly is more sensitive overall (better recall; contains more real sequences)
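The two measures can be sketched as follows. This is an illustration under assumptions, not the analysis code behind the slides: each assembled transcript is reduced to one aligned interval on the known cDNA, "best match" is the largest fraction covered by any single transcript, and "total matches" is the fraction covered by the union of all of them. The function name and the coordinates are hypothetical.

```python
def coverage_fractions(ref_length, alignments):
    """Return (best, total) coverage of a reference sequence.

    `alignments` holds one half-open (start, end) interval per transcript,
    in reference coordinates.
    """
    if not alignments:
        return 0.0, 0.0
    # Best match: the single transcript covering the most reference bases.
    best = max(end - start for start, end in alignments) / ref_length
    # Total matches: length of the union of all intervals.
    covered, cursor = 0, 0
    for start, end in sorted(alignments):
        start = max(start, cursor)  # skip bases already counted
        if end > start:
            covered += end - start
            cursor = end
    return best, covered / ref_length


best, total = coverage_fractions(1000, [(0, 400), (300, 700), (650, 900)])
print(best, total)  # 0.4 0.9
```

A fragmented assembly shows exactly this pattern: a low best-match fraction alongside a high total-match fraction.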
Mapping percentages
(with orthogonal data)
Ona Bloom generated more data; how much
maps?

            BR        SC
Ref-based   29.20%    42.94%
New/all     100.00%   100.00%
New/long    45.99%    46.89%

Conclusion:
Ref-based is considerably less “complete” than
new, de novo transcriptome assembly.
Lamprey transcriptome conclusions
• A substantial portion of the new transcriptome seems “good”:
  • 58k transcript families with mouse homology, 10k orthologs;
  • 20k transcript families with transcripts > 1kb.
  • Good matches to liver genome & sperm genome.
  • Reasonable numbers ~mouse.
  • Much (!) better than ref-based for mapping. (2x as good)

But!
• Poor recall of known full-length cDNA !?
• 240k partitions with only small sequences !?
=> microbial contamination?
Separate question: how much of the
pm2 genome is missing??
• 64% of lamp3 transcript families match to pm2.
• 82.5% of long transcript families match to pm2.
• 71% of lamp3 transcript families conserved with mouse match in pm2.
Conclusion I:
Probably about 30% of genic sequence is missing.
Separate question: how much of the
pm2 genome is missing??
• 64% of lamp3 transcript families match to pm2.
• 82.5% of long transcript families match to pm2.
• 71% of lamp3 transcript families conserved with mouse match in pm2.
• 22.5% of sperm genome contigs have no hits in pm2.

Conclusion II (firmer):
About 30% of single-copy sequence is missing.
CEGMA based completeness estimates
(Core eukaryotic genes)

                          Number   Completeness /   Completeness /
                          seqs     100% matches     partial matches
lamp3 entire              620k     70.6             96.4
lamp3 all ORFs > 80aa     269k     46.4             89
lamp3 longest ORF in tr   80k      41.1             77.8
lamp0                     11k      44.7             62.5

Camille Scott
Looking at the Molgula…

Putnam et al., 2008, Nature.
Modified from Swalla 2001
What do these animals look like?
Molgula oculata

Molgula oculata

Molgula occulta

Ciona intestinalis
Tail loss and notochord genes

a) M. oculata b) hybrid (occulta egg x oculata sperm) c) M. occulta
Notochord cells in orange
Swalla, B. et al. Science, Vol 274, Issue 5290, 1205-1208 , 15 November 1996
Diginorm applied to Molgula
embryonic mRNAseq

                    No. of reads   Reads kept    Percentage kept
M. occulta F+3      42,174,510     15,642,268    ?
M. occulta F+3      50,018,302     6,012,894     ?
M. occulta F+4      44,948,983     3,499,935     ?
M. occulta F+5      53,692,296     2,993,715     ?
M. occulta F+6      45,782,981     2,774,342     ?
M. occulta Total    236,617,072    30,923,154    13%
M. oculata F+3      47,045,433     10,754,899    ?
M. oculata F+4      52,890,938     3,949,489     ?
M. oculata F+6      50,156,895     2,874,196     ?
M. oculata Total    150,093,266    17,578,584    11.70%
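The percentage-kept column follows directly from the two count columns. A minimal sketch using the totals from the table above; the function and dictionary names are assumptions, and the per-stage rows marked "?" can be filled in the same way.

```python
def percent_kept(total_reads, reads_kept):
    """Share of reads retained after digital normalization, in percent."""
    return 100 * reads_kept / total_reads


# (No. of reads, reads kept) for the two "Total" rows of the table.
totals = {
    "M. occulta": (236_617_072, 30_923_154),
    "M. oculata": (150_093_266, 17_578_584),
}

for species, (n_reads, n_kept) in totals.items():
    print(f"{species}: {percent_kept(n_reads, n_kept):.2f}% kept")
```

Both totals come out in line with the 13% and 11.70% shown on the slide.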
Question: does normalization “lose”
transcript information?
M. occulta
Diginorm
Raw

37

C. intestinalis

13623

M. oculata
Diginorm
Raw

17

missing 2446

64

C. intestinalis

13646

15

missing 2398

Reciprocal best hit vs. Ciona
Blast e-value cutoff: 1e-6
Elijah Lowe
Transcriptome assembly
thoughts
• We can (now) assemble really big data sets, and get pretty good results.
• We have lots of evidence (some presented here :) that some assemblies are not strongly affected by digital normalization.
Practical implications of diginorm
• Data is (essentially) free;
• For some problems, analysis is now cheaper than data gathering (i.e. essentially free);
• …plus, we can run most of our approaches in the cloud.
1. khmer-protocols
• Effort to provide standard “cheap” assembly protocols for the cloud.
• Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set.
• Open, versioned, forkable, citable.

Pipeline: Read cleaning → Diginorm → Assembly → Annotation → RSEM differential expression

CC0; BSD; on github; in reStructuredText.
2. Data availability is important for
annotating distant sequences
no similarity

Anything else

Mollusc

Cephalopod
Can we incentivize data sharing?
• ~$100-$150/transcriptome in the cloud
• Offer to analyze people's existing data for free, IFF they open it up within a year.
See:
• CephSeq white paper.
• “Dead Sea Scrolls & Open Marine Transcriptome Project” blog post;
First results: Loligo
genomic/transcriptome resources
Putting other people's sequences where my
mouth is:
Tools to routinely update metazoan
orthology/homology relationships
• > 100 mRNAseq data sets already;
• Build interconnections between them via homology;
• Build tools to update interconnections as new data sets arrive.
• Provide raw data, processed data, underlying tools, simple Web interface, all CC0/in da cloud/open/reproducible.
(Question: what biology problems could we tackle?)
“Research singularity”
The data a researcher generates in their lab
constitutes an increasingly small component of
the data used to reach a conclusion.
Corollary: The true value of the data an individual
investigator generates should be considered in the
context of aggregate data.
Even if we overcome the social barriers and
incentivize sharing, we are, needless to say, not
remotely prepared for sharing all the data.
We practice open science!
Everything discussed here:
• Code: github.com/ged-lab/ ; BSD license
• Blog: http://ivory.idyll.org/blog ('titus brown blog')
• Twitter: @ctitusbrown
• Grants on Lab Web site: http://ged.msu.edu/research.html
• Preprints: on arXiv, q-bio: 'diginorm arxiv'
Acknowledgements
Lab members involved
Adina Howe (w/Tiedje)
Jason Pell
Arend Hintze
Qingpeng Zhang
Elijah Lowe
Likit Preeyanon
Jiarong Guo
Tim Brom
Kanchan Pavangadkar
Eric McDonald
Camille Scott
Jordan Fish
Michael Crusoe
Leigh Sheneman

Collaborators
• Josh Rosenthal (UPR)
• Weiming Li, MSU
• Ona Bloom (Feinstein), Jen Morgan (MBL), Joe Buxbaum (MSSM)

Funding
USDA NIFA; NSF IOS;
NIH; BEACON.

More Related Content

What's hot

Microbial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
Microbial Phylogenomics (EVE161) Class 17: Genomes from UnculturedMicrobial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
Microbial Phylogenomics (EVE161) Class 17: Genomes from UnculturedJonathan Eisen
 
[2013.10.29] albertsen genomics metagenomics
[2013.10.29] albertsen genomics metagenomics[2013.10.29] albertsen genomics metagenomics
[2013.10.29] albertsen genomics metagenomicsMads Albertsen
 
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...Larry Smarr
 
Microbial Phylogenomics (EVE161) Class 13 - Comparative Genomics
Microbial Phylogenomics (EVE161) Class 13 - Comparative GenomicsMicrobial Phylogenomics (EVE161) Class 13 - Comparative Genomics
Microbial Phylogenomics (EVE161) Class 13 - Comparative GenomicsJonathan Eisen
 
CAMERA Presentation at KNAW ICoMM Colloquium May 2008
CAMERA Presentation at KNAW ICoMM Colloquium May 2008CAMERA Presentation at KNAW ICoMM Colloquium May 2008
CAMERA Presentation at KNAW ICoMM Colloquium May 2008Saul Kravitz
 
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...Larry Smarr
 
Evaluation of Pool-Seq as a cost-effective alternative to GWAS
Evaluation of Pool-Seq as a cost-effective alternative to GWASEvaluation of Pool-Seq as a cost-effective alternative to GWAS
Evaluation of Pool-Seq as a cost-effective alternative to GWASAmin Mohamed
 
American Gut Project presentation at Masaryk University
American Gut Project presentation at Masaryk UniversityAmerican Gut Project presentation at Masaryk University
American Gut Project presentation at Masaryk Universitymcdonadt
 
Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global communityExternalEvents
 
Microbial Phylogenomics (EVE161) Class 10-11: Genome Sequencing
Microbial Phylogenomics (EVE161) Class 10-11: Genome SequencingMicrobial Phylogenomics (EVE161) Class 10-11: Genome Sequencing
Microbial Phylogenomics (EVE161) Class 10-11: Genome SequencingJonathan Eisen
 
'Novel technologies to study the resistome'
'Novel technologies to study the resistome''Novel technologies to study the resistome'
'Novel technologies to study the resistome'Willem van Schaik
 
Microbial Phylogenomics (EVE161) Class 7: rRNA PCR and Major Groups
Microbial Phylogenomics (EVE161) Class 7: rRNA PCR and Major Groups Microbial Phylogenomics (EVE161) Class 7: rRNA PCR and Major Groups
Microbial Phylogenomics (EVE161) Class 7: rRNA PCR and Major Groups Jonathan Eisen
 
Studying the microbiome
Studying the microbiomeStudying the microbiome
Studying the microbiomeMick Watson
 
Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequenc...
Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequenc...Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequenc...
Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequenc...Mick Watson
 

What's hot (20)

Microbial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
Microbial Phylogenomics (EVE161) Class 17: Genomes from UnculturedMicrobial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
Microbial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
 
Basics of Genome Assembly
Basics of Genome Assembly Basics of Genome Assembly
Basics of Genome Assembly
 
Big data nebraska
Big data nebraskaBig data nebraska
Big data nebraska
 
[2013.10.29] albertsen genomics metagenomics
[2013.10.29] albertsen genomics metagenomics[2013.10.29] albertsen genomics metagenomics
[2013.10.29] albertsen genomics metagenomics
 
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
 
Microbial Phylogenomics (EVE161) Class 13 - Comparative Genomics
Microbial Phylogenomics (EVE161) Class 13 - Comparative GenomicsMicrobial Phylogenomics (EVE161) Class 13 - Comparative Genomics
Microbial Phylogenomics (EVE161) Class 13 - Comparative Genomics
 
2014 nyu-bio-talk
2014 nyu-bio-talk2014 nyu-bio-talk
2014 nyu-bio-talk
 
CAMERA Presentation at KNAW ICoMM Colloquium May 2008
CAMERA Presentation at KNAW ICoMM Colloquium May 2008CAMERA Presentation at KNAW ICoMM Colloquium May 2008
CAMERA Presentation at KNAW ICoMM Colloquium May 2008
 
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...
 
Evaluation of Pool-Seq as a cost-effective alternative to GWAS
Evaluation of Pool-Seq as a cost-effective alternative to GWASEvaluation of Pool-Seq as a cost-effective alternative to GWAS
Evaluation of Pool-Seq as a cost-effective alternative to GWAS
 
American Gut Project presentation at Masaryk University
American Gut Project presentation at Masaryk UniversityAmerican Gut Project presentation at Masaryk University
American Gut Project presentation at Masaryk University
 
Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global community
 
Microbial Phylogenomics (EVE161) Class 10-11: Genome Sequencing
Microbial Phylogenomics (EVE161) Class 10-11: Genome SequencingMicrobial Phylogenomics (EVE161) Class 10-11: Genome Sequencing
Microbial Phylogenomics (EVE161) Class 10-11: Genome Sequencing
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Testing for Food Authenticity
Testing for Food AuthenticityTesting for Food Authenticity
Testing for Food Authenticity
 
'Novel technologies to study the resistome'
'Novel technologies to study the resistome''Novel technologies to study the resistome'
'Novel technologies to study the resistome'
 
Microbial Phylogenomics (EVE161) Class 7: rRNA PCR and Major Groups
Microbial Phylogenomics (EVE161) Class 7: rRNA PCR and Major Groups Microbial Phylogenomics (EVE161) Class 7: rRNA PCR and Major Groups
Microbial Phylogenomics (EVE161) Class 7: rRNA PCR and Major Groups
 
Studying the microbiome
Studying the microbiomeStudying the microbiome
Studying the microbiome
 
Mason abrf single_cell_2017
Mason abrf single_cell_2017Mason abrf single_cell_2017
Mason abrf single_cell_2017
 
Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequenc...
Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequenc...Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequenc...
Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequenc...
 

Viewers also liked

Manduca
ManducaManduca
Manducanbmro
 
Business in Brazil: An Insider's View, Regulatory and Legal Considerations
Business in Brazil: An Insider's View, Regulatory and Legal ConsiderationsBusiness in Brazil: An Insider's View, Regulatory and Legal Considerations
Business in Brazil: An Insider's View, Regulatory and Legal ConsiderationsKegler Brown Hill + Ritter
 
Cuba's Current Energy Situation, Future Plans + Challenges
Cuba's Current Energy Situation, Future Plans + ChallengesCuba's Current Energy Situation, Future Plans + Challenges
Cuba's Current Energy Situation, Future Plans + ChallengesKegler Brown Hill + Ritter
 
Long term evaluation of IL programme slides
Long term evaluation of IL programme slidesLong term evaluation of IL programme slides
Long term evaluation of IL programme slidesTina Hohmann
 
Nobel Michelle Social Media
Nobel Michelle Social MediaNobel Michelle Social Media
Nobel Michelle Social MediaPiet van Vugt
 
Real Kings Of Logistics
Real Kings Of LogisticsReal Kings Of Logistics
Real Kings Of Logisticsbamadogg
 
Nor Cal Pacific Dimensions Presentation
Nor Cal Pacific Dimensions PresentationNor Cal Pacific Dimensions Presentation
Nor Cal Pacific Dimensions Presentationlmeneley
 
Presentatie De Salesmanagers
Presentatie De SalesmanagersPresentatie De Salesmanagers
Presentatie De Salesmanagersrwarntjes
 
45 lessons in life
45 lessons in life45 lessons in life
45 lessons in lifeDaniel Chua
 
Motoholics Sponsorship Proposal 2010
Motoholics Sponsorship Proposal 2010Motoholics Sponsorship Proposal 2010
Motoholics Sponsorship Proposal 2010Gaurab Dutta
 
2012 erin-crc-nih-seattle
2012 erin-crc-nih-seattle2012 erin-crc-nih-seattle
2012 erin-crc-nih-seattlec.titus.brown
 
GAME TECHNOLOGIES USAGE FOR USERS ATTRACTION TO EDUCATIONAL RESOURCES
GAME TECHNOLOGIES USAGE FOR USERS ATTRACTION TO EDUCATIONAL RESOURCESGAME TECHNOLOGIES USAGE FOR USERS ATTRACTION TO EDUCATIONAL RESOURCES
GAME TECHNOLOGIES USAGE FOR USERS ATTRACTION TO EDUCATIONAL RESOURCESAlexander Lavrov
 
Fokuspunkter ved br10 hvordan skal der bygges
Fokuspunkter ved br10   hvordan skal der byggesFokuspunkter ved br10   hvordan skal der bygges
Fokuspunkter ved br10 hvordan skal der byggesBertel Bolt-Jørgensen
 
Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...
Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...
Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...Circles of San Antonio Community Coalition
 

Viewers also liked (20)

Manduca
ManducaManduca
Manduca
 
Business in Brazil: An Insider's View, Regulatory and Legal Considerations
Business in Brazil: An Insider's View, Regulatory and Legal ConsiderationsBusiness in Brazil: An Insider's View, Regulatory and Legal Considerations
Business in Brazil: An Insider's View, Regulatory and Legal Considerations
 
Cuba's Current Energy Situation, Future Plans + Challenges
Cuba's Current Energy Situation, Future Plans + ChallengesCuba's Current Energy Situation, Future Plans + Challenges
Cuba's Current Energy Situation, Future Plans + Challenges
 
Coalition Orientation to Public
Coalition Orientation to PublicCoalition Orientation to Public
Coalition Orientation to Public
 
Long term evaluation of IL programme slides
Long term evaluation of IL programme slidesLong term evaluation of IL programme slides
Long term evaluation of IL programme slides
 
Nobel Michelle Social Media
Nobel Michelle Social MediaNobel Michelle Social Media
Nobel Michelle Social Media
 
Real Kings Of Logistics
Real Kings Of LogisticsReal Kings Of Logistics
Real Kings Of Logistics
 
Nor Cal Pacific Dimensions Presentation
Nor Cal Pacific Dimensions PresentationNor Cal Pacific Dimensions Presentation
Nor Cal Pacific Dimensions Presentation
 
Presentatie De Salesmanagers
Presentatie De SalesmanagersPresentatie De Salesmanagers
Presentatie De Salesmanagers
 
About BMC
About BMCAbout BMC
About BMC
 
45 lessons in life
45 lessons in life45 lessons in life
45 lessons in life
 
Motoholics Sponsorship Proposal 2010
Motoholics Sponsorship Proposal 2010Motoholics Sponsorship Proposal 2010
Motoholics Sponsorship Proposal 2010
 
2012 erin-crc-nih-seattle
2012 erin-crc-nih-seattle2012 erin-crc-nih-seattle
2012 erin-crc-nih-seattle
 
GAME TECHNOLOGIES USAGE FOR USERS ATTRACTION TO EDUCATIONAL RESOURCES
GAME TECHNOLOGIES USAGE FOR USERS ATTRACTION TO EDUCATIONAL RESOURCESGAME TECHNOLOGIES USAGE FOR USERS ATTRACTION TO EDUCATIONAL RESOURCES
GAME TECHNOLOGIES USAGE FOR USERS ATTRACTION TO EDUCATIONAL RESOURCES
 
2012 stamps-mbl-2
2012 stamps-mbl-22012 stamps-mbl-2
2012 stamps-mbl-2
 
Langkah Membuat Blogspot
Langkah Membuat BlogspotLangkah Membuat Blogspot
Langkah Membuat Blogspot
 
Nursing Skills
Nursing SkillsNursing Skills
Nursing Skills
 
Fokuspunkter ved br10 hvordan skal der bygges
Fokuspunkter ved br10   hvordan skal der byggesFokuspunkter ved br10   hvordan skal der bygges
Fokuspunkter ved br10 hvordan skal der bygges
 
Vipo Vinduer
Vipo VinduerVipo Vinduer
Vipo Vinduer
 
Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...
Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...
Circles of San Antonio Community Coalition and Bexar County DWI Task Force Ho...
 

Similar to 2014 whitney-research

Unilag workshop complex genome analysis
Unilag workshop   complex genome analysisUnilag workshop   complex genome analysis
Unilag workshop complex genome analysisDr. Olusoji Adewumi
 
Nuclear Genomes(Short Answers and questions)
Nuclear Genomes(Short Answers and questions)Nuclear Genomes(Short Answers and questions)
Nuclear Genomes(Short Answers and questions)Zohaib HUSSAIN
 
Ap Chapter 21
Ap Chapter 21Ap Chapter 21
Ap Chapter 21smithbio
 
Genomics Technologies
Genomics TechnologiesGenomics Technologies
Genomics TechnologiesSean Davis
 
GENOME_STRUCTURE1.ppt
GENOME_STRUCTURE1.pptGENOME_STRUCTURE1.ppt
GENOME_STRUCTURE1.pptsherylbadayos
 
Clase 2 - Genoma Humano proyecto conicet.pdf
Clase 2 - Genoma Humano proyecto conicet.pdfClase 2 - Genoma Humano proyecto conicet.pdf
Clase 2 - Genoma Humano proyecto conicet.pdfNoraCRuizGuevara
 
Bio305 genome analysis and annotation 2012
Bio305 genome analysis and annotation 2012Bio305 genome analysis and annotation 2012
Bio305 genome analysis and annotation 2012Mark Pallen
 
Molecular systematics.pdf
Molecular systematics.pdfMolecular systematics.pdf
Molecular systematics.pdfAartisoni17
 
Transcriptomics and metabolomics
Transcriptomics and metabolomicsTranscriptomics and metabolomics
Transcriptomics and metabolomicsSukhjinder Singh
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorialc.titus.brown
 
1_7_genome_1.ppt
1_7_genome_1.ppt1_7_genome_1.ppt
1_7_genome_1.pptOmerBushra4
 

Similar to 2014 whitney-research (20)

Unilag workshop complex genome analysis
Unilag workshop   complex genome analysisUnilag workshop   complex genome analysis
Unilag workshop complex genome analysis
 
Nuclear Genomes(Short Answers and questions)
Nuclear Genomes(Short Answers and questions)Nuclear Genomes(Short Answers and questions)
Nuclear Genomes(Short Answers and questions)
 
2014 naples
2014 naples2014 naples
2014 naples
 
2014 ucl
2014 ucl2014 ucl
2014 ucl
 
2014 villefranche
2014 villefranche2014 villefranche
2014 villefranche
 
Ap Chapter 21
Ap Chapter 21Ap Chapter 21
Ap Chapter 21
 
Genomics Technologies
Genomics TechnologiesGenomics Technologies
Genomics Technologies
 
CROP GENOME SEQUENCING
CROP GENOME SEQUENCINGCROP GENOME SEQUENCING
CROP GENOME SEQUENCING
 
2012 oslo-talk
2012 oslo-talk2012 oslo-talk
2012 oslo-talk
 
THE human genome
THE human genomeTHE human genome
THE human genome
 
GENOME_STRUCTURE1.ppt
GENOME_STRUCTURE1.pptGENOME_STRUCTURE1.ppt
GENOME_STRUCTURE1.ppt
 
Bliss
BlissBliss
Bliss
 
Clase 2 - Genoma Humano proyecto conicet.pdf
Clase 2 - Genoma Humano proyecto conicet.pdfClase 2 - Genoma Humano proyecto conicet.pdf
Clase 2 - Genoma Humano proyecto conicet.pdf
 
2013 duke-talk
2013 duke-talk2013 duke-talk
2013 duke-talk
 
Genome structure
Genome structure Genome structure
Genome structure
 
Bio305 genome analysis and annotation 2012
Bio305 genome analysis and annotation 2012Bio305 genome analysis and annotation 2012
Bio305 genome analysis and annotation 2012
 
Molecular systematics.pdf
Molecular systematics.pdfMolecular systematics.pdf
Molecular systematics.pdf
 
Transcriptomics and metabolomics
Transcriptomics and metabolomicsTranscriptomics and metabolomics
Transcriptomics and metabolomics
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
1_7_genome_1.ppt
1_7_genome_1.ppt1_7_genome_1.ppt
1_7_genome_1.ppt
 

More from c.titus.brown

More from c.titus.brown (20)

2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynote
 
2015 msu-code-review
2015 msu-code-review2015 msu-code-review
2015 msu-code-review
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
2015 mcgill-talk
2015 mcgill-talk2015 mcgill-talk
2015 mcgill-talk
 
2015 pycon-talk
2015 pycon-talk2015 pycon-talk
2015 pycon-talk
 
2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcast
 
2015 vancouver-vanbug
2015 vancouver-vanbug2015 vancouver-vanbug
2015 vancouver-vanbug
 
2015 osu-metagenome
2015 osu-metagenome2015 osu-metagenome
2015 osu-metagenome
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenome
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
 
2015 pag-chicken
2015 pag-chicken2015 pag-chicken
2015 pag-chicken
 
2015 pag-metagenome
2015 pag-metagenome2015 pag-metagenome
2015 pag-metagenome
 
2014 bangkok-talk
2014 bangkok-talk2014 bangkok-talk
2014 bangkok-talk
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streaming
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 

2014 whitney-research

  • 1. Like the dog that caught the bus: now what? Sequencing, Big Data, and Biology C. Titus Brown Assistant Professor MMG, CSE, BEACON Michigan State University Feb 2014 ctb@msu.edu
  • 2. The challenges of non-model transcriptomics - Missing or low quality genome reference. - Evolutionarily distant. - Most extant computational tools focus on model organisms (assume low polymorphism, i.e. little internal variation; assume reference genome; assume somewhat reliable functional annotation; more significant compute infrastructure) …and cannot easily or directly be used on critters of interest.
  • 3. Isoform analysis – some easy…
  • 4. Isoform analysis – some hard Counting methods mostly rely on presence of unique sequence to which to map.
  • 5. Types of Alternative Splicing 40% 25% <5%, more in plants, fungi, protozoa Keren H, Lev-Maor G & Ast G, Nat Rev Genet 2010
  • 7. Genome-reference-free assembly leads to many isoforms. Massive redundancy!
  • 8. Gene models can be “collapsed” given genomic sequence.
  • 9. In sum: - mRNAseq is pretty easy to deal with if you have a good genomic sequence. - We don't have a good genomic sequence for many organisms, including lamprey. - We need to do de novo assembly to construct a transcriptome from short reads. - We also have lots and lots of mRNAseq sequence:
  • 10. The problem of lamprey… - Diverged at base of vertebrates; evolutionarily distant from model organisms. - Large, complicated genome (~2 Gb) - Relatively little existing sequence. - We sequenced the liver genome…
  • 11. Sea lamprey in the Great Lakes - Non-native - Parasite of medium to large fishes - Caused populations of host fishes to crash Li Lab / Y-W C-D
  • 12. The problem of lamprey… - Diverged at base of vertebrates; evolutionarily distant from model organisms. - Large, complicated genome (~2 Gb) - Relatively little existing sequence. - We sequenced the liver genome…
  • 13. Lamprey has incomplete genomic sequence. Evidence of somatic recombination; 100s of Mb of sequence eliminated from genome during development. More recent evidence (unpub, J. Smith et al.) suggests that this loss is developmentally regulated, results in changes in gene expression (due to loss of genes!), and is tissue specific. Liver genome is not the entire genome. J. Smith et al., PNAS 2009
  • 14. Lamprey tissues for which we have mRNAseq embryo stages (late blastula, gastrula, neurula, 22b, neural-crest migration, 24c1, 24c2) metamorphosis 3 (intestine, kidney) ovulatory female head skin preovulatory female eye adult intestine metamorphosis 4 (intestine, kidney) preovulatory female tail skin adult kidney metamorphosis 5 (liver, intestine, kidney) brain paired metamorphosis 6 (intestine, kidney) prespermiating male gill freshwater (gill, intestine, kidney) metamorphosis 7 (intestine, kidney) mature adult male rope tissue larval (gill, kidney, liver, intestine) monocytes juvenile (intestine, liver, kidney) brain (0,3,21 dpi) lips spinal cord (0,3,21 dpi) metamorphosis 1 (intestine, kidney) metamorphosis 2 (liver, intestine, spermiating male muscle spermiating male gill spermiating male head skin supraneural tissue small parasite distal intestine, kidney, proximal intestine salt water (gill, intestine)
  • 15. Assembly It was the best of times, it was the wor , it was the worst of times, it was the isdom, it was the age of foolishness mes, it was the age of wisdom, it was th It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness …but for lots and lots of fragments!
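The assembly idea on this slide, overlapping fragments merged back into the original sentence, can be sketched with a toy greedy overlap merger. This is an illustration only: the fragment strings are lightly trimmed versions of the ones on the slide, `overlap` and `greedy_assemble` are hypothetical helper names, and real short-read assemblers work on k-mer graphs rather than pairwise string overlaps.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing overlaps any more; leave separate contigs
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Toy "reads": fragments of the sentence, as on the slide (assumed spelling).
reads = [
    "It was the best of times, it was the wor",
    ", it was the worst of times, it was the",
    "mes, it was the age of wisdom, it was th",
    "isdom, it was the age of foolishness",
]
contigs = greedy_assemble(reads)
print(contigs)
```

Note that the repeated phrase "it was the" creates a short false overlap between two of the fragments; the greedy rule happens to avoid it here because the true overlaps are longer, which is not guaranteed on real data.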
  • 16. Shared low-level transcripts may not reach the threshold for assembly.
  • 17. Two problems: - We have a massive amount of data that challenges existing computers, and we want to assemble it all together. - We need to construct transcript families (to collapse isoforms) without having a solid reference genome.
  • 18. Solution 1: Digital normalization (a computational version of library normalization) Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory. We can discard it for you…
  • 25. Digital normalization approach. A digital analog to cDNA library normalization, diginorm: - Is single pass: looks at each read only once; - Does not “collect” the majority of errors; - Keeps all low-coverage reads; - Smooths out coverage of regions. => Enables analyses that are otherwise completely impossible.
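The single-pass behaviour listed on this slide can be sketched as follows. This is a minimal sketch, assuming the median k-mer count rule: a read is kept only if the median count of its k-mers, among reads kept so far, is below a coverage cutoff. The function names, the tiny `k`, the cutoff value, and the example reads are all illustrative assumptions, and an exact `Counter` stands in for the memory-efficient counting structure a real implementation would need.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """All overlapping substrings of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalization(reads, k=4, cutoff=3):
    """Single pass: keep a read only while its estimated coverage is low."""
    counts = Counter()
    kept = []
    for read in reads:
        kms = kmers(read, k)
        if not kms:
            continue  # read shorter than k; nothing to estimate from
        # Median k-mer count so far serves as the coverage estimate.
        if median(counts[km] for km in kms) < cutoff:
            kept.append(read)
            counts.update(kms)  # only kept reads contribute to the counts
    return kept


high = "ACGTTGCAAG"   # hypothetical highly expressed transcript, seen 10 times
low = "TTGGCCAATC"    # hypothetical rare transcript, seen once
kept = digital_normalization([high] * 10 + [low], k=4, cutoff=3)
print(len(kept), "reads kept out of 11")
```

In this toy input the abundant read is retained only until its coverage reaches the cutoff, while the rare read is always retained, which is the "keeps all low-coverage reads" property.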
  • 26. Solution 2: Partitioning transcripts into “transcript families” Transcript family Pell et al., 2012, PNAS
  • 27. Transcriptome results - lamprey - Started with 5.1 billion reads from 50 different tissues. (4 years of computational research, and about 1 month of compute time, GO HERE) Ended with:
  • 28. Lamprey transcriptome basic stats - 616,000 transcripts (!) - 263,000 transcript families (!) (This seems like a lot.)
  • 29. Lamprey transcriptome basic stats - 616,000 transcripts - 263,000 transcript families - Only 20,436 transcript families have transcripts > 1kb (compare with mouse: 17,331 of 29,769 genes are > 1kb) So, a rule-of-thumb estimate is not that far off for long transcripts.
  • 30. Validation - Assume computers lie. How do we judge precision & recall? 1) Homology! Do we see sequence similarity to e.g. mouse sequences? 2) Orthogonal data sets and analyses. For example, look at sperm genome, or independently cloned CDS.
  • 31. Evolution: mouse - 58,000 lamprey transcript families have some matches to mouse. - 10,000 putative orthologs (reciprocal best hits) So that's a pretty good sign. (expecting about 30k total genes) Conclusion: These numbers “feel” good to me; hard to know what to expect after ~350-500 mya.
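The "reciprocal best hits" criterion used for putative orthologs can be sketched like this. It is a hedged illustration, not the analysis behind the slide: the score tables, sequence names, and function names are made up, and in practice the scores would come from a similarity search run in both directions.

```python
def best_hits(scores):
    """Map each query to its single highest-scoring subject."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical similarity scores, keyed by (query, subject).
lamprey_vs_mouse = {
    ("lampA", "mouse1"): 250.0,
    ("lampA", "mouse2"): 90.0,
    ("lampB", "mouse2"): 120.0,
    ("lampC", "mouse1"): 200.0,
}
mouse_vs_lamprey = {
    ("mouse1", "lampA"): 245.0,
    ("mouse1", "lampC"): 190.0,
    ("mouse2", "lampB"): 118.0,
    ("mouse3", "lampB"): 60.0,
}
pairs = reciprocal_best_hits(lamprey_vs_mouse, mouse_vs_lamprey)
print(pairs)
```

Here `lampC` has a match to mouse but is not a reciprocal best hit, which mirrors the gap on the slide between families with "some matches" and putative orthologs.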
  • 32. Orthogonal data set: pm2 (liver genome) - 64% of our new transcript families have a match in pm2. - 71% of conserved transcript families have a match in pm2. - 83% of long transcripts have a match in pm2. Good: we don't expect 100%, because we know pm2 is probably missing stuff. So that means: Conclusion: At least 64% of transcript families are “really lamprey” (and > 83% of the long transcripts!)
  • 33. Orthogonal data set: sperm genome - 94.2% of ref-based transcripts have a match in sperm genome. - 98.2% of full-length cDNAs have a match in sperm genome. So sperm genome is “pretty good” for cross validation. But only - 71% of our new transcript families have a match in sperm genome. ??
  • 34. Orthogonal data set: sperm genome - 94.2% of ref-based transcripts have a match in sperm genome. - 98.2% of full-length cDNAs have a match in sperm genome. New transcriptome: - 71% of transcript families have a match in sperm genome. - 92% (!!) of long transcript families have a match in sperm genome. (Since the sperm genome is low coverage, this length dependence makes sense – the longer the
  • 35. Orthogonal data set: sperm genome - 94.2% of ref-based transcripts have a match in sperm genome. - 98.2% of full-length cDNAs have a match in sperm genome. New transcriptome: - 71% of new transcriptome families have a match in sperm genome. - 92% (!!) of long transcript families have a match in sperm genome. Conclusion: Our is poorer than but comparable
  • 36. Orthogonal data set: full-length cDNAs - We can look at both precision and recall by asking: - Are known sequences represented completely by a single transcript? (“best match”) - Are known sequences covered by one or more transcripts? (“total matches”) 70% 90%
  • 37. Best matches – not great.
  • 38. Total matches – better!
  • 39. Ref-based (lamp0) “best” are better than new assembly (lamp3)
  • 40. lamp3 “total” is better than lamp0
  • 41. Conclusions from full-length cDNA - Ref-based data set has longer “best matches” (better precision; less fragmented) - De novo assembly is more sensitive overall (better recall; contains more real sequences)
  • 42. Mapping percentages (with orthogonal data) Ona Bloom generated more data; how much maps? Ref-based New/all New/long BR SC 29.20% 42.94% 100.00% 100.00% 45.99% 46.89% Conclusion: Ref-based is considerably less “complete” than new, de novo transcriptome assembly.
  • 43. Lamprey transcriptome conclusions - A substantial portion of the new transcriptome seems “good”: - 58k transcript families with mouse homology, 10k orthologs; - 20k transcript families with transcripts > 1kb. - Good matches to liver genome & sperm genome. - Reasonable numbers ~mouse. - Much (!) better than ref-based for mapping. (2x as good) But! - Poor recall of known full-length cDNA !? - 240k partitions with only small sequences !? => microbial contamination?
  • 44. Separate question: how much of the pm2 genome is missing?? - 64% of lamp3 transcript families match to pm2. - 82.5% of long transcript families match to pm2. - 71% of lamp3 transcript families conserved with mouse match in pm2. Conclusion I: Probably about 30% of genic sequence is missing.
  • 45. Separate question: how much of the pm2 genome is missing?? - 64% of lamp3 transcript families match to pm2. - 82.5% of long transcript families match to pm2. - 71% of lamp3 transcript families conserved with mouse match in pm2. - 22.5% of sperm genome contigs have no hits in pm2. Conclusion II (firmer): About 30% of single-copy sequence is missing.
  • 46. CEGMA based completeness estimates (core eukaryotic genes). Columns: number of seqs; completeness, 100% matches; completeness, partial matches. lamp3 entire: 620k; 70.6; 96.4. lamp3 all ORFs > 80aa: 269k; 46.4; 89. lamp3 longest ORF in tr: 80k; 41.1; 77.8. lamp0: 11k; 44.7; 62.5. Camille Scott
  • 47. Looking at the Molgula… Putnam et al., 2008, Nature. Modified from Swalla 2001
  • 48. What do these animals look like? Molgula oculata Molgula oculata Molgula occulta Ciona intestinalis
  • 49. Tail loss and notochord genes a) M. oculata b) hybrid (occulta egg x oculata sperm) c) M. occulta Notochord cells in orange Swalla, B. et al. Science, Vol 274, Issue 5290, 1205-1208 , 15 November 1996
  • 50. Diginorm applied to Molgula embryonic mRNAseq (sample: no. of reads; reads kept; percentage kept)
      M. occulta F+3: 42,174,510; 15,642,268; ?
      M. occulta F+3: 50,018,302; 6,012,894; ?
      M. occulta F+4: 44,948,983; 3,499,935; ?
      M. occulta F+5: 53,692,296; 2,993,715; ?
      M. occulta F+6: 45,782,981; 2,774,342; ?
      M. occulta Total: 236,617,072; 30,923,154; 13%
      M. oculata F+3: 47,045,433; 10,754,899; ?
      M. oculata F+4: 52,890,938; 3,949,489; ?
      M. oculata F+6: 50,156,895; 2,874,196; ?
      M. oculata Total: 150,093,266; 17,578,584; 11.70%
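The "percentage kept" column on this slide is simply reads kept divided by reads sequenced. The sketch below recomputes it from the read counts given on the slide; the variable and function names are illustrative assumptions, and only the counts themselves come from the slide.

```python
def percent_kept(total_reads, kept_reads):
    """Share of reads retained after normalization, in percent."""
    return 100.0 * kept_reads / total_reads


def summarize(samples):
    """Sum (total, kept) pairs and return totals plus percentage kept."""
    total = sum(t for t, _ in samples)
    kept = sum(k for _, k in samples)
    return total, kept, percent_kept(total, kept)


# (reads sequenced, reads kept) per sample, in the order listed on the slide.
occulta = [
    (42_174_510, 15_642_268),
    (50_018_302, 6_012_894),
    (44_948_983, 3_499_935),
    (53_692_296, 2_993_715),
    (45_782_981, 2_774_342),
]
oculata = [
    (47_045_433, 10_754_899),
    (52_890_938, 3_949_489),
    (50_156_895, 2_874_196),
]

for name, samples in (("M. occulta", occulta), ("M. oculata", oculata)):
    total, kept, pct = summarize(samples)
    print(f"{name}: {kept:,} of {total:,} reads kept ({pct:.1f}%)")
```

The summed counts reproduce the totals on the slide, and the resulting percentages agree with the roughly 13% and 11.7% reported there; the same `percent_kept` call fills in the per-sample entries shown as "?".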
  • 51. Question: does normalization “lose” transcript information? [Venn diagrams: diginorm vs. raw assembly, reciprocal best hit vs. Ciona, Blast e-value cutoff: 1e-6] M. occulta: 37 and 17 C. intestinalis genes in only one of the two assemblies, 13623 C. intestinalis, 2446 missing. M. oculata: 64 and 15 in only one of the two assemblies, 13646 C. intestinalis, 2398 missing. Elijah Lowe
  • 52. Transcriptome assembly thoughts - We can (now) assemble really big data sets, and get pretty good results. - We have lots of evidence (some presented here :) that some assemblies are not strongly affected by digital normalization.
  • 53. Practical implications of diginorm - Data is (essentially) free; - For some problems, analysis is now cheaper than data gathering (i.e. essentially free); - …plus, we can run most of our approaches in the cloud.
  • 54. 1. khmer-protocols - Effort to provide standard “cheap” assembly protocols for the cloud. - Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set. - Open, versioned, forkable, citable. [Pipeline diagram: Read cleaning, Diginorm, Assembly, Annotation, RSEM differential expression]
  • 55. CC0; BSD; on github; in reStructuredText.
  • 56. 2. Data availability is important for annotating distant sequences no similarity Anything else Mollusc Cephalopod
  • 57. Can we incentivize data sharing? - ~$100-$150/transcriptome in the cloud - Offer to analyze people's existing data for free, IFF they open it up within a year. See: CephSeq white paper; “Dead Sea Scrolls & Open Marine Transcriptome Project” blog post.
  • 58. First results: Loligo genomic/transcriptome resources. Putting other people's sequences where my mouth is:
  • 59. Tools to routinely update metazoan orthology/homology relationships - > 100 mRNAseq data sets already; - Build interconnections between them via homology; - Build tools to update interconnections as new data sets arrive. - Provide raw data, processed data, underlying tools, simple Web interface, all CC0/in da cloud/open/reproducible. (Question: what biology problems could we tackle?)
  • 60. “Research singularity” The data a researcher generates in their lab constitutes an increasingly small component of the data used to reach a conclusion. Corollary: The true value of the data an individual investigator generates should be considered in the context of aggregate data. Even if we overcome the social barriers and incentivize sharing, we are, needless to say, not remotely prepared for sharing all the data.
  • 61.
  • 62. We practice open science! Everything discussed here: - Code: github.com/ged-lab/ ; BSD license - Blog: http://ivory.idyll.org/blog ('titus brown blog') - Twitter: @ctitusbrown - Grants on Lab Web site: http://ged.msu.edu/research.html - Preprints: on arXiv, q-bio: 'diginorm arxiv'
  • 63. Acknowledgements. Lab members involved: Adina Howe (w/Tiedje), Jason Pell, Arend Hintze, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald, Camille Scott, Jordan Fish, Michael Crusoe, Leigh Sheneman. Collaborators: Josh Rosenthal (UPR); Weiming Li, MSU; Ona Bloom (Feinstein), Jen Morgan (MBL), Joe Buxbaum (MSSM). Funding: USDA NIFA; NSF IOS; NIH; BEACON.

Editor's Notes

  1. Transcripts are then mapped back to the chicken genome. Because the transcripts are mature mRNA, only exons will map to the genome. The solid boxes represent exons. As shown in this figure, different isoforms are detected using different parameter settings. The explanation for this phenomenon is unknown. The goal of this step is to define all exons in each gene.
  2. Larvae/stream bottoms 3-6 years; parasitic adult -> Great Lakes, 12-20 months feeding. 5-8 years. 40 lbs of fish per life as parasite. 98% of fish in Great Lakes went away!
  3. XXX
  4. Marine invertebrates, chordata phylum, notochord, hollow dorsal nerve cord, pharyngeal slits and a post anal tail at some point in life. colonial, filter feeders
  5. Notochord cells present, do not intercalate or extend