A Wager for 2016: How
Software Will Beat Hardware
in Biological Data Analysis
C. Titus Brown
Associate Professor
PHR, School of Veterinary Medicine, UC Davis
This talk on slideshare: slideshare.net/c.titus.brown/
This talk idea started with an argument on the
Internet.
xkcd.com/386/ - “Duty Calls”
https://twitter.com/ctitusbrown/status/535191544119451648
The obligatory slide about abundant
sequencing data.
http://www.genome.gov/sequencingcosts/
Also see: https://biomickwatson.wordpress.com/2015/03/25/the-cost-of-sequencing-is-still-going-down/
Big Sequencing Data and Biology
1) Listen to the physicists: “look, we know how
to analyze data from CERN and Sloan Digital
Sky Survey. Just do what we did.”
2) Listen to the Silicon Valley folk: “Hadoop,
and Spark, dude. Just map-reduce it.”
3) Develop custom approaches.
Shotgun sequencing
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of
foolishness
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
Resequencing analysis
We know a reference genome (specific edition), and
want to find variants (differences - blue) in a
background of errors (red)
The scale of the problem (1)
Lots of data per “book”
• A human genome contains approximately 6
billion bases of DNA.
• Covering the entire genome using random
sampling requires ~150 billion bases of
sequencing
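The ~150 billion figure is just genome size times target coverage; a minimal sketch of the arithmetic, assuming a ~25x average-coverage target (the 25x is an illustrative rule of thumb, not a number from this talk):

```python
# Back-of-the-envelope: total bases of sequencing for a target coverage.
def sequencing_needed(genome_size_bp, target_coverage):
    # Average coverage = total sequenced bases / genome size,
    # so total bases = genome size * coverage.
    return genome_size_bp * target_coverage

genome = 6_000_000_000                  # human genome, ~6 billion bases
print(sequencing_needed(genome, 25))    # 150_000_000_000, i.e. ~150 Gbp
```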
The scale of the problem (2)
Many “editions” in e.g. cancer
If you want to look at 1000 individual tumor
cells and build an evolutionary history of
changes, you need 150 Gbp per cell: 150 Tbp.
The scale of the problem (3)
Many sequencers, many analyses.
• 10,000 sequencers worldwide (?)
• Worldwide sequencing capacity ??, but
~300,000 human genomes in 2014…
• Many research groups, each with own
question(s) - ~1m data sets each year?
• Cheap! ~$10-20k for a 100 Gbp data set.
Resequencing analysis
We know a reference genome (specific edition), and
want to find variants (differences - blue) in a
background of errors (red)
Mapping: locate reads in reference
(pass 1)
http://en.wikipedia.org/wiki/File:Mapping_Reads.png
Variant detection after mapping
(pass 2, 3, and 4)
http://www.kenkraaijeveld.nl/genomics/bioinformatics/
The current variant calling approach:
Map reads => Convert to binary => Sort binary format by
genome pos'n => "Pile up" and call variants => Extract reads
for tricky bits => Realign/assemble (optional)
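As a toy illustration of the "pile up and call variants" step only — this is not any real caller's algorithm, and the function name, depth, and fraction thresholds are invented for the sketch:

```python
from collections import Counter

def pileup_call(reference, mapped_reads, min_depth=3, min_frac=0.8):
    """Toy pileup caller: tally the bases from reads overlapping each
    reference position, and report a variant where a non-reference
    base dominates at sufficient depth."""
    columns = [Counter() for _ in reference]
    for pos, seq in mapped_reads:          # (0-based start, read sequence)
        for i, base in enumerate(seq):
            if pos + i < len(reference):
                columns[pos + i][base] += 1

    variants = []
    for i, col in enumerate(columns):
        depth = sum(col.values())
        if depth < min_depth:
            continue                       # too shallow to call
        base, count = col.most_common(1)[0]
        if base != reference[i] and count / depth >= min_frac:
            variants.append((i, reference[i], base))
    return variants

ref = "ACGTACGT"
reads = [(0, "ACGAACGT"), (0, "ACGAACGT"), (2, "GAACGT")]
print(pileup_call(ref, reads))   # [(3, 'T', 'A')]
```

Note that this works only after the reads are mapped and sorted — which is exactly why the conventional pipeline needs those earlier passes over the data.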
Current approach: pros and cons
Pros:
• Modular and flexible.
• Open source! Well supported! Mature!
• Some of it parallelizes easily!
Cons:
• 4+ passes across the data
• Very I/O intensive (hence unsuitable for cloud).
Some numbers:
• 1,000 single cells from a tumor ~ 150 Tbp of data.
• HiSeq X10 can do the sequencing in ~3 weeks.
• The variant calling requires ~2,000 CPU weeks…
• …so, given ~2,000 computers, can do this all in
one month.
…but, multiply problem by # of possible patients...
Big Sequencing Data and Biology
1) Listen to the physicists: “look, we know how to
analyze data from CERN and Sloan Digital Sky
Survey. Just do what we did.”
2) Listen to the Silicon Valley folk: “Hadoop, and Spark,
dude. Just map-reduce it.”
3) Develop better custom approaches, swiping ideas
from Silicon Valley and physicists as needed.
So, back to the Internet argument:
it ended with a bet.
In two years (Nov 2016), my 9-year-old daughter
will be able to analyze a full human genome
sequence on her desktop computer.
https://twitter.com/ctitusbrown/status/535191544119451648
“Never compete unless you have an unfair advantage.”
1. My daughter is awesome.
2. We know how to do it
already*
(* some assembly required)
3. Heng Li just posted a
preprint yesterday!
“FermiKit”, http://arxiv.org/abs/1504.06574
Remainder of talk – outline.
1. “Data” vs “information”
2. Streaming approaches to lossy compression
and building compressible graphs for soil
metagenomics.
3. Sequencing errors and variants using graphs.
Conway T C, Bromage A J. Bioinformatics 2011;27:479-486
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions,
please email: journals.permissions@oup.com
De Bruijn graphs (sequencing graphs) scale with
data size, not information size.
Why do sequence graphs scale
badly?
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
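A quick way to see this scaling: count distinct k-mers in simulated reads with and without sequencing errors. This is a toy simulation (tiny genome, exact counting), but it shows why graph memory tracks errors, and therefore data size, rather than genome size:

```python
import random

def distinct_kmers(reads, k=8):
    """Number of graph nodes a de Bruijn graph of these reads would have."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return len(kmers)

def add_errors(read, rate=0.01):
    """Substitute a random base at the given per-base error rate."""
    return "".join(random.choice("ACGT") if random.random() < rate else b
                   for b in read)

random.seed(1)
genome = "".join(random.choice("ACGT") for _ in range(1000))

# ~100x coverage of error-free reads: graph size is bounded by the genome.
clean = [genome[i:i + 100] for i in range(0, 901, 10)] * 10
# Same reads at a 1% error rate: each error spawns up to k novel k-mers.
noisy = [add_errors(r) for r in clean]

print(distinct_kmers(clean), distinct_kmers(noisy))
```

The clean graph stays near the genome's own k-mer count no matter how deep the coverage; the noisy graph keeps growing with every additional (erroneous) read.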
Practical memory measurements
Velvet measurements (Adina Howe)
Our solution: lossy compression
Conway T C, Bromage A J. Bioinformatics 2011;27:479-486
Shotgun sequencing and coverage
“Coverage” is simply the average number of reads that overlap
each true base in genome.
Here, the coverage is ~10 – just draw a line straight down from the top
through all of the reads.
Random sampling => deep sampling needed
Typically 10-100x needed for robust recovery (30-300 Gbp for human)
Actual coverage varies widely from the average.
Low coverage introduces unavoidable breaks.
But! Shotgun sequencing is very redundant!
Lots of the high coverage simply isn’t needed.
(unnecessary data)
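Why random sampling forces deep sequencing: under the standard Poisson model of shotgun coverage (Lander-Waterman), the expected fraction of bases left uncovered at mean coverage C is e^(-C). A one-line sketch:

```python
import math

def fraction_uncovered(mean_coverage):
    """Poisson model of random shotgun sampling: P(coverage = 0) at a
    base is e^(-C) when the mean coverage is C (Lander-Waterman)."""
    return math.exp(-mean_coverage)

for c in (1, 5, 10, 30):
    print(c, fraction_uncovered(c))
# At 1x, ~37% of bases are still uncovered; at 10x, only a few
# per 100,000 remain -- which is why 10-100x is the usual target.
```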
Digital normalization
Graph size now scales with information content.
Most samples can be reconstructed via de
novo assembly on commodity computers.
Diginorm ~ “lossy compression”
Nearly perfect from an information theoretic
perspective:
– Discards 95% or more of the data for genomes.
– Loses < 0.02% of the information.
This changes the way analyses
scale.
Conway T C , Bromage A J Bioinformatics 2011;27:479-486
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions,
please email: journals.permissions@oup.com
Streaming lossy compression:
for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        yield read
This is literally a three-line algorithm. Not kidding.
It took four years to figure out which three lines, though…
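A runnable version of those three lines might look like the sketch below, estimating a read's coverage as the median abundance of its k-mers among the reads kept so far. Real implementations (e.g. khmer) use a probabilistic counting structure rather than an exact dictionary, and the k and CUTOFF values here are illustrative:

```python
from collections import defaultdict

def digital_normalization(reads, k=8, cutoff=5):
    """Keep a read only if its estimated coverage so far is below
    cutoff; then add its k-mers to the running counts."""
    counts = defaultdict(int)

    def estimated_coverage(read):
        abundances = sorted(counts[read[i:i + k]]
                            for i in range(len(read) - k + 1))
        return abundances[len(abundances) // 2]   # median k-mer abundance

    for read in reads:
        if len(read) < k:
            continue
        if estimated_coverage(read) < cutoff:
            for i in range(len(read) - k + 1):
                counts[read[i:i + k]] += 1
            yield read

# 20 copies of one read, plus one novel read:
reads = ["ACGTACGTACGT"] * 20 + ["TTTTGGGGCCCC"]
kept = list(digital_normalization(reads, k=8, cutoff=5))
print(len(kept))   # 6: five copies of the abundant read + the novel one
```

Because the counts only ever grow, this is a single streaming pass: redundant reads are discarded as they arrive, and rare (novel) reads are always retained.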
Diginorm can detect information
saturation in a stream.
Zhang et al., submitted.
This generically permits semi-streaming
analytical approaches.
Zhang et al., submitted.
e.g. E. coli analysis => ~1.2 pass, sublinear
memory
Zhang et al., submitted.
Another simple algorithm.
Zhang et al., submitted.
Single pass, reference free, tunable, streaming online
variant calling.
Error detection => variant calling
Real time / streaming data
analysis.
Raw data (real time, from sequencer?) => Error trimming =>
Variant calling => De novo assembly
Stream all the things!
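One way to "stream all the things" is to compose each stage as a generator, so every read flows through the whole pipeline in a single pass over the data. The stage bodies below are placeholders invented for the sketch, not the real trimming or calling logic:

```python
def error_trim(reads):
    """Placeholder trimming stage: truncate a read at its first 'N'.
    (A real stage would use k-mer abundance to spot likely errors.)"""
    for read in reads:
        yield read.split("N")[0]

def drop_short(reads, min_len=4):
    """Placeholder filter stage: discard reads that became too short."""
    for read in reads:
        if len(read) >= min_len:
            yield read

# Generator stages compose into a single-pass streaming pipeline;
# nothing is buffered and the input is never re-read.
raw = ["ACGTNNNN", "ACGTACGT", "AN"]
pipeline = drop_short(error_trim(raw))
print(list(pipeline))   # ['ACGT', 'ACGTACGT']
```

The same composition pattern extends to the later stages (variant calling, assembly): each one consumes the previous generator, so the whole chain stays low-memory and can run as data arrives from the sequencer.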
This code works.
Preliminary benchmarks -
• Can do variant calling on E. coli in about 5
minutes, in 40 MB of RAM, with a single
thread, with no optimization.
• Scaling to human should be readily feasible.
• …I have another 18 months before I lose the
bet.
My real point -
• We need well-founded, flexible, algorithmically efficient,
high-performance components for sequence data
manipulation in biology.
• We are building these on top of a streaming and low
memory paradigm.
• We are building out a scripting library for composing
these operations.
Scaling compute, or algorithms?
There are some problems that require big computers &
many processors.
Genomic data analysis shouldn’t be one of them, based
on information content alone!
(This is probably good, given the scale of the need.)
Many other biological problems do require big compute,
however.
Reminder: the real challenge is
understanding
We have gotten distracted by shiny toys: sequencing!!
Data!!
Data is now plentiful! But:
We typically have no knowledge of what > 50% of an
e.g. environmental metagenome “means”,
functionally.
http://ivory.idyll.org/blog/2014-function-of-unknown-genes.html
I was going to give you my 5 year
vision…
…but I don’t have 20/20 eyesight.
(20/20? 2020? 2015 + 5?)
(My wife has asked that I apologize for this
joke.)
Via @adrianholovaty
Data integration as a next
challenge
In 5-10 years, we will have nigh-infinite data.
(Genomic, transcriptomic, proteomic, metabolomic,
…?)
How do we explore these data sets?
Registration, cross-validation, integration with
models…
Carbon cycling in the ocean -
“DeepDOM” cruise, Kujawinski & Longnecker et al.
Integrating many different data types to
build understanding.
Figure 2. Summary of challenges associated with the data integration in the proposed project.
“DeepDOM” cruise: examination of dissolved organic matter & microbial
metabolism vs physical parameters – potential collab.
Data/analysis lifecycle
A few thoughts on practical next
steps.
• Enable scientists with better tools.
• Train a bioinformatics “middle class.”
• Accelerate science via the open science “network
effect”.
That is… what do we do now?
Once you have all this data, what do you do?
"Business as usual simply cannot work.”
- David Haussler, 2014
Looking at millions to billions of (human) genomes in
the next 5-10 years.
Enabling scientists with better tools -
Build robust, flexible computational
frameworks for data exploration, and make
them open and remixable.
Develop theory, algorithms, & software
together, and train people in their use.
(Stop pretending that we can develop “black
boxes” that will give you the right answer.)
Education and training - towards a
bioinformatics “middle class”
Biology is underprepared for data-intensive investigation.
We must teach and train the next generations.
=> Build a cohort of “data intensive biologists” who can use
data and tools as an intrinsic and unremarkable part of their
research.
~10-20 workshops / year, novice -> masterclass; open
materials.
dib-training.rtfd.org/
Can open science trigger a
“network effect”?
http://prasoondiwakar.com/wordpress/trivia/the-network-effect
So: can we drive data sharing via a decentralized
model, e.g. a distributed graph database?
[Architecture diagram: raw data sets flow into public servers, a
"walled garden" server, and a private server, via upload/submit
(NCBI, KBase) and import (MG-RAST, SRA, EBI) paths; a graph query
layer spans all of them, backed by a compute server (Galaxy?
Arvados?) exposing a web interface + API over data/info.]
ivory.idyll.org/blog/2014-moore-ddd-award.html
My larger research vision:
100% buzzword compliant™
Enable and incentivize sharing by providing immediate utility;
frictionless sharing.
Permissionless innovation for e.g. new data mining
approaches.
Plan for poverty with federated infrastructure built on open &
cloud.
Solve people’s current problems, while remaining agile for
the future.
ivory.idyll.org/blog/2014-moore-ddd-award.html
Thanks!
Please contact me at ctbrown@ucdavis.edu!

More Related Content

What's hot

2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grc
c.titus.brown
 

What's hot (20)

2014 bangkok-talk
2014 bangkok-talk2014 bangkok-talk
2014 bangkok-talk
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenome
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streaming
 
Facilitating Scientific Discovery through Crowdsourcing and Distributed Parti...
Facilitating Scientific Discovery through Crowdsourcing and Distributed Parti...Facilitating Scientific Discovery through Crowdsourcing and Distributed Parti...
Facilitating Scientific Discovery through Crowdsourcing and Distributed Parti...
 
Next-generation sequencing: Data mangement
Next-generation sequencing: Data mangementNext-generation sequencing: Data mangement
Next-generation sequencing: Data mangement
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research
 
Drug Repurposing using Deep Learning on Knowledge Graphs
Drug Repurposing using Deep Learning on Knowledge GraphsDrug Repurposing using Deep Learning on Knowledge Graphs
Drug Repurposing using Deep Learning on Knowledge Graphs
 
VariantSpark a library for genomics by Lynn Langit
VariantSpark a library for genomics by Lynn LangitVariantSpark a library for genomics by Lynn Langit
VariantSpark a library for genomics by Lynn Langit
 
Data analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsData analysis & integration challenges in genomics
Data analysis & integration challenges in genomics
 
Life sciences big data use cases
Life sciences big data use casesLife sciences big data use cases
Life sciences big data use cases
 
2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grc
 
Future Architectures for genomics
Future Architectures for genomicsFuture Architectures for genomics
Future Architectures for genomics
 
HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciences
 
2013 10-30-sbc361-reproducible designsandsustainablesoftware
2013 10-30-sbc361-reproducible designsandsustainablesoftware2013 10-30-sbc361-reproducible designsandsustainablesoftware
2013 10-30-sbc361-reproducible designsandsustainablesoftware
 
The Seven Deadly Sins of Bioinformatics
The Seven Deadly Sins of BioinformaticsThe Seven Deadly Sins of Bioinformatics
The Seven Deadly Sins of Bioinformatics
 
Better science through superior software
Better science through superior softwareBetter science through superior software
Better science through superior software
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencing
 
Advanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven ResearchAdvanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven Research
 
Whitepaper : CHI: Hadoop's Rise in Life Sciences
Whitepaper : CHI: Hadoop's Rise in Life Sciences Whitepaper : CHI: Hadoop's Rise in Life Sciences
Whitepaper : CHI: Hadoop's Rise in Life Sciences
 

Viewers also liked

2013 stamps-intro-assembly
2013 stamps-intro-assembly2013 stamps-intro-assembly
2013 stamps-intro-assembly
c.titus.brown
 
Chromosome :: Properties
Chromosome :: PropertiesChromosome :: Properties
Chromosome :: Properties
rejita
 
KILLED DO NOT VIEW
KILLED DO NOT VIEWKILLED DO NOT VIEW
KILLED DO NOT VIEW
avlainich
 
Getting results when working with english result
Getting results when working with english resultGetting results when working with english result
Getting results when working with english result
emege68
 
pycon 2012 tip bof -- intro slides
pycon 2012 tip bof -- intro slidespycon 2012 tip bof -- intro slides
pycon 2012 tip bof -- intro slides
c.titus.brown
 
FER-2010 FINAL Consolidated Progress Report f1 11-03-2011
FER-2010 FINAL Consolidated Progress Report f1 11-03-2011FER-2010 FINAL Consolidated Progress Report f1 11-03-2011
FER-2010 FINAL Consolidated Progress Report f1 11-03-2011
Zafar Ahmad
 

Viewers also liked (20)

Big Data for International Development
Big Data for International DevelopmentBig Data for International Development
Big Data for International Development
 
House Hunting fun
House Hunting funHouse Hunting fun
House Hunting fun
 
Point Dynamics Our Story
Point Dynamics   Our StoryPoint Dynamics   Our Story
Point Dynamics Our Story
 
Tinn Capital 2010 piet van vugt
Tinn Capital 2010 piet van vugtTinn Capital 2010 piet van vugt
Tinn Capital 2010 piet van vugt
 
MoMoTLV Israel March 2010 - Agenda
MoMoTLV Israel March 2010 - AgendaMoMoTLV Israel March 2010 - Agenda
MoMoTLV Israel March 2010 - Agenda
 
Ondernemen kwf 26 nov 2012
Ondernemen kwf 26 nov 2012Ondernemen kwf 26 nov 2012
Ondernemen kwf 26 nov 2012
 
2013 stamps-intro-assembly
2013 stamps-intro-assembly2013 stamps-intro-assembly
2013 stamps-intro-assembly
 
Chromosome :: Properties
Chromosome :: PropertiesChromosome :: Properties
Chromosome :: Properties
 
KILLED DO NOT VIEW
KILLED DO NOT VIEWKILLED DO NOT VIEW
KILLED DO NOT VIEW
 
Getting results when working with english result
Getting results when working with english resultGetting results when working with english result
Getting results when working with english result
 
Software Quality Df
Software Quality DfSoftware Quality Df
Software Quality Df
 
pycon 2012 tip bof -- intro slides
pycon 2012 tip bof -- intro slidespycon 2012 tip bof -- intro slides
pycon 2012 tip bof -- intro slides
 
Is There A Correlation
Is There A CorrelationIs There A Correlation
Is There A Correlation
 
Do You Know The 11g Plan?
Do You Know The 11g Plan?Do You Know The 11g Plan?
Do You Know The 11g Plan?
 
Intellisoft introductionrecruitment l120219
Intellisoft introductionrecruitment l120219Intellisoft introductionrecruitment l120219
Intellisoft introductionrecruitment l120219
 
Morsø erhversråd energimærkning
Morsø erhversråd   energimærkningMorsø erhversråd   energimærkning
Morsø erhversråd energimærkning
 
Loco Legacy Mini-Update
Loco Legacy Mini-UpdateLoco Legacy Mini-Update
Loco Legacy Mini-Update
 
FER-2010 FINAL Consolidated Progress Report f1 11-03-2011
FER-2010 FINAL Consolidated Progress Report f1 11-03-2011FER-2010 FINAL Consolidated Progress Report f1 11-03-2011
FER-2010 FINAL Consolidated Progress Report f1 11-03-2011
 
Sdarticle3
Sdarticle3Sdarticle3
Sdarticle3
 
Morsø kommune og landbruget
Morsø kommune og landbrugetMorsø kommune og landbruget
Morsø kommune og landbruget
 

Similar to 2015 illinois-talk

2013 caltech-edrn-talk
2013 caltech-edrn-talk2013 caltech-edrn-talk
2013 caltech-edrn-talk
c.titus.brown
 
Aaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridAaas Data Intensive Science And Grid
Aaas Data Intensive Science And Grid
Ian Foster
 

Similar to 2015 illinois-talk (20)

2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
2013 caltech-edrn-talk
2013 caltech-edrn-talk2013 caltech-edrn-talk
2013 caltech-edrn-talk
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talk
 
2014 abic-talk
2014 abic-talk2014 abic-talk
2014 abic-talk
 
2015 mcgill-talk
2015 mcgill-talk2015 mcgill-talk
2015 mcgill-talk
 
HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloud
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
Humanizing bioinformatics
Humanizing bioinformaticsHumanizing bioinformatics
Humanizing bioinformatics
 
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...
 
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data HandlingScott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
 
2014 ucl
2014 ucl2014 ucl
2014 ucl
 
Emerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsEmerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomics
 
2014 nci-edrn
2014 nci-edrn2014 nci-edrn
2014 nci-edrn
 
Cshl minseqe 2013_ouellette
Cshl minseqe 2013_ouelletteCshl minseqe 2013_ouellette
Cshl minseqe 2013_ouellette
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
 
Aaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridAaas Data Intensive Science And Grid
Aaas Data Intensive Science And Grid
 

More from c.titus.brown (15)

2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
2015 msu-code-review
2015 msu-code-review2015 msu-code-review
2015 msu-code-review
 
2015 pycon-talk
2015 pycon-talk2015 pycon-talk
2015 pycon-talk
 
2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcast
 
2015 vancouver-vanbug
2015 vancouver-vanbug2015 vancouver-vanbug
2015 vancouver-vanbug
 
2015 osu-metagenome
2015 osu-metagenome2015 osu-metagenome
2015 osu-metagenome
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
 
2015 pag-chicken
2015 pag-chicken2015 pag-chicken
2015 pag-chicken
 
2015 pag-metagenome
2015 pag-metagenome2015 pag-metagenome
2015 pag-metagenome
 
2014 nyu-bio-talk
2014 nyu-bio-talk2014 nyu-bio-talk
2014 nyu-bio-talk
 
2014 mmg-talk
2014 mmg-talk2014 mmg-talk
2014 mmg-talk
 
2014 wcgalp
2014 wcgalp2014 wcgalp
2014 wcgalp
 
2014 moore-ddd
2014 moore-ddd2014 moore-ddd
2014 moore-ddd
 
2014 ismb-extra-slides
2014 ismb-extra-slides2014 ismb-extra-slides
2014 ismb-extra-slides
 
2014 bosc-keynote
2014 bosc-keynote2014 bosc-keynote
2014 bosc-keynote
 

Recently uploaded

GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
Lokesh Kothari
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
ssuser79fe74
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
RohitNehra6
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
PirithiRaju
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
Sérgio Sacani
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Sérgio Sacani
 

Recently uploaded (20)

GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 

2015 illinois-talk

  • 1. AWager for 2016: How SoftwareWill Beat Hardware in Biological Data Analysis C.Titus Brown Associate Professor PHR, School ofVeterinary Medicine, UC Davis This talk on slideshare: slideshare.net/c.titus.brown/
  • 2. This talk idea started with an argument on the Internet. xkcd.com/386/ - “Duty Calls”
  • 4. The obligatory slide about abundant sequencing data. http://www.genome.gov/sequencingcosts/ Also see: https://biomickwatson.wordpress.com/2015/03/25/the-cost-of-sequencing-is- still-going-down/
  • 5. Big Sequencing Data and Biology 1) Listen to the physicists: “look, we know how to analyze data from CERN and Sloan Digital Sky Survey. Just do what we did.” 2) Listen to the SiliconValley folk: “Hadoop, and Spark, dude. Just map-reduce it.” 3) Develop custom approaches.
  • 6. Shotgun sequencing It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness It was the Gest of times, it was the wor , it was the worst of timZs, it was the isdom, it was the age of foolisXness , it was the worVt of times, it was the mes, it was Ahe age of wisdom, it was th It was the best of times, it Gas the wor mes, it was the age of witdom, it was th isdom, it was tIe age of foolishness
  • 7. Resequencing analysis We know a reference genome (specific edition), and want to find variants (differences - blue) in a background of errors (red)
  • 8. The scale of the problem (1) Lots of data per “book” • A human genome contains approximately 6 billion bases of DNA. • Covering the entire genome using random sampling requires ~150 billion bases of sequencing
  • 9. The scale of the problem (2) Many “editions” in e.g. cancer If you want to look at 1000 individual tumor cells and build an evolutionary history of changes, you need 150 Gbp per cell: 150Tbp.
  • 10. The scale of the problem (3) Many sequencers, many analyses. • 10,000 sequencers worldwide (?) • Worldwide sequencing capacity ??, but ~300,000 human genomes in 2014… • Many research groups, each with own question(s) - ~1m data sets each year? • Cheap! ~$10-20k for a 100 Gbp data set.
  • 11. Resequencing analysis We know a reference genome (specific edition), and want to find variants (differences - blue) in a background of errors (red)
  • 12. Mapping: locate reads in reference (pass 1) http://en.wikipedia.org/wiki/File:Mapping_Reads.png
  • 13. Variant detection after mapping (pass 2, 3, and 4) http://www.kenkraaijeveld.nl/genomics/bioinformatics/
  • 14. The current variant calling approach: Map reads Convert to binary Sort binary format by genome pos'n "Pile up" and call variants Extract reads for tricky bits Realign/ assemble (optional)
  • 15. Current approach: pros and cons Pros: • Modular and flexible. • Open source! Well supported! Mature! • Some of it parallelizes easily! Cons: • 4+ passes across the data • Very I/O intensive (hence unsuitable for cloud).
  • 16. Some numbers: • 1000 single cells from a tumor ~ 150 Tbp of data. • HiSeq X10 can do the sequencing in ~3 weeks. • The variant calling requires ~2,000 CPU weeks… • …so, given ~2,000 computers, can do this all in one month. …but, multiply problem by # of possible patients...
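The arithmetic behind this slide can be reproduced directly; all figures (1000 cells, 150 Gbp/cell, 2,000 CPU weeks, 2,000 machines) come from the slide itself:

```python
# Reproducing the back-of-envelope numbers on this slide.
CELLS = 1000
BP_PER_CELL = 150e9                  # 150 Gbp of sequencing per tumor cell

total_bp = CELLS * BP_PER_CELL
print(f"total data: {total_bp / 1e12:.0f} Tbp")          # 150 Tbp

CPU_WEEKS = 2000
MACHINES = 2000
calling_weeks = CPU_WEEKS / MACHINES
# ~3 weeks of sequencing + ~1 week of variant calling => about a month total.
print(f"variant calling wall-clock: {calling_weeks:.0f} week(s)")
```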
  • 17. Big Sequencing Data and Biology 1) Listen to the physicists: “look, we know how to analyze data from CERN and the Sloan Digital Sky Survey. Just do what we did.” 2) Listen to the Silicon Valley folk: “Hadoop, and Spark, dude. Just map-reduce it.” 3) Develop better custom approaches, swiping ideas from Silicon Valley and physicists as needed.
  • 18. So, back to the Internet argument: it ended with a bet. In two years (Nov 2016), my 9-year-old daughter will be able to analyze a full human genome sequence on her desktop computer. https://twitter.com/ctitusbrown/status/535191544119451648
  • 19. “Never compete unless you have an unfair advantage.” 1. My daughter is awesome. 2. We know how to do it already* (* some assembly required) 3. Heng Li just posted a preprint yesterday! “FermiKit”, http://arxiv.org/abs/1504.06574
  • 20. Remainder of talk – outline. 1. “Data” vs “information” 2. Streaming approaches to lossy compression and building compressible graphs for soil metagenomics. 3. Sequencing errors and variants using graphs.
  • 21. Conway T C , Bromage A J Bioinformatics 2011;27:479-486 © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com De Bruijn graphs (sequencing graphs) scale with data size, not information size.
  • 22. Why do sequence graphs scale badly? Memory usage ~ “real” variation + number of errors Number of errors ~ size of data set
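This scaling argument can be made concrete with a toy model: the number of distinct k-mers (graph nodes) from the true genome is bounded by the genome size, while each sequencing error creates up to k novel k-mers, so the error contribution grows linearly with data volume. All parameters below (5 Mbp genome, 100 bp reads, 1% error, k=21) are illustrative assumptions, not measurements:

```python
# Toy model of de Bruijn graph size (~ number of distinct k-mers).
# All parameters are assumed for illustration only.
GENOME = 5e6        # "true" genome: ~5e6 true k-mers, a fixed ceiling
READ_LEN = 100
ERR_RATE = 0.01     # 1% per-base error rate
K = 21              # each error creates up to K novel (spurious) k-mers

def graph_kmers(n_reads):
    true_kmers = GENOME                                  # plateaus with coverage
    error_kmers = n_reads * READ_LEN * ERR_RATE * K      # grows with data size
    return true_kmers + error_kmers

for n in (1e5, 1e6, 1e7):
    print(f"{n:>10.0f} reads -> ~{graph_kmers(n):.2e} k-mers")
```

Under these assumptions, at 1e6 reads (~20x coverage) the spurious k-mers already outnumber the true graph several-fold: memory tracks data size, not information size.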
  • 23. Practical memory measurements Velvet measurements (Adina Howe)
  • 24. Our solution: lossy compression Conway T C, Bromage A J Bioinformatics 2011;27:479-486
  • 25. Shotgun sequencing and coverage “Coverage” is simply the average number of reads that overlap each true base in genome. Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
  • 26. Random sampling => deep sampling needed Typically 10-100x needed for robust recovery (30-300 Gbp for human)
  • 27. Actual coverage varies widely from the average. Low coverage introduces unavoidable breaks.
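One way to see why the average hides trouble: under a simple Poisson model of random shotgun coverage (the Lander-Waterman picture, used here as an illustrative assumption), the fraction of bases left uncovered at mean coverage c is ~exp(-c):

```python
# Fraction of genome uncovered under a Poisson coverage model: ~exp(-c).
import math

for c in (1, 5, 10, 30):
    uncovered = math.exp(-c)
    print(f"{c:>3}x mean coverage -> ~{uncovered:.2e} of genome uncovered")

# Even at 10x mean coverage, a 3 Gbp genome would still have on the order
# of 1e5 uncovered bases -- each a potential assembly break.
print(f"{3e9 * math.exp(-10):.1e} uncovered bases at 10x")
```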
  • 28. But! Shotgun sequencing is very redundant! Lots of the high coverage simply isn’t needed. (unnecessary data)
  • 35. Graph size now scales with information content. Most samples can be reconstructed via de novo assembly on commodity computers.
  • 36. Diginorm ~ “lossy compression” Nearly perfect from an information theoretic perspective: – Discards 95% or more of the data for genomes. – Loses < 0.02% of information.
  • 37. This changes the way analyses scale. Conway T C, Bromage A J Bioinformatics 2011;27:479-486
  • 38. Streaming lossy compression: for read in dataset: if estimated_coverage(read) < CUTOFF: yield read This is literally a three-line algorithm. Not kidding. It took four years to figure out which three lines, though…
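The three lines above can be fleshed out into a runnable sketch of digital normalization. Note this is a toy: the real khmer implementation estimates coverage as the median k-mer abundance in a Count-Min Sketch, and uses k=20 and a cutoff around 20; here a plain dict stands in for the sketch, with tiny parameters so the toy data saturates quickly:

```python
# Minimal sketch of digital normalization (diginorm), assuming a plain
# dict in place of khmer's Count-Min Sketch, with toy parameters.
from collections import defaultdict

K, CUTOFF = 5, 3
counts = defaultdict(int)

def estimated_coverage(read):
    # median k-mer abundance is a cheap proxy for the read's coverage
    abundances = sorted(counts[read[i:i+K]]
                        for i in range(len(read) - K + 1))
    return abundances[len(abundances) // 2]

def diginorm(reads):
    for read in reads:
        if estimated_coverage(read) < CUTOFF:
            for i in range(len(read) - K + 1):   # only kept reads add k-mers
                counts[read[i:i+K]] += 1
            yield read

reads = ["ACGTACGTTACG"] * 10      # ten identical (redundant) reads
kept = list(diginorm(reads))
print(len(kept))                    # only the first CUTOFF copies survive
```

Because discarded reads never touch the counter, memory scales with the retained (information-rich) reads rather than with the raw data stream.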
  • 39. Diginorm can detect information saturation in a stream. Zhang et al., submitted.
  • 40. This generically permits semi-streaming analytical approaches. Zhang et al., submitted.
  • 41. e.g. E. coli analysis => ~1.2 pass, sublinear memory Zhang et al., submitted.
  • 42. Another simple algorithm. Zhang et al., submitted.
  • 43. Single pass, reference free, tunable, streaming online variant calling. Error detection → variant calling
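The intuition behind "error detection → variant calling" can be shown with a toy abundance model (this is an illustration of the idea, not the khmer implementation): once coverage is saturated, a sequencing error yields k-mers seen once or twice, while a real variant's k-mers recur in proportion to the haplotype's depth. All reads and thresholds below are made up for the example:

```python
# Toy: separate one-off errors from recurring variants by k-mer abundance.
from collections import Counter

K = 5
ERROR_CUTOFF = 2    # abundance at/below this => likely sequencing error

def kmers(seq):
    return [seq[i:i+K] for i in range(len(seq) - K + 1)]

# Deeply sampled toy locus: 10 reference reads, 5 reads carrying a real
# variant (G->C, consistent), and 1 read with a random error (T->A).
reads = (["ACGTACGTTACG"] * 10 +
         ["ACGTACCTTACG"] * 5 +
         ["ACGAACGTTACG"])

counts = Counter(km for r in reads for km in kmers(r))

# shared k-mers sit near full depth (>=10 here); variant k-mers in between
errors   = {km for km, c in counts.items() if c <= ERROR_CUTOFF}
variants = {km for km, c in counts.items() if ERROR_CUTOFF < c < 10}
print(sorted(errors))     # k-mers unique to the one-off error
print(sorted(variants))   # k-mers supporting the recurring variant
```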
  • 44. Real time / streaming data analysis. Raw data (real time, from sequencer?) Error trimming Variant calling De novo assembly
  • 45. Stream all the things! This code works.
  • 46. Preliminary benchmarks - • Can do variant calling on E. coli in about 5 minutes, in 40 MB of RAM, with a single thread, with no optimization. • Scaling to human should be readily feasible. • …I have another 18 months before I lose the bet.
  • 47. My real point - • We need well-founded, flexible, algorithmically efficient, high-performance components for sequence data manipulation in biology. • We are building these on top of a streaming and low memory paradigm. • We are building out a scripting library for composing these operations.
  • 48. Scaling compute, or algorithms? There are some problems that require big computers & many processors. Genomic data analysis shouldn’t be one of them, based on information content alone! (This is probably good, given the scale of the need.) Many other biological problems do require big compute, however.
  • 49. Reminder: the real challenge is understanding We have gotten distracted by shiny toys: sequencing!! Data!! Data is now plentiful! But: We typically have no knowledge of what > 50% of an e.g. environmental metagenome “means”, functionally. http://ivory.idyll.org/blog/2014-function-of-unknown-genes.html
  • 50. I was going to give you my 5 year vision… …but I don’t have 20/20 eyesight. Via @adrianholovaty
  • 51. I was going to give you my 5 year vision… …but I don’t have 20/20 eyesight. (20/20? 2020? 2015 + 5?) (My wife has asked that I apologize for this joke.) Via @adrianholovaty
  • 52. Data integration as a next challenge In 5-10 years, we will have nigh-infinite data. (Genomic, transcriptomic, proteomic, metabolomic, …?) How do we explore these data sets? Registration, cross-validation, integration with models…
  • 53. Carbon cycling in the ocean - “DeepDOM” cruise, Kujawinski & Longnecker et al.
  • 54. Integrating many different data types to build understanding. Figure 2. Summary of challenges associated with the data integration in the proposed project. “DeepDOM” cruise: examination of dissolved organic matter & microbial metabolism vs physical parameters – potential collab.
  • 56. A few thoughts on practical next steps. • Enable scientists with better tools. • Train a bioinformatics “middle class.” • Accelerate science via the open science “network effect”.
  • 57. That is… what do we do now? Once you have all this data, what do you do? "Business as usual simply cannot work.” - David Haussler, 2014 Looking at millions to billions of (human) genomes in the next 5-10 years.
  • 58. Enabling scientists with better tools - Build robust, flexible computational frameworks for data exploration, and make them open and remixable. Develop theory, algorithms, & software together, and train people in their use. (Stop pretending that we can develop “black boxes” that will give you the right answer.)
  • 59. Education and training - towards a bioinformatics “middle class” Biology is underprepared for data-intensive investigation. We must teach and train the next generations. => Build a cohort of “data intensive biologists” who can use data and tools as an intrinsic and unremarkable part of their research. ~10-20 workshops / year, novice -> masterclass; open materials. dib-training.rtfd.org/
  • 60. Can open science trigger a “network effect”? http://prasoondiwakar.com/wordpress/trivia/the-network-effect
  • 61. So: can we drive data sharing via a decentralized model, e.g. a distributed graph database? Compute server (Galaxy? Arvados?) Web interface + API Data/ Info Raw data sets Public servers "Walled garden" server Private server Graph query layer Upload/submit (NCBI, KBase) Import (MG-RAST, SRA, EBI) ivory.idyll.org/blog/2014-moore-ddd-award.html
  • 62. My larger research vision: 100% buzzword compliant™ Enable and incentivize sharing by providing immediate utility; frictionless sharing. Permissionless innovation for e.g. new data mining approaches. Plan for poverty with federated infrastructure built on open & cloud. Solve people’s current problems, while remaining agile for the future. ivory.idyll.org/blog/2014-moore-ddd-award.html
  • 63. Thanks! Please contact me at ctbrown@ucdavis.edu!

Editor's Notes

  1. A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, the number of edges due to the underlying genome plateaus once every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, their number scales linearly with the number of sequence reads. Once coverage is deep enough to clearly distinguish true edges (which come from the underlying genome), they will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.
  2. High coverage is essential.
  3. High coverage is essential.
  4. Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
  5. Taking advantage of structure within read
  6. Passionate about training; necessary for the advancement of the field; also deeply self-interested because I find out what the real problems are. (“Some people can do assembly” is not “everyone can do assembly”)
  7. Analyze data in cloud; import and export important; connect to other databases.
  8. Work with other Moore DDD folk on the data mining aspect. Start with cross validation, move to more sophisticated in-server implementations.