WHAT’S AHEAD FOR 
BIOLOGY? 
THE DATA INTENSIVE FUTURE 
C. Titus Brown 
ctb@msu.edu 
Assistant Professor, Michigan State University 
(In January, moving to UC Davis / VetMed.) 
Talk slides on slideshare.net/c.titus.brown
The Data Deluge 
(a traditional requirement for these talks)
The short version 
• Data gathering & storage is growing, leaps & bounds! 
• Biology is completely unprepared for this at every level: 
• Technical and infrastructure 
• Cultural 
• Training 
• Our funding/incentivization/prioritization structures are also 
largely unprepared. 
• This is a huge missed opportunity!! 
(What does Titus think we should be doing?)
Challenges: 
1. Dealing with Big Data (my current research) 
2. Interpreting the unknowns (future research) 
3. Accelerating research with better data/methods/ 
results sharing. 
4. Expanding the role of exploratory data 
analysis in biology. (career windmill)
1. Dealing with Big Data 
A. Lossy compression 
B. Streaming algorithms
Looking forward 5 years… 
Navin et al., 2011
Some basic math: 
• 1000 single cells from a tumor… 
• …sequenced to 40x haploid coverage with Illumina… 
• …yields 120 Gbp each cell… 
• …or 120 Tbp of data. 
• HiSeq X10 can do the sequencing in ~3 weeks. 
• The variant calling will require 2,000 CPU weeks… 
• …so, given ~2,000 computers, can do this all in one 
month.
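The arithmetic on this slide can be checked with a few lines of Python. This is a rough sketch: the ~3 Gbp haploid human genome size is an assumption not stated on the slide, and the other constants are copied from the bullets above.

```python
# Back-of-the-envelope check of the slide's numbers.
HAPLOID_GENOME_BP = 3e9   # assumption: ~3 Gbp haploid human genome
COVERAGE = 40             # 40x haploid coverage (from the slide)
N_CELLS = 1000            # single cells from one tumor (from the slide)
SEQUENCING_WEEKS = 3      # ~3 weeks on a HiSeq X10 (from the slide)
CPU_WEEKS = 2000          # variant-calling estimate (from the slide)
N_COMPUTERS = 2000        # available machines (from the slide)

bp_per_cell = HAPLOID_GENOME_BP * COVERAGE   # bases sequenced per cell
total_bp = bp_per_cell * N_CELLS             # bases across all cells
compute_weeks = CPU_WEEKS / N_COMPUTERS      # wall-clock weeks if perfectly parallel
total_weeks = SEQUENCING_WEEKS + compute_weeks

print(f"{bp_per_cell / 1e9:.0f} Gbp per cell")
print(f"{total_bp / 1e12:.0f} Tbp total")
print(f"{total_weeks:.0f} weeks end to end")
```

The "one month" figure only holds if the variant calling parallelizes perfectly across the ~2,000 machines and starts as soon as sequencing finishes.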
Similar math applies: 
• Pathogen detection in blood; 
• Environmental sequencing; 
• Sequencing rare DNA from circulating blood. 
• Two issues: 
• Volume of data & compute infrastructure; 
• Latency for clinical applications.
Approach A: Lossy compression 
(Reduce volume of data & compute infrastructure 
requirements) 
[Diagram: Raw data (~10-100 GB) -> Compression (~2 GB) -> Analysis -> "Information" (~1 GB each, several) -> Database & integration] 
Lossy compression can substantially 
reduce data size while retaining 
information needed for later (re)analysis.
http://en.wikipedia.org/wiki/JPEG 
Lossy compression
Digital normalization
e.g. de novo assembly now scales with richness, not 
diversity. 
• 10-100 fold decrease in memory requirements 
• 10-100 fold speed up in analysis 
Brown et al., arXiv, 2012
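A minimal sketch of the digital normalization idea, under simplifying assumptions: a read is kept only if the median count of its k-mers, among the reads kept so far, is still below a coverage cutoff. The function names, the tiny `k`, and the exact `Counter` are illustrative choices. A real implementation would more likely use a memory-efficient probabilistic counter and treat a k-mer and its reverse complement as the same k-mer.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Yield every overlapping k-mer in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def digital_normalization(reads, k=4, cutoff=2):
    """Keep a read only if its median k-mer count so far is below `cutoff`."""
    counts = Counter()   # k-mer counts over the reads kept so far
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k: no coverage estimate possible
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
        # otherwise the read is redundant at this coverage and is dropped
    return kept


reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
print(digital_normalization(reads, k=4, cutoff=2))
```

Each read is examined once and redundant reads never enter the counting table, so memory grows with the amount of novel sequence rather than with the raw data volume.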
Hey, cool, our approach and software are 
used by Illumina for long-read sequencing!
Our general strategy: compressive prefilters 
[Diagram: Raw data (~10-100 GB) -> Compression (~2 GB) -> Analysis -> "Information" (~1 GB each, several) -> Database & integration] 
Save in cold storage 
Save for reanalysis, investigation.
Approach B: streaming data analysis 
(Reduce latency for clinical applications) 
Data 
1-pass 
Answer 
See also eXpress, Roberts et al., 2013.
Current variant calling approaches are multipass 
Data 
Mapping 
Sorting 
Calling Answer
Streaming graph-based approaches can 
detect information saturation
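One way to picture saturation detection in a single pass is shown below. It is a sketch only: a plain set of k-mers stands in for the graph structure the slide refers to, and the function name, window size, and threshold are hypothetical. The stream is abandoned once a window of reads contributes almost no new k-mers.

```python
def stream_until_saturated(reads, k=4, window=3, min_novel_fraction=0.1):
    """Consume reads one at a time; stop once a window adds few new k-mers.

    Returns (reads_consumed, distinct_kmers_seen).
    """
    seen = set()
    novel_in_window = 0
    total_in_window = 0
    consumed = 0
    for read in reads:
        consumed += 1
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            total_in_window += 1
            if kmer not in seen:
                seen.add(kmer)
                novel_in_window += 1
        if consumed % window == 0:
            if total_in_window and novel_in_window / total_in_window < min_novel_fraction:
                break  # saturated: more reads add almost nothing new
            novel_in_window = total_in_window = 0
    return consumed, len(seen)


# Only the first few of 100 identical reads need to be looked at.
print(stream_until_saturated(iter(["ACGTACGT"] * 100)))
```

Because the input is consumed lazily, the same loop could in principle sit directly on a sequencer's output and answer "are we done yet?" while data is still arriving.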
Approach supports compute-intensive 
interludes – remapping, etc. 
Rimmer et al., 2014
Streaming with bases 
k bases... 
Graph 
k+1 
k bases... k+1 
k+2 
k bases... k+1 
k bases... k+1 
k bases... k+1 
... 
k bases... k+1 
Variants
Integrate sequencing and analysis 
Sequencing 
Analysis 
Are we done yet? 
Decrease latency!
So, how do we deal with Big Data issues? 
• Fairly record cost of data analysis (running software & 
cost of computational infrastructure) 
• This incentivizes development of better approaches! 
• Lossy compression, streaming, …?? 
• Think 5 years ahead, rather than 2 years behind! 
• Pay attention to workflows, software lifecycle, etc. etc. 
(See ABiC 2014 talk :)
2. Dealing with the unknowns 
Compare sample to control -> 
Eliminate all the things we don't know how to interpret -> 
Wonder why we can only account for ~50% effect. 
(~millions -> ~10s of thousands)
“What is the function of ….?” 
We can observe almost everything at a DNA/RNA level! 
But, 
• Experimentally based functional annotations are sparse; 
• Most genes play multiple roles and are generally 
annotated for only one; 
• Model organisms are phylogenetically quite limited and 
biased; 
• …there is little or no $$$ or reputation gain for 
characterizing novel genes (and nor is it straightforward 
or easy to do so!)
The problem of lopsided gene characterization: 
e.g., the brain "ignorome" 
"...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression 
networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. 
The major distinguishing characteristic between these sets of genes is date of discovery, early 
discovery being associated with greater research momentum—a genomic bandwagon effect." 
Slide courtesy Erich Schwarz. Ref.: Pandey et al. (2014), PLoS One 9, e88889.
How do we systematically broaden our 
functional understanding of genes? 
1. More experimental work! 
• Population studies, perturbation studies, good ol’ fashioned 
molecular biology, etc. 
2. Integrate modeling, to see where we have (or lack) 
sufficiency of knowledge for a particular phenotype. 
3. Sequence it all and let the bioinformaticians sort it out! 
What I think will work best: a tight integration between all 
three approaches (cf. physics) – hypothesis-driven 
investigation, modeling, and exploratory data science. 
See also: ivory.idyll.org/blog/2014-function-of-unknown-genes.html
3. Accelerating research with better 
sharing of results, data, methods. 
Our current journal system is a 20th century solution to a 
17th century problem. 
- Paraphrased from Cameron Neylon 
(Note: 20th century was LAST century)
3. Accelerating research with better 
sharing of results, data, methods. 
We could accelerate research with better sharing. 
Recent example re rare diseases: 
http://www.newyorker.com/magazine/2014/07/21/one-of-a-kind-2 
“The current academic publication system does patients an 
enormous disservice.” – Daniel MacArthur 
There are many barriers to better communication of results, 
data, and methods, but most of them are cultural, not 
technical. (Much harder!)
Preprints 
• Many fields (including bioinformatics and increasingly 
genomics) routinely share papers prior to publication. This 
facilitates reproduction, dissemination, and ultimately 
progress. 
• Biology is behind the times! 
See: 
1. Haldane’s Sieve (blog discussion of preprints) 
2. Evidence that preprints confer massive citation 
advantage in physics (http://arxiv.org/abs/0906.5418)
Current model for data sharing 
In a data limited world, 
this kind of made sense. 
Gather data -> Interpret data -> Write & publish paper -> Grudgingly share as little data as possible
Current model for data sharing 
This model ignores the fact that data 
often has multiple (unrealized or 
serendipitous) uses. 
(Among many other problems ;) 
Gather data -> Interpret data -> Write & publish paper -> Grudgingly share as little data as possible
The train wreck ahead 
When data is cheap, and interpretation is expensive, most data doesn’t get published, and therefore is lost. 
Gather data -> Interpret data -> Write & publish paper -> Grudgingly share as little data as possible 
(Program managers are not fans of this)
Data sharing challenges - 
• Few immediate or career rewards for sharing data; 
incentives are almost entirely punitive (if you DON’T…) 
• Sharing data in a usable form is still rather difficult. 
• Submitting data to archival services is, in many cases, 
surprisingly difficult. 
• Few methods for gaining recognition for data sharing prior 
to publication of conclusions.
The Ocean Cruise Model 
One really expensive cruise, many data collectors, shared data. 
DeepDOM – photo courtesy E. Kujawinski, WHOI
Sage Bionetworks / “walled garden” 
Collaborative data sharing policy with restricted access to 
outsiders; 
Central platform with analysis provenance tracking; 
A model for the future of biomedical research? 
See, e.g., Enabling transparent and collaborative computational analysis 
of 12 tumor types within The Cancer Genome Atlas. Omberg et al, 2014.
Distributed cyberinfrastructure to encourage sharing? 
Web interface + API 
Compute server 
(Galaxy? 
Arvados?) 
Data/ 
Info 
Raw data sets 
Public 
servers 
"Walled 
garden" 
server 
Private 
server 
Graph query layer 
Upload/submit 
(NCBI, KBase) 
Import 
(MG-RAST, 
SRA, EBI) 
ivory.idyll.org/blog/2014-moore-ddd-talk.html
Better metadata collection is needed! 
Suppose the NSA could EITHER track 
who was calling whom, 
OR what they were saying – which would 
be more valuable? 
Who? What? Who?
Better metadata collection is needed! 
We need to track sample origin, 
phenotype/environmental conditions, etc. 
Sample information The –omic data Phenotype 
This will facilitate discovery, serendipity, re-analysis, and cross-validation.
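As a rough illustration of what tracking this could look like, here is a sketch of a record that keeps sample information, a pointer to the -omic data, and phenotype together, so that other people's samples can be found by condition. Every field name, the `find_samples` helper, and the example values are hypothetical and made up for the example. This is not an existing metadata standard.

```python
from dataclasses import dataclass, field


@dataclass
class SampleRecord:
    """One sample plus the metadata needed to reuse it (all fields hypothetical)."""
    sample_id: str
    origin: str                                     # where the sample came from
    conditions: dict = field(default_factory=dict)  # environmental conditions
    phenotype: dict = field(default_factory=dict)   # observed phenotype(s)
    omic_data: str = ""                             # pointer to the -omic data


def find_samples(records, **wanted):
    """Return records whose conditions match every requested key/value pair."""
    return [r for r in records
            if all(r.conditions.get(key) == value for key, value in wanted.items())]


# Made-up example records, for illustration only.
records = [
    SampleRecord("s1", "soil, site A", {"ph": 6.5}, {"yield": "high"}, "reads/s1.fq.gz"),
    SampleRecord("s2", "soil, site B", {"ph": 7.8}, {"yield": "low"}, "reads/s2.fq.gz"),
]
print([r.sample_id for r in find_samples(records, ph=6.5)])
```

The point of the structure is that the query runs on metadata alone. The sequence data only needs to be fetched for the samples that match.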
Data and software citation 
There are now methods for: 
• Assigning DOIs to data (which makes it citable) – figshare, Dryad. 
• Data publications – GigaScience, SIGS, Scientific Data. 
• Software citations – Zenodo, MozSciLab/GitHub 
• Software publications – F1000 Research 
Will this address the need to incentivize sharing of data and 
methods? Probably not, but it’s a good start ;)
4. Exploratory data analysis 
Old model: 
Gather data -> Interpret data
New model 
Your data is most useful when combined with everyone 
else’s. 
[Diagram: Gather data -> Interpret data, with "Interpret data" also fed by many databases of other people's data]
Given enough publicly accessible data… 
[Diagram: Interpret data, fed by many databases of other people's data]
But: we face lack of training. 
The lack of training in data science is the biggest challenge 
facing biology. 
Students! There’s a great future in data analysis! 
Also see:
Data integration? 
Once you have all the data, what do you do? 
"Business as usual simply cannot work." 
Looking at millions to billions of genomes. 
(David Haussler, 2014) 
Illumina estimate: 228,000 human genomes will be 
sequenced in 2014, mostly by researchers. 
http://www.technologyreview.com/news/531091/emtech-illumina-says-228000-human-genomes-will-be-sequenced-this-year/
Looking to the future 
For the senior scientists and funders amongst us, 
• How do we incentivize data sharing, and training? 
• How do we fund the meso- and micro-scale 
cyberinfrastructure development that will accelerate bio? 
The NIH and NSF are exploring 
this; the Moore and Sloan 
foundations are simply doing it 
(but at 1% of the size). 
See: ivory.idyll.org/blog/2014-nih-adds-up-meeting.html
Thanks for listening!
For Australian students and early career researchers 
in bioinformatics and computational biology 
Annual Student Symposium 
Friday 28th November 2014 
Parkville, Victoria 
Now accepting abstracts for talks and posters 
Talk abstracts close 31st October 
combine.org.au

More Related Content

What's hot

2013 bio it world
2013 bio it world2013 bio it world
2013 bio it worldChris Dwan
 
Adversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkAdversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkRobert Grossman
 
NGP Retreat Open Science 2015
NGP Retreat Open Science 2015NGP Retreat Open Science 2015
NGP Retreat Open Science 2015Jackie Wirz, PhD
 
No Free Lunch: Metadata in the life sciences
No Free Lunch:  Metadata in the life sciencesNo Free Lunch:  Metadata in the life sciences
No Free Lunch: Metadata in the life sciencesChris Dwan
 
2017 bio it world
2017 bio it world2017 bio it world
2017 bio it worldChris Dwan
 
The Future(s) of the World Wide Web
The Future(s) of the World Wide WebThe Future(s) of the World Wide Web
The Future(s) of the World Wide WebJames Hendler
 
Designing a synergistic relationship between undergraduate Data Science educa...
Designing a synergistic relationship between undergraduate Data Science educa...Designing a synergistic relationship between undergraduate Data Science educa...
Designing a synergistic relationship between undergraduate Data Science educa...Ciera Martinez
 
Jim Gray Award Lecture
Jim Gray Award LectureJim Gray Award Lecture
Jim Gray Award LecturePhilip Bourne
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big DataArjen de Vries
 
Broad Data (India 2015)
Broad Data (India 2015)Broad Data (India 2015)
Broad Data (India 2015)James Hendler
 
Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...Allen Day, PhD
 
The Semantic Web: It's for Real
The Semantic Web: It's for RealThe Semantic Web: It's for Real
The Semantic Web: It's for RealJames Hendler
 
HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017philippbayer
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingUniversity of Washington
 
Abcd iqs ssoftware-projects-mercecrosas
Abcd iqs ssoftware-projects-mercecrosasAbcd iqs ssoftware-projects-mercecrosas
Abcd iqs ssoftware-projects-mercecrosasMerce Crosas
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DUniversity of Washington
 
Towards An Improvement Community Platform for Service Innovation
Towards An Improvement Community Platform for Service InnovationTowards An Improvement Community Platform for Service Innovation
Towards An Improvement Community Platform for Service InnovationJack Park
 
Information Extraction from Text, presented @ Deloitte
Information Extraction from Text, presented @ DeloitteInformation Extraction from Text, presented @ Deloitte
Information Extraction from Text, presented @ DeloitteDeep Kayal
 

What's hot (20)

2013 bio it world
2013 bio it world2013 bio it world
2013 bio it world
 
Adversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkAdversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World Talk
 
NGP Retreat Open Science 2015
NGP Retreat Open Science 2015NGP Retreat Open Science 2015
NGP Retreat Open Science 2015
 
No Free Lunch: Metadata in the life sciences
No Free Lunch:  Metadata in the life sciencesNo Free Lunch:  Metadata in the life sciences
No Free Lunch: Metadata in the life sciences
 
2017 bio it world
2017 bio it world2017 bio it world
2017 bio it world
 
The Future(s) of the World Wide Web
The Future(s) of the World Wide WebThe Future(s) of the World Wide Web
The Future(s) of the World Wide Web
 
Designing a synergistic relationship between undergraduate Data Science educa...
Designing a synergistic relationship between undergraduate Data Science educa...Designing a synergistic relationship between undergraduate Data Science educa...
Designing a synergistic relationship between undergraduate Data Science educa...
 
Jim Gray Award Lecture
Jim Gray Award LectureJim Gray Award Lecture
Jim Gray Award Lecture
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big Data
 
Democratizing Data Science in the Cloud
Democratizing Data Science in the CloudDemocratizing Data Science in the Cloud
Democratizing Data Science in the Cloud
 
Broad Data (India 2015)
Broad Data (India 2015)Broad Data (India 2015)
Broad Data (India 2015)
 
Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...
 
The Semantic Web: It's for Real
The Semantic Web: It's for RealThe Semantic Web: It's for Real
The Semantic Web: It's for Real
 
HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity Computing
 
Abcd iqs ssoftware-projects-mercecrosas
Abcd iqs ssoftware-projects-mercecrosasAbcd iqs ssoftware-projects-mercecrosas
Abcd iqs ssoftware-projects-mercecrosas
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&D
 
Data at the NIH
Data at the NIHData at the NIH
Data at the NIH
 
Towards An Improvement Community Platform for Service Innovation
Towards An Improvement Community Platform for Service InnovationTowards An Improvement Community Platform for Service Innovation
Towards An Improvement Community Platform for Service Innovation
 
Information Extraction from Text, presented @ Deloitte
Information Extraction from Text, presented @ DeloitteInformation Extraction from Text, presented @ Deloitte
Information Extraction from Text, presented @ Deloitte
 

Viewers also liked

Maximise Software Investment In Uncertain Times
Maximise Software Investment In Uncertain TimesMaximise Software Investment In Uncertain Times
Maximise Software Investment In Uncertain TimesKristina O'Regan
 
OSHA Goes On the Attack as the Obama Administration Winds Down: Are You Prepa...
OSHA Goes On the Attack as the Obama Administration Winds Down: Are You Prepa...OSHA Goes On the Attack as the Obama Administration Winds Down: Are You Prepa...
OSHA Goes On the Attack as the Obama Administration Winds Down: Are You Prepa...Kegler Brown Hill + Ritter
 
iPOJO 2.x - a tale about dynamism
iPOJO 2.x - a tale about dynamismiPOJO 2.x - a tale about dynamism
iPOJO 2.x - a tale about dynamismClément Escoffier
 
Analizador sintáctico de Pascal escrito en Bison
Analizador sintáctico de Pascal escrito en BisonAnalizador sintáctico de Pascal escrito en Bison
Analizador sintáctico de Pascal escrito en BisonEgdares Futch H.
 
Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...
Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...
Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...Circles of San Antonio Community Coalition
 
Vrouwen In Het Management
Vrouwen In Het ManagementVrouwen In Het Management
Vrouwen In Het ManagementAydin Kintziger
 
Selling to Brazil, Chile & Colombia- Toolkit for Success
Selling to Brazil, Chile & Colombia- Toolkit for SuccessSelling to Brazil, Chile & Colombia- Toolkit for Success
Selling to Brazil, Chile & Colombia- Toolkit for SuccessKegler Brown Hill + Ritter
 
BOSC 2012 panel discussion
BOSC 2012 panel discussionBOSC 2012 panel discussion
BOSC 2012 panel discussionc.titus.brown
 
Education generatie z
Education generatie zEducation generatie z
Education generatie zPiet van Vugt
 
Chapter 10 - Added Values
Chapter 10 - Added ValuesChapter 10 - Added Values
Chapter 10 - Added Valueswenchein huang
 
유기화학 2nd
유기화학 2nd유기화학 2nd
유기화학 2ndshinkyung
 
Castello Normanno Di Adrano
Castello Normanno Di AdranoCastello Normanno Di Adrano
Castello Normanno Di AdranoYvonne Sgroi
 
How to download Microsoft Security Essentials?
How to download Microsoft Security Essentials?How to download Microsoft Security Essentials?
How to download Microsoft Security Essentials?jessecadelina
 

Viewers also liked (20)

Maximise Software Investment In Uncertain Times
Maximise Software Investment In Uncertain TimesMaximise Software Investment In Uncertain Times
Maximise Software Investment In Uncertain Times
 
2014 abic-talk
2014 abic-talk2014 abic-talk
2014 abic-talk
 
Seismic Waves
Seismic WavesSeismic Waves
Seismic Waves
 
2014 mmg-talk
2014 mmg-talk2014 mmg-talk
2014 mmg-talk
 
OSHA Goes On the Attack as the Obama Administration Winds Down: Are You Prepa...
OSHA Goes On the Attack as the Obama Administration Winds Down: Are You Prepa...OSHA Goes On the Attack as the Obama Administration Winds Down: Are You Prepa...
OSHA Goes On the Attack as the Obama Administration Winds Down: Are You Prepa...
 
iPOJO 2.x - a tale about dynamism
iPOJO 2.x - a tale about dynamismiPOJO 2.x - a tale about dynamism
iPOJO 2.x - a tale about dynamism
 
Analizador sintáctico de Pascal escrito en Bison
Analizador sintáctico de Pascal escrito en BisonAnalizador sintáctico de Pascal escrito en Bison
Analizador sintáctico de Pascal escrito en Bison
 
Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...
Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...
Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...
 
Roman roads
Roman roadsRoman roads
Roman roads
 
2013 arizona-swc
2013 arizona-swc2013 arizona-swc
2013 arizona-swc
 
Vrouwen In Het Management
Vrouwen In Het ManagementVrouwen In Het Management
Vrouwen In Het Management
 
Selling to Brazil, Chile & Colombia- Toolkit for Success
Selling to Brazil, Chile & Colombia- Toolkit for SuccessSelling to Brazil, Chile & Colombia- Toolkit for Success
Selling to Brazil, Chile & Colombia- Toolkit for Success
 
BOSC 2012 panel discussion
BOSC 2012 panel discussionBOSC 2012 panel discussion
BOSC 2012 panel discussion
 
Education generatie z
Education generatie zEducation generatie z
Education generatie z
 
Chapter 10 - Added Values
Chapter 10 - Added ValuesChapter 10 - Added Values
Chapter 10 - Added Values
 
유기화학 2nd
유기화학 2nd유기화학 2nd
유기화학 2nd
 
Hazed and Confused
Hazed and ConfusedHazed and Confused
Hazed and Confused
 
Castello Normanno Di Adrano
Castello Normanno Di AdranoCastello Normanno Di Adrano
Castello Normanno Di Adrano
 
Writing
WritingWriting
Writing
 
How to download Microsoft Security Essentials?
How to download Microsoft Security Essentials?How to download Microsoft Security Essentials?
How to download Microsoft Security Essentials?
 

Similar to 2014 aus-agta

2013 caltech-edrn-talk
2013 caltech-edrn-talk2013 caltech-edrn-talk
2013 caltech-edrn-talkc.titus.brown
 
HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8Scott Edmunds
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleAndy Petrella
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talkc.titus.brown
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirSpark Summit
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynotec.titus.brown
 
Talk at OHSU, September 25, 2013
Talk at OHSU, September 25, 2013Talk at OHSU, September 25, 2013
Talk at OHSU, September 25, 2013Anita de Waard
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibilityc.titus.brown
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudJan Aerts
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012c.titus.brown
 
(Em)Powering Science: High-Performance Infrastructure in Biomedical Science
(Em)Powering Science: High-Performance Infrastructure in Biomedical Science(Em)Powering Science: High-Performance Infrastructure in Biomedical Science
(Em)Powering Science: High-Performance Infrastructure in Biomedical ScienceAri Berman
 
Converged IT and Data Commons
Converged IT and Data CommonsConverged IT and Data Commons
Converged IT and Data CommonsSimon Twigger
 
Docker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce HoffDocker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce HoffDocker, Inc.
 
Minimal viable-datareuse-czi
Minimal viable-datareuse-cziMinimal viable-datareuse-czi
Minimal viable-datareuse-cziPaul Groth
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesGuy Coates
 

Similar to 2014 aus-agta (20)

2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
2013 caltech-edrn-talk
2013 caltech-edrn-talk2013 caltech-edrn-talk
2013 caltech-edrn-talk
 
HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talk
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynote
 
Cshl minseqe 2013_ouellette
Cshl minseqe 2013_ouelletteCshl minseqe 2013_ouellette
Cshl minseqe 2013_ouellette
 
Talk at OHSU, September 25, 2013
Talk at OHSU, September 25, 2013Talk at OHSU, September 25, 2013
Talk at OHSU, September 25, 2013
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloud
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012
 
(Em)Powering Science: High-Performance Infrastructure in Biomedical Science
(Em)Powering Science: High-Performance Infrastructure in Biomedical Science(Em)Powering Science: High-Performance Infrastructure in Biomedical Science
(Em)Powering Science: High-Performance Infrastructure in Biomedical Science
 
Converged IT and Data Commons
Converged IT and Data CommonsConverged IT and Data Commons
Converged IT and Data Commons
 
Docker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce HoffDocker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce Hoff
 
Minimal viable-datareuse-czi
Minimal viable-datareuse-cziMinimal viable-datareuse-czi
Minimal viable-datareuse-czi
 
METRO RDM Webinar
METRO RDM WebinarMETRO RDM Webinar
METRO RDM Webinar
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciences
 

More from c.titus.brown

More from c.titus.brown (20)

2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
2015 msu-code-review
2015 msu-code-review2015 msu-code-review
2015 msu-code-review
 
2015 mcgill-talk
2015 mcgill-talk2015 mcgill-talk
2015 mcgill-talk
 
2015 pycon-talk
2015 pycon-talk2015 pycon-talk
2015 pycon-talk
 
2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcast
 
2015 vancouver-vanbug
2015 vancouver-vanbug2015 vancouver-vanbug
2015 vancouver-vanbug
 
2015 osu-metagenome
2015 osu-metagenome2015 osu-metagenome
2015 osu-metagenome
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenome
 
2015 pag-chicken
2015 pag-chicken2015 pag-chicken
2015 pag-chicken
 
2015 pag-metagenome
2015 pag-metagenome2015 pag-metagenome
2015 pag-metagenome
 
2014 nyu-bio-talk
2014 nyu-bio-talk2014 nyu-bio-talk
2014 nyu-bio-talk
 
2014 bangkok-talk
2014 bangkok-talk2014 bangkok-talk
2014 bangkok-talk
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streaming
 
2014 nci-edrn
2014 nci-edrn2014 nci-edrn
2014 nci-edrn
 
2014 wcgalp
2014 wcgalp2014 wcgalp
2014 wcgalp
 
2014 moore-ddd
2014 moore-ddd2014 moore-ddd
2014 moore-ddd
 
2014 ismb-extra-slides
2014 ismb-extra-slides2014 ismb-extra-slides
2014 ismb-extra-slides
 
2014 bosc-keynote
2014 bosc-keynote2014 bosc-keynote
2014 bosc-keynote
 
2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grc
 
2014 ucl
2014 ucl2014 ucl
2014 ucl
 



2014 aus-agta

  • 1. WHAT’S AHEAD FOR BIOLOGY? THE DATA INTENSIVE FUTURE C. Titus Brown ctb@msu.edu Assistant Professor, Michigan State University (In January, moving to UC Davis / VetMed.) Talk slides on slideshare.net/c.titus.brown
  • 2. The Data Deluge (a traditional requirement for these talks)
  • 3. The short version • Data gathering & storage is growing, leaps & bounds! • Biology is completely unprepared for this at every level: • Technical and infrastructure • Cultural • Training • Our funding/incentivization/prioritization structures are also largely unprepared. • This is a huge missed opportunity!! (What does Titus think we should be doing?)
  • 4. Challenges: 1. Dealing with Big Data (my current research) 2. Interpreting the unknowns (future research) 3. Accelerating research with better data/methods/results sharing. 4. Expanding the role of exploratory data analysis in biology. (career windmill)
  • 5. 1. Dealing with Big Data A. Lossy compression B. Streaming algorithms
  • 6. Looking forward 5 years… Navin et al., 2011
  • 7. Some basic math: • 1000 single cells from a tumor… • …sequenced to 40x haploid coverage with Illumina… • …yields 120 Gbp each cell… • …or 120 Tbp of data. • HiSeq X10 can do the sequencing in ~3 weeks. • The variant calling will require 2,000 CPU weeks… • …so, given ~2,000 computers, can do this all in one month.
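The arithmetic on slide 7 can be checked with a few lines of Python. This is only a back-of-envelope sketch: the ~3 Gbp haploid genome size, the assumption of one CPU per computer, and perfectly parallel variant calling are not stated on the slide and are labeled as assumptions in the code.

```python
# Back-of-envelope check of the slide 7 numbers.
HAPLOID_GENOME_GBP = 3.0   # assumption: ~3 Gbp haploid human genome
COVERAGE = 40              # from the slide: 40x haploid coverage
CELLS = 1000               # from the slide: 1000 single cells
SEQUENCING_WEEKS = 3       # from the slide: ~3 weeks on a HiSeq X10
CPU_WEEKS = 2000           # from the slide: variant calling cost
COMPUTERS = 2000           # from the slide; assumption: one CPU each


def tumor_sequencing_estimate():
    """Return (Gbp per cell, total Tbp, total wall-clock weeks)."""
    gbp_per_cell = HAPLOID_GENOME_GBP * COVERAGE
    total_tbp = gbp_per_cell * CELLS / 1000      # 1 Tbp = 1000 Gbp
    calling_weeks = CPU_WEEKS / COMPUTERS        # assumes perfect parallelism
    total_weeks = SEQUENCING_WEEKS + calling_weeks
    return gbp_per_cell, total_tbp, total_weeks


if __name__ == "__main__":
    per_cell, total, weeks = tumor_sequencing_estimate()
    print(f"{per_cell:.0f} Gbp per cell, {total:.0f} Tbp total, ~{weeks:.0f} weeks")
```

Under these assumptions the sequencing (3 weeks) plus the parallel variant calling (1 week) comes to about one month, matching the slide.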
  • 8. Similar math applies: • Pathogen detection in blood; • Environmental sequencing; • Sequencing rare DNA from circulating blood. • Two issues: •Volume of data & compute infrastructure; • Latency for clinical applications.
  • 9. Approach A: Lossy compression (Reduce volume of data & compute infrastructure requirements). [Diagram labels: Raw data (~10-100 GB); Compression (~2 GB); Analysis; "Information" (~1 GB); Database & integration.] Lossy compression can substantially reduce data size while retaining information needed for later (re)analysis.
  • 21. e.g. de novo assembly now scales with richness, not diversity. • 10-100 fold decrease in memory requirements • 10-100 fold speed up in analysis Brown et al., arXiv, 2012
  • 22. Hey, cool, our approach and software is used by Illumina for long-read sequencing!
  • 23. Our general strategy: compressive prefilters. [Diagram labels: Raw data (~10-100 GB); Compression (~2 GB); Analysis; "Information" (~1 GB); Database & integration.] Save in cold storage. Save for reanalysis, investigation.
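One way a compressive prefilter of this kind could look is sketched below. It is loosely modelled on abundance-based read filtering (digital normalization): a read is kept only while the median count of its k-mers is still below a cutoff, so redundant high-coverage reads are discarded. The function names, the tiny `k`, the cutoff, and the use of an exact `Counter` (rather than a memory-efficient probabilistic counter) are all illustrative assumptions, not the talk's software.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return all overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize(reads, k=4, cutoff=2):
    """Keep a read only if its median k-mer count so far is below cutoff.

    Hypothetical sketch: exact counting, no reverse complements,
    no error handling for ambiguous bases.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to count
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads add to the counts
            kept.append(read)
    return kept


if __name__ == "__main__":
    reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
    print(normalize(reads))
```

The filter is lossy (discarded reads are gone from the compressed set) but single-pass, which is why the raw data would still be worth archiving separately.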
  • 24. Approach B: streaming data analysis (Reduce latency for clinical applications). Data → 1-pass → Answer. See also eXpress, Roberts et al., 2013.
  • 25. Current variant calling approaches are multipass: Data → Mapping → Sorting → Calling → Answer.
  • 26. Streaming graph-based approaches can detect information saturation
  • 27. Approach supports compute-intensive interludes – remapping, etc. Rimmer et al., 2014
  • 28. Streaming with bases. [Diagram: successive batches of bases (k, k+1, k+2, ...) are added to a graph, from which variants are called.]
  • 29. Integrate sequencing and analysis. [Diagram labels: Sequencing; Analysis; "Are we done yet?"] Decrease latency!
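The "are we done yet?" idea can be illustrated with a toy one-pass loop that stops reading once new data stops contributing new information. Here "information" is approximated by the fraction of previously unseen k-mers in each window of reads; a set of k-mers stands in for the graph, and the function name, window size, and threshold are hypothetical choices made for the sketch.

```python
def stream_until_saturated(reads, k=4, window=3, min_novel_fraction=0.1):
    """Consume reads one at a time; stop when a window adds few new k-mers.

    Returns (number of reads consumed, number of distinct k-mers seen).
    Hypothetical sketch: a real system would update a graph and call
    variants instead of only tracking a set of k-mers.
    """
    seen = set()
    consumed = 0
    window_new = 0
    window_total = 0
    for read in reads:
        consumed += 1
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            window_total += 1
            if kmer not in seen:
                seen.add(kmer)
                window_new += 1
        if consumed % window == 0:
            # End of a window: check how much of it was novel.
            if window_total and window_new / window_total < min_novel_fraction:
                return consumed, len(seen)
            window_new = 0
            window_total = 0
    return consumed, len(seen)


if __name__ == "__main__":
    stream = iter(["ACGTACGT"] * 100)
    print(stream_until_saturated(stream))
```

Because the loop returns early, the rest of the stream is never read, which is where the latency saving would come from if sequencing and analysis were coupled.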
  • 30. So, how do we deal with Big Data issues? • Fairly record cost of data analysis (running software & cost of computational infrastructure) • This incentivizes development of better approaches! • Lossy compression, streaming, …?? • Think 5 years ahead, rather than 2 years behind! • Pay attention to workflows, software lifecycle, etc. etc. (See ABiC 2014 talk :)
  • 31. 2. Dealing with the unknowns. Compare sample to control; eliminate all the things we don't know how to interpret; wonder why we can only account for ~50% of the effect. [Diagram labels: ~millions; ~10s of thousands.]
  • 32. “What is the function of ….?” We can observe almost everything at a DNA/RNA level! But, • Experimentally based functional annotations are sparse; • Most genes play multiple roles and are generally annotated for only one; • Model organisms are phylogenetically quite limited and biased; • …there is little or no $$$ or reputation gain for characterizing novel genes (nor is it straightforward or easy to do so!)
  • 33. The problem of lopsided gene characterization: e.g., the brain "ignorome" "...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery being associated with greater research momentum—a genomic bandwagon effect." Slide courtesy Erich Schwarz. Ref.: Pandey et al. (2014), PLoS One 9, e88889.
  • 34. How do we systematically broaden our functional understanding of genes? 1. More experimental work! • Population studies, perturbation studies, good ol’ fashioned molecular biology, etc. 2. Integrate modeling, to see where we have (or lack) sufficient knowledge for a particular phenotype. 3. Sequence it all and let the bioinformaticians sort it out! What I think will work best: a tight integration between all three approaches (cf. physics) – hypothesis-driven investigation, modeling, and exploratory data science. See also: ivory.idyll.org/blog/2014-function-of-unknown-genes.html
  • 35. 3. Accelerating research with better sharing of results, data, methods. Our current journal system is a 20th century solution to a 17th century problem. - Paraphrased from Cameron Neylon (Note: 20th century was LAST century)
  • 36. 3. Accelerating research with better sharing of results, data, methods. We could accelerate research with better sharing. Recent example re rare diseases: http://www.newyorker.com/magazine/2014/07/21/one-of-a-kind-2 “The current academic publication system does patients an enormous disservice.” – Daniel MacArthur There are many barriers to better communication of results, data, and methods, but most of them are cultural, not technical. (Much harder!)
  • 37. Preprints • Many fields (including bioinformatics and increasingly genomics) routinely share papers prior to publication. This facilitates reproduction, dissemination, and ultimately progress. • Biology is behind the times! See: 1. Haldane’s Sieve (blog discussion of preprints) 2. Evidence that preprints confer massive citation advantage in physics (http://arxiv.org/abs/0906.5418)
  • 38. Current model for data sharing. In a data limited world, this kind of made sense. Gather data → Interpret data → Write & publish paper → Grudgingly share as little data as possible.
  • 39. Current model for data sharing. This model ignores the fact that data often has multiple (unrealized or serendipitous) uses. (Among many other problems ;) Gather data → Interpret data → Write & publish paper → Grudgingly share as little data as possible.
  • 40. The train wreck ahead. When data is cheap, and interpretation is expensive, most data doesn’t get published, and therefore is lost. Gather data → Interpret data → Write & publish paper → Grudgingly share as little data as possible. (Program managers are not fans of this.)
  • 41. Data sharing challenges: • Few immediate or career rewards for sharing data; incentives are almost entirely punitive (if you DON’T…) • Sharing data in a usable form is still rather difficult. • Submitting data to archival services is, in many cases, surprisingly difficult. • Few methods for gaining recognition for data sharing prior to publication of conclusions.
  • 42. The Ocean Cruise Model One really expensive cruise, many data collectors, shared data. DeepDOM – photo courtesy E. Kujawinski, WHOI
  • 43. Sage Bionetworks / “walled garden” Collaborative data sharing policy with restricted access to outsiders; Central platform with analysis provenance tracking; A model for the future of biomedical research? See, e.g., Enabling transparent and collaborative computational analysis of 12 tumor types within The Cancer Genome Atlas. Omberg et al, 2014.
  • 44. Distributed cyberinfrastructure to encourage sharing? [Diagram labels: Web interface + API; Compute server (Galaxy? Arvados?); Data/Info; Raw data sets; Public servers; "Walled garden" server; Private server; Graph query layer; Upload/submit (NCBI, KBase); Import (MG-RAST, SRA, EBI).] ivory.idyll.org/blog/2014-moore-ddd-talk.html
  • 45. Better metadata collection is needed! Suppose the NSA could EITHER track who was calling whom, OR what they were saying – which would be more valuable? Who? What? Who?
  • 47. Better metadata collection is needed! We need to track sample origin, phenotype/environmental conditions, etc. [Diagram labels: Sample information; The -omic data; Phenotype.] This will facilitate discovery, serendipity, re-analysis, and cross-validation.
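As a rough illustration of what tracking this metadata alongside the data might look like, the sketch below pairs each -omic data file with its sample origin, phenotype, and conditions, then groups samples by phenotype so they can be found again for re-analysis. The record layout, field names, and example values are hypothetical, not a proposed standard.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleRecord:
    """Hypothetical minimal metadata record kept next to the -omic data."""
    sample_id: str
    origin: str       # where the sample came from
    phenotype: str    # observed phenotype
    conditions: str   # environmental or experimental conditions
    data_path: str    # location of the -omic data file


def group_by_phenotype(records):
    """Index sample IDs by phenotype to support cross-study re-analysis."""
    groups = defaultdict(list)
    for record in records:
        groups[record.phenotype].append(record.sample_id)
    return dict(groups)


if __name__ == "__main__":
    records = [
        SampleRecord("s1", "site A", "resistant", "25C", "s1.fastq"),
        SampleRecord("s2", "site B", "susceptible", "25C", "s2.fastq"),
        SampleRecord("s3", "site B", "resistant", "30C", "s3.fastq"),
    ]
    print(group_by_phenotype(records))
```

Even a flat record like this makes the "who" queryable without opening the sequence files themselves.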
  • 48. Data and software citation. There are now methods for: • Assigning DOIs to data (which makes it citable) – figshare, Dryad. • Data publications – GigaScience, SIGS, Scientific Data. • Software citations – Zenodo, MozSciLab/GitHub. • Software publications – F1000 Research. Will this address the need to incentivize data sharing and methods? Probably not, but it’s a good start ;)
  • 49. 4. Exploratory data analysis. Old model: Gather data → Interpret data.
  • 50. New model: Your data is most useful when combined with everyone else’s. [Diagram: Gather data → Interpret data, surrounded by many databases of other people's data.]
  • 51. Given enough publicly accessible data… [Diagram: Interpret data, surrounded by many databases of other people's data.]
  • 52. But: we face lack of training. The lack of training in data science is the biggest challenge facing biology. Students! There’s a great future in data analysis! Also see:
  • 53. Data integration? Once you have all the data, what do you do? "Business as usual simply cannot work." Looking at millions to billions of genomes. (David Haussler, 2014) Illumina estimate: 228,000 human genomes will be sequenced in 2014, mostly by researchers. http://www.technologyreview.com/news/531091/emtech-illumina-says-228000-human-genomes-will-be-sequenced-this-year/
  • 54. Looking to the future For the senior scientists and funders amongst us, • How do we incentivize data sharing, and training? • How do we fund the meso- and micro-scale cyberinfrastructure development that will accelerate bio? The NIH and NSF are exploring this; the Moore and Sloan foundations are simply doing it (but 1% the size). See: ivory.idyll.org/blog/2014-nih-adds-up-meeting.html
  • 56. For Australian students and early career researchers in bioinformatics and computational biology Annual Student Symposium Friday 28th November 2014 Parkville, Victoria Now accepting abstracts for talks and posters Talk abstracts close 31st October combine.org.au