Emerging challenges in data-intensive genomics
BioFrontiers Symposium
May 28, 2014
Mikael Huss, SciLifeLab / Stockholm
University, Sweden
Where I work
INTEGRATIVE AND TECHNOLOGY DRIVEN RESEARCH IN HIGH-THROUGHPUT BIOLOGY
SciLifeLab – an infrastructure for massive biology
Science 328,805 (14 May 2010)
- Inaugurated mid-2010
- Hosted by three universities in Stockholm: Karolinska Institutet (medical faculty), Royal Institute of Technology (technical) and Stockholm University (natural science). SciLifeLab node in Uppsala.
- Approximately 700 researchers
- More than 100 researchers in bioinformatics and systems biology
- Co-directors Prof Mathias Uhlén (KTH), Jan Andersson (KI), Gunnar von Heijne (SU)
Big is relative
… but some people are willing to go out on a limb
“Where is the cut-off? The line in the sand is 5TB of unstructured data or 7.5-10TB of structured data, which cannot be reduced any further”
(OLRAC SPS)
http://www.itweb.co.za/index.php?option=com_content&view=article&id=111815
“There is no such thing as biomedical big data”
(Will Bush, Vanderbilt University Center for Human Genetic Research)
http://gettinggeneticsdone.blogspot.se/2014/02/no-such-thing-biomedical-bigdata.html
Genomics big data in context: Throughput
[Figure: data processed per day (terabytes), log-scale axis with ticks at 1e-01, 1e+01, 1e+03 and 1e+05, for Twitter, SciLifeLab, King, NYSE, Sanger, Spotify, BGI, Facebook, Baidu, Google, Ebay, Internet and World]
Genomics big data in context: Storage
[Figure: data stored (petabytes), log-scale axis with ticks at 1e-01, 1e+01 and 1e+03, for Twitter, SciLifeLab, Spotify, Sanger, Ebay, Facebook, Baidu, NSA and Google]
Genomics big data in context: Heterogeneity
“The size of the data is not the whole story.
If the data are uniform, they can almost always
be compressed and filtered with traditional
methods.
You do not get a ‘big data’ processing challenge
until other factors, such as variety, non-
uniformity and continuous growth, are added to
a large data set.”
(adapted from Aleksi Kallio)
DNA sequencing with and without a reference
Reference-based
Analogous to matching words
and sentence fragments to a
book. Called “alignment” or
“mapping”.
Algorithmically: Matching
strings to an index.
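As a toy illustration of "matching strings to an index", the sketch below builds a plain k-mer hash index over a reference string and uses it to look up exact positions of a read. This is only a minimal sketch: real aligners typically use compressed indexes and tolerate mismatches, and the function names, the value of k and the sequences are made up for illustration.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k):
    """Return reference positions where the whole read matches exactly.

    The first k-mer of the read is used as a seed; each seed hit is then
    verified by comparing the full read against the reference.
    """
    if len(read) < k:
        return []
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


reference = "ACGTACGTTAGCACGTTAGG"  # made-up reference sequence
index = build_kmer_index(reference, k=4)
positions = map_read("ACGTTAG", reference, index, k=4)
print(positions)  # [4, 12]
```

The index is built once and reused for every read, which is why the reference-based approach scales to millions of reads against a fixed genome.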
Reference-free
Analogous to reconstructing a
book from scratch based on
only the words and sentence
fragments. Called (de novo)
“assembly”.
Algorithmically: finding the
best path through a very
complicated graph.
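To make the graph view of assembly concrete, here is a minimal sketch of one common formulation, a de Bruijn graph, where nodes are (k-1)-mers and edges are k-mers observed in the reads. It only follows a path while the graph does not branch; real assemblers must also deal with sequencing errors, repeats and coverage, which is where the "very complicated graph" comes from. The reads and function names are invented for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are distinct k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph):
    """Spell out a contig by following edges while the path does not branch.

    Starts at a node with no incoming edges and stops at a dead end,
    a branch point, or a node that was already visited.
    """
    targets = {t for successors in graph.values() for t in successors}
    starts = sorted(node for node in graph if node not in targets)
    if not starts:
        return ""
    node = starts[0]
    contig = node
    seen = {node}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # made-up overlapping fragments
contig = walk_unambiguous_path(de_bruijn_graph(reads, k=4))
print(contig)  # ATGGCGTGCA
```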
Reference-based example
http://www.slideshare.net/gcoates/next-generation-genomcs-petascale-data-in-the-life-sciences
Find genetic variants relative to human reference genome
Metagenomics
Community genomics: sequencing environmental samples (ocean, soil, etc.)
Continuously monitoring environmental DNA
Discovering new bacterial strains, viruses, antibiotic resistance genes
Metagenomics
Human microbiome
Estimated that there are 3-10 times as
many bacterial cells as human cells in
the body
Also, viruses and bacteriophages
Diagnostics
“NGS saves a young life”, http://omicsomics.blogspot.se/2014/02/ngs-saves-young-life.html
Storified tweets about this story: http://nextgenseek.com/2014/02/ngs-in-critical-care-a-feel-good-story/
Joe DeRisi (UCSF)
14-year-old boy came in with various symptoms for which the
underlying problem was hard to diagnose
In the end, a 1 cubic centimeter brain biopsy was taken and
sequenced on a MiSeq instrument, which identified a pathogen
(Leptospira)
Sequencing took ~1 day including lab work, analysis took 1.5 h
Image from Charles Chiu, UCSF
Real-time/streaming bioinformatics needed!
The unknown
http://www.ted.com/talks/nathan_wolfe_what_s_left_to_explore.html
“Biological dark matter”
“The unknown continent”
According to one estimate,
less than 1% of the viral
diversity has been explored!
The unknown
In a recent paper on soil metagenomics, Titus Brown and colleagues report that:
- 80% of the 398 billion sequences could not be assembled into putative genes
- Of the cases where sequences could be assembled into putative genes which would create putative proteins, 60% of these proteins could not be matched to anything in the databases!
Hunting viral pathogens
Many academic groups and companies are trying to identify viruses that might
be involved in a variety of diseases in humans and animals.
“Needle in a haystack” problem. Real-life example:
- Sequence human or animal tissue samples. (~30-40 million sequences).
- Filter out host DNA in the computer.
- Try to match rest of sequences to databases of known viruses.
- For whatever is left, assemble sequences de novo and match the assembled
“genes” to “everything” out there (=NCBI’s NT and other databases).
- End up with ~20,000 putative genes that don’t resemble anything in the
databases.
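The filtering steps above can be sketched in a few lines. This is a toy version that assumes shared k-mers are a good enough proxy for "matches the host" or "matches a known virus"; in practice these steps are run with dedicated alignment and search tools against large databases. All sequences, names and thresholds here are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_kmer(read, kmer_set, k):
    """True if the read shares at least one k-mer with the given set."""
    return not kmers(read, k).isdisjoint(kmer_set)


def triage_reads(reads, host_genome, known_viruses, k=5):
    """Split reads into host, known-virus and unknown bins.

    known_viruses maps a virus name to its sequence. Host filtering is
    applied first, mirroring the order of the steps in the list above.
    """
    host_kmers = kmers(host_genome, k)
    virus_kmers = {name: kmers(seq, k) for name, seq in known_viruses.items()}
    bins = {"host": [], "known_virus": [], "unknown": []}
    for read in reads:
        if shares_kmer(read, host_kmers, k):
            bins["host"].append(read)
            continue
        match = next((name for name, ks in virus_kmers.items()
                      if shares_kmer(read, ks, k)), None)
        if match is not None:
            bins["known_virus"].append((read, match))
        else:
            bins["unknown"].append(read)  # candidates for de novo assembly
    return bins


host = "ACGTACGTACGGATCCATCG"               # made-up host sequence
viruses = {"toy_virus": "TTGACCTTGACAGGA"}  # made-up known virus
reads = ["CGTACGG", "GACCTTG", "GGGGGCC"]
result = triage_reads(reads, host, viruses)
print(result)
```

Whatever lands in the "unknown" bin corresponds to the material that would go on to de novo assembly and the broad database search.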
Public data
We realized there is a lot of data online, although scattered around.
Can use the raw or assembled sequences from these studies as part of our own
studies.
Also by combining different data sets and their metadata, we may get clues about
what the unknown things are.
Problems:
1) Sequence comparisons take a long time – need more efficient algorithms.
2) Publicly available data is scattered and disorganized, and much that could be
public isn’t.
Wishlist
- Everybody who is doing metagenomics is finding a lot of unknown stuff!
- Make as much sequence data as possible available
- Build infrastructure to make all the sequences findable and queryable so that we can
identify commonalities between data sets
- String matching algorithms better adapted for “big-data” use cases in
genomics:
- Real-time (streaming) matching, for diagnostics and environmental
monitoring
- More efficient matching of sequences to huge reference indexes (“every
known sequence”)
- Develop more reference-free methods for discovering new organisms and
genes
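One way to picture the real-time (streaming) matching item is a matcher that consumes reads as they come off the instrument and raises a flag as soon as enough evidence has accumulated, instead of waiting for the whole run. The sketch below is only an illustration under simple assumptions: a "hit" is any shared k-mer, and the reference, reads and `min_hits` threshold are invented.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def stream_matches(read_stream, references, k=5, min_hits=2):
    """Consume reads one at a time and yield (name, reads_seen) as soon as
    min_hits reads have matched a reference, without waiting for the end."""
    ref_kmers = {name: kmers(seq, k) for name, seq in references.items()}
    hits = Counter()
    reported = set()
    for n_reads, read in enumerate(read_stream, start=1):
        read_kmers = kmers(read, k)
        for name, ks in ref_kmers.items():
            if name not in reported and not read_kmers.isdisjoint(ks):
                hits[name] += 1
                if hits[name] >= min_hits:
                    reported.add(name)
                    yield name, n_reads


references = {"toy_pathogen": "ATGCCGTTAGCATTGACC"}  # made-up reference
incoming = iter(["GGGGGGG", "CCGTTAG", "AAAAAAA", "CATTGAC", "TTTTTTT"])
alerts = list(stream_matches(incoming, references))
print(alerts)  # [('toy_pathogen', 4)]
```

Because the function is a generator, a caller can act on the first alert while later reads are still arriving.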
Efforts towards these goals
“we want to support automated data exploration in ways that are simply not possible today”
C Titus Brown (http://ivory.idyll.org/blog/2014-moore-ddd-round2-final.html)
Jeff Jonas:
“Data finds data”
“The data is the query”
Using the dataset itself, or a statistical
description of it, as a query
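A minimal sketch of what "the data is the query" could look like for sequence data: each data set is summarised by its k-mer profile, and archived data sets are ranked by how similar their profile is to that of the query data set. The use of Jaccard similarity on exact k-mer sets is an assumption made for this illustration, not something stated in the talk, and the data set names and sequences are invented.

```python
def kmer_profile(sequences, k=4):
    """Summarise a data set as the set of k-mers seen in any of its sequences."""
    profile = set()
    for seq in sequences:
        profile.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return profile


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_datasets(query_sequences, archive, k=4):
    """Rank archived data sets by similarity of their k-mer profile to the query."""
    query = kmer_profile(query_sequences, k)
    scores = {name: jaccard(query, kmer_profile(seqs, k))
              for name, seqs in archive.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


archive = {  # made-up archived data sets
    "soil_sample": ["ACGTACGTTAGC", "TTAGCCGATA"],
    "gut_sample": ["GGGCCCAAATTT", "CCCAAATTTGGG"],
}
query = ["CGTACGTTAG", "AGCCGAT"]
ranking = rank_datasets(query, archive)
print(ranking)
```

At realistic scale the exact k-mer sets would be far too large to compare directly, so a compact statistical summary of each data set would take their place.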
Efforts towards these goals
Competition hosted by Innocentive on behalf of the US Defense Threat Reduction Agency
Helix.io
Genetic classification
startup
Competitions as a way to drive innovation
Sage Bionetworks’ competition platform
Can build directly on each other’s code!
SAGE/DREAM breast cancer challenge
Winner of the Innocentive challenge
http://www.newton.ac.uk/programmes/MTG/seminars/2014032415301.html
CLARITY challenge
Identifying possible disease causal
genetic variants in three children
Summary
DNA sequencing has great potential for improved diagnostics and pathogen
discovery
We need more efforts in real-time sequence analysis for diagnostics and
monitoring
We need better ways to publish and connect data sets online to enable more
efficient and unbiased discovery
Online collaboration can help both through open data and online competitions
@mikaelhuss
http://followthedata.wordpress.com
Acknowledgements
Research environment
Thomas Svensson + the rest of the WABI group
Joakim Lundeberg + his group members
Helpful comments
Petter Holme
Stefania Giacomello
Mattias Andersson
Metagenomics discussions
Anders Andersson + group
Hilja Strid
Joakim Larsson
Johan Bengtsson-Palme
+ the readers of my blog and all the data enthusiasts in Stockholm and elsewhere!
Extra slides
Why hasn’t Hadoop caught on in genomics?
Hadoop is almost synonymous with big data in the corporate world
Ideas:
– Existing computing infrastructure is sufficient
– Or, focused on supercomputing solutions rather than commodity
servers
– The programming skills and training are not there
– Many problems not parallelisable
– Not enough flexibility for exploratory analysis

Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 

Emerging challenges in data-intensive genomics

  • 1. Emerging challenges in data-intensive genomics. BioFrontiers Symposium, May 28, 2014. Mikael Huss, SciLifeLab / Stockholm University, Sweden
  • 3. SciLifeLab – an infrastructure for massive biology. Science 328, 805 (14 May 2010). Inaugurated mid-2010. Hosted by three universities in Stockholm: Karolinska Institutet (medical faculty), Royal Institute of Technology (technical) and Stockholm University (natural science). SciLifeLab node in Uppsala. Approximately 700 researchers. More than 100 researchers in bioinformatics and systems biology. Co-directors: Prof Mathias Uhlén (KTH), Jan Andersson (KI), Gunnar von Heijne (SU).
  • 5. … but some people are willing to go out on a limb. “Where is the cut-off? The line in the sand is 5TB of unstructured data or 7.5-10TB of structured data, which cannot be reduced any further” (OLRAC SPS) http://www.itweb.co.za/index.php?option=com_content&view=article&id=111815 “There is no such thing as biomedical big data” (Will Bush, Vanderbilt University Center for Human Genetic Research) http://gettinggeneticsdone.blogspot.se/2014/02/no-such-thing-biomedical-bigdata.html
  • 6. Genomics big data in context: Throughput. [Figure: data processed per day in terabytes, log scale from 1e−01 to 1e+05. Labels: Twitter, SciLifeLab, King, NYSE, Sanger, Spotify, BGI, Facebook, Baidu, Google, Ebay, Internet, World.]
  • 7. Genomics big data in context: Storage. [Figure: data stored in petabytes, log scale from 1e−01 to 1e+03. Labels: Twitter, SciLifeLab, Spotify, Sanger, Ebay, Facebook, Baidu, NSA, Google.]
  • 8. Genomics big data in context: Heterogeneity. “The size of the data is not the whole story. If the data are uniform, they can almost always be compressed and filtered with traditional methods. You do not get a ‘big data’ processing challenge until other factors, such as variety, non-uniformity and continuous growth, are added to a large data set.” (adapted from Aleksi Kallio)
  • 9. DNA sequencing with and without a reference. Reference-based: analogous to matching words and sentence fragments to a book. Called “alignment” or “mapping”. Algorithmically: matching strings to an index. Reference-free: analogous to reconstructing a book from scratch based on only the words and sentence fragments. Called (de novo) “assembly”. Algorithmically: finding the best path through a very complicated graph.
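
The "matching strings to an index" idea on slide 9 can be illustrated with a minimal Python sketch. It assumes exact matching only and uses a plain k-mer hash index; the names `build_kmer_index` and `map_read` and the toy sequences are hypothetical, and real aligners use compressed indexes and tolerate mismatches.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_read(read, reference, index, k):
    """Return reference positions where the whole read matches exactly.

    The first k-mer of the read is used as a seed; every candidate
    position found in the index is then verified against the full read.
    """
    if len(read) < k:
        return []
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits

# Toy data, invented for illustration only.
reference = "ACGTACGTTAGCCGATAGCTA"
index = build_kmer_index(reference, k=4)
print(map_read("TAGCCGAT", reference, index, k=4))
```

The index is built once and reused for every read, which is what makes reference-based analysis cheap compared with reference-free assembly.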
  • 11. Metagenomics: community genomics. Sequencing environmental samples: ocean, soil, etc. Continuously monitoring environmental DNA. Discovering new bacterial strains, viruses, antibiotic resistance genes.
  • 12. Metagenomics: human microbiome. It is estimated that there are 3-10 times as many bacterial cells as human cells in the body. Also, viruses and bacteriophages.
  • 13. Diagnostics. “NGS saves a young life” (http://omicsomics.blogspot.se/2014/02/ngs-saves-young-life.html). Storified tweets about this story: http://nextgenseek.com/2014/02/ngs-in-critical-care-a-feel-good-story/ Joe DeRisi (UCSF): a 14-year-old boy came in with various symptoms whose underlying cause was hard to diagnose. In the end, a 1 cubic centimeter brain biopsy was taken and sequenced on a MiSeq instrument, which identified a pathogen (Leptospira). Sequencing took ~1 day including lab work; analysis took 1.5 h. Image from Charles Chiu, UCSF. Real-time/streaming bioinformatics needed!
  • 14. The unknown. http://www.ted.com/talks/nathan_wolfe_what_s_left_to_explore.html “Biological dark matter”. “The unknown continent”. According to one estimate, less than 1% of the viral diversity has been explored!
  • 15. The unknown. In a recent paper on soil metagenomics, Titus Brown and colleagues report that: (1) 80% of the 398 billion sequences could not be assembled into putative genes; (2) of the sequences that could be assembled into putative genes encoding putative proteins, 60% of these proteins could not be matched to anything in the databases!
  • 16. Hunting viral pathogens. Many academic groups and companies are trying to identify viruses that might be involved in a variety of diseases in humans and animals. “Needle in a haystack” problem. Real-life example: (1) Sequence human or animal tissue samples (~30-40 million sequences). (2) Filter out host DNA in the computer. (3) Try to match the rest of the sequences to databases of known viruses. (4) For whatever is left, assemble sequences de novo and match the assembled “genes” to “everything” out there (= NCBI’s NT and other databases). (5) End up with ~20,000 putative genes that don’t resemble anything in the databases.
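
The filtering steps in the real-life example above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the pipeline used in the talk: similarity is approximated by the fraction of shared k-mers, the function names (`kmers`, `shared_fraction`, `triage_reads`), the threshold and all sequences are hypothetical, and the de novo assembly of the leftover reads is not shown.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_fraction(read, ref_kmers, k):
    """Fraction of the read's k-mers that also occur in a reference k-mer set."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0
    return len(read_kmers & ref_kmers) / len(read_kmers)

def triage_reads(reads, host_seq, virus_db, k=5, threshold=0.5):
    """Split reads into known-virus hits and unexplained leftovers.

    Host-like reads are dropped first, then the remaining reads are
    compared against each known virus; anything unmatched is 'unknown'.
    """
    host_kmers = kmers(host_seq, k)
    virus_kmers = {name: kmers(seq, k) for name, seq in virus_db.items()}
    known, unknown = {}, []
    for read in reads:
        if shared_fraction(read, host_kmers, k) >= threshold:
            continue  # host DNA is filtered out
        best_name, best_score = None, 0.0
        for name, ref in virus_kmers.items():
            score = shared_fraction(read, ref, k)
            if score > best_score:
                best_name, best_score = name, score
        if best_score >= threshold:
            known.setdefault(best_name, []).append(read)
        else:
            unknown.append(read)  # candidates for de novo assembly
    return known, unknown

# Toy data, invented for illustration only.
host = "ACGTACGTACGTACGTACGT"
viruses = {"virus_A": "TTGACCATGGCAATTCCGGA"}
reads = ["GTACGTACGT", "CCATGGCAAT", "GGGCCCTTTAAAGG"]
known, unknown = triage_reads(reads, host, viruses)
print(known, unknown)
```

At the scale mentioned on the slide (tens of millions of reads against "everything"), the all-against-all comparison in the inner loop is exactly the part that becomes the bottleneck.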
  • 17. Public data. We realized there is a lot of data online, although it is scattered around. We can use the raw or assembled sequences from these studies as part of our own studies. Also, by combining different data sets and their metadata, we may get clues about what the unknown things are. Problems: 1) Sequence comparisons take a long time, so more efficient algorithms are needed. 2) Publicly available data is scattered and disorganized, and much that could be public isn’t.
  • 18. Wishlist - Everybody who is doing metagenomics is finding a lot of unknown stuff! - Make as much sequence data as possible available - Build to make all the sequences findable and queryable so that we can identify commonalities between data sets - String matching algorithms better adapted for “big-data” use cases in genomics: - Real-time (streaming) matching, for diagnostics and environmental monitoring - More efficient matching of sequences to huge reference indexes (“every known sequence”) - Develop more reference-free methods for discovering new organisms and genes
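
As a rough sketch of what "real-time (streaming) matching" from the wishlist could look like, the generator below consumes reads one at a time and reports a signature as soon as enough reads support it, without waiting for the sequencing run to finish. Everything here is an assumption for illustration: the names `signature_kmers` and `stream_detect`, the rule that a single shared k-mer counts as support, and the toy sequences.

```python
from collections import Counter

def signature_kmers(seq, k):
    """Return the set of all k-mers in a signature sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def stream_detect(read_stream, signatures, k=5, min_reads=2):
    """Yield (name, reads_seen) the moment a signature has min_reads supporting reads.

    A read supports a signature if it shares at least one k-mer with it.
    This is a deliberately crude criterion chosen to keep the sketch short.
    """
    index = {}
    for name, seq in signatures.items():
        for kmer in signature_kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    support = Counter()
    reported = set()
    for n_seen, read in enumerate(read_stream, start=1):
        matched = set()
        for i in range(len(read) - k + 1):
            matched |= index.get(read[i:i + k], set())
        for name in matched:
            support[name] += 1
            if support[name] >= min_reads and name not in reported:
                reported.add(name)
                yield name, n_seen

# Toy data, invented for illustration only.
signatures = {"pathogen_X": "TTGACCATGGCAATTCCGGA"}
incoming = iter(["ACGTACGTAC", "GACCATGGCA", "AAAAAAAAAA",
                 "CAATTCCGGA", "TTTTTTTTTT"])
for name, n_seen in stream_detect(incoming, signatures):
    print(f"{name} detected after {n_seen} reads")
```

Because the function is a generator over an iterator, a caller could stop reading as soon as the first detection arrives, which is the property that matters for diagnostics and monitoring.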
  • 19. Efforts towards these goals. “we want to support automated data exploration in ways that are simply not possible today” (C Titus Brown, http://ivory.idyll.org/blog/2014-moore-ddd-round2-final.html). Jeff Jonas: “Data finds data”. “The data is the query”: using the dataset itself, or a statistical description of it, as a query.
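
One possible reading of "the data is the query" is to summarize each data set as a k-mer profile and rank other data sets by how similar their profiles are. The sketch below assumes Jaccard similarity on exact k-mer sets as the "statistical description"; this choice, the function names and the toy collection are hypothetical and not taken from the talk.

```python
def kmer_profile(sequences, k):
    """Summarize a data set as the set of k-mers found in its sequences."""
    profile = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            profile.add(seq[i:i + k])
    return profile

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_datasets(query_seqs, collection, k=4):
    """Rank data sets in a collection by profile similarity to the query data set."""
    query = kmer_profile(query_seqs, k)
    scores = [(name, jaccard(query, kmer_profile(seqs, k)))
              for name, seqs in collection.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Toy data, invented for illustration only.
collection = {"soil": ["ACGTACGTAC"], "ocean": ["TTTTGGGGCC"]}
ranked = rank_datasets(["ACGTACGTTT"], collection)
print(ranked)
```

For realistically sized data sets, the exact sets would have to be replaced by compact sketches, but the query pattern (whole data set in, ranked data sets out) stays the same.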
  • 20. Efforts towards these goals. Competition hosted by Innocentive on behalf of the US Defense Threat Reduction Agency. Helix.io: genetic classification startup.
  • 21. Competitions as a way to drive innovation
  • 22. Competitions as a way to drive innovation
  • 23. Competitions as a way to drive innovation. Sage Bionetworks’ competition platform: can build directly on each other’s code! SAGE/DREAM breast cancer challenge. Winner of the Innocentive challenge: http://www.newton.ac.uk/programmes/MTG/seminars/2014032415301.html CLARITY challenge: identifying possible disease causal genetic variants in three children.
  • 24. Summary. DNA sequencing has great potential for improved diagnostics and pathogen discovery. We need more efforts in real-time sequence analysis for diagnostics and monitoring. We need better ways to publish and connect data sets online to enable more efficient and unbiased discovery. Online collaboration can help both through open data and online competitions. @mikaelhuss http://followthedata.wordpress.com
  • 25. Acknowledgements. Research environment: Thomas Svensson + the rest of the WABI group; Joakim Lundeberg + his group members. Helpful comments: Petter Holme, Stefania Giacomello, Mattias Andersson. Metagenomics discussions: Anders Andersson + group, Hilja Strid, Joakim Larsson, Johan Bengtsson-Palme. + the readers of my blog and all the data enthusiasts in Stockholm and elsewhere!
  • 27. Why hasn’t Hadoop caught on in genomics? Hadoop is almost synonymous with big data in the corporate world. Ideas: – Existing computing infrastructure is sufficient. – Or, the field is focused on supercomputing solutions rather than commodity servers. – The programming skills and training are not there. – Many problems are not parallelisable. – Not enough flexibility for exploratory analysis.