Emerging challenges in data-intensive genomics

BioFrontiers Symposium
May 28, 2014
Mikael Huss, SciLifeLab / Stockholm
University, Sweden

Where I work
Integrative and technology-driven research in high-throughput biology

SciLifeLab – an infrastructure for massive biology
Science 328, 805 (14 May 2010)
- Inaugurated mid-2010
- Hosted by three universities in Stockholm: Karolinska Institutet (medical faculty), Royal Institute of Technology (technical) and Stockholm University (natural science). SciLifeLab node in Uppsala.
- Approximately 700 researchers
- More than 100 researchers in bioinformatics and systems biology
- Co-directors Prof Mathias Uhlén (KTH), Jan Andersson (KI), Gunnar von Heijne (SU)

… but some people are willing to go out on a limb
“Where is the cut-off? The line in the sand is 5TB of unstructured data or 7.5-10TB of structured data, which cannot be reduced any further”
(OLRAC SPS)
http://www.itweb.co.za/index.php?option=com_content&view=article&id=111815
“There is no such thing as biomedical big data”
(Will Bush, Vanderbilt University Center for Human Genetic Research)
http://gettinggeneticsdone.blogspot.se/2014/02/no-such-thing-biomedical-bigdata.html

Genomics big data in context: Throughput
[Figure: data processed per day in terabytes, log scale from 1e-01 to 1e+05. Labels: Twitter, SciLifeLab, King, NYSE, Sanger, Spotify, BGI, Facebook, Baidu, Google, Ebay, Internet (world).]

Genomics big data in context: Storage
[Figure: data stored in petabytes, log scale from 1e-01 to 1e+03. Labels: Twitter, SciLifeLab, Spotify, Sanger, Ebay, Facebook, Baidu, NSA, Google.]

Genomics big data in context: Heterogeneity
“The size of the data is not the whole story.
If the data are uniform, they can almost always be compressed and filtered with traditional methods.
You do not get a ‘big data’ processing challenge until other factors, such as variety, non-uniformity and continuous growth, are added to a large data set.”
(adapted from Aleksi Kallio)

DNA sequencing with and without a reference
Reference-based
Analogous to matching words and sentence fragments to a book. Called “alignment” or “mapping”.
Algorithmically: matching strings to an index.
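To make "matching strings to an index" concrete, here is a minimal Python sketch, assuming a simple k-mer hash index and a seed-and-verify lookup. The names `build_kmer_index` and `map_read` and the toy sequences are illustrative assumptions, not anything from the talk; production aligners use compressed indexes and far more careful scoring.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return (position, mismatches) for each placement seeded by the read's first k bases."""
    hits = []
    seed = read[:k]  # assumption: the first k bases of the read are used as the seed
    for pos in index.get(seed, []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # the read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


# Toy data (hypothetical): one short reference and two reads
reference = "ACGTACGTTAGCCGATAGGCTA"
index = build_kmer_index(reference, k=4)
exact_hits = map_read("TAGCCGAT", reference, index, k=4)
near_hits = map_read("TAGCCGTT", reference, index, k=4)
print(exact_hits, near_hits)
```

The index is built once and reused for every read, which is why the reference-based approach scales to millions of reads against a fixed genome.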
Reference-free
Analogous to reconstructing a book from scratch based on only the words and sentence fragments. Called (de novo) “assembly”.
Algorithmically: finding the best path through a very complicated graph.
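One common way to set up the graph for assembly is a de Bruijn graph, in which overlapping k-mers from the reads become edges. The sketch below assumes error-free reads and no repeats, so that the "best path" is simply the only path; `de_bruijn_graph` and `walk_unambiguous_path` are hypothetical helper names. Real assemblers spend most of their effort on the branching and erroneous cases this toy ignores.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer in a read adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from start while each node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:
            break  # stop on a cycle instead of looping forever
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


# Toy data (hypothetical): three overlapping reads from one short sequence
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
contig = walk_unambiguous_path(graph, "ATG")
print(contig)
```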

Reference-based example
http://www.slideshare.net/gcoates/next-generation-genomcs-petascale-data-in-the-life-sciences
Find genetic variants relative to human reference genome

Community genomics
Sequencing environmental samples: ocean, soil, etc.
Metagenomics
Continuously monitoring environmental DNA
Discovering new bacterial strains, viruses,
antibiotic resistance genes

Metagenomics
Human microbiome
Estimated that there are 3-10 times as
many bacterial cells as human cells in
the body
Also, viruses and bacteriophages

Diagnostics
“NGS saves a young life”, http://omicsomics.blogspot.se/2014/02/ngs-saves-young-life.html
Storified tweets about this story: http://nextgenseek.com/2014/02/ngs-in-critical-care-a-feel-good-story/
Joe DeRisi (UCSF)
14-year-old boy came in with various symptoms for which the
underlying problem was hard to diagnose
In the end, a 1 cubic centimeter brain biopsy was taken and sequenced on a MiSeq instrument, which identified a pathogen (Leptospira)
Sequencing took ~1 day including lab work, analysis took 1.5 h
Image from Charles Chiu, UCSF
Real-time/streaming bioinformatics needed!

The unknown
http://www.ted.com/talks/nathan_wolfe_what_s_left_to_explore.html
“Biological dark matter”
“The unknown continent”
According to one estimate,
less than 1% of the viral
diversity has been explored!

The unknown
In a recent paper on soil metagenomics, Titus Brown and colleagues report that:
- 80% of the 398 billion sequences could not be assembled into putative genes
- Of the sequences that could be assembled into putative genes encoding putative proteins, 60% of these proteins could not be matched to anything in the databases!

Hunting viral pathogens
Many academic groups and companies are trying to identify viruses that might
be involved in a variety of diseases in humans and animals.
“Needle in a haystack” problem. Real-life example:
- Sequence human or animal tissue samples (~30-40 million sequences).
- Filter out host DNA in the computer.
- Try to match rest of sequences to databases of known viruses.
- For whatever is left, assemble sequences de novo and match the assembled
“genes” to “everything” out there (=NCBI’s NT and other databases).
- End up with ~20,000 putative genes that don’t resemble anything in the databases.
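The filtering steps above can be sketched as a small triage function. This is a toy under stated assumptions: exact shared k-mers stand in for the alignment and database searches a real pipeline would use, and `triage_reads`, the bin names and the sequences are all hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_kmer(read, kmer_set, k):
    """True if the read contains at least one k-mer from kmer_set."""
    return any(read[i:i + k] in kmer_set for i in range(len(read) - k + 1))


def triage_reads(reads, host_seqs, virus_seqs, k=8):
    """Split reads into host, known-virus and unknown bins (toy version)."""
    host_kmers = set().union(*(kmers(s, k) for s in host_seqs))
    virus_kmers = set().union(*(kmers(s, k) for s in virus_seqs))
    bins = {"host": [], "known_virus": [], "unknown": []}
    for read in reads:
        if shares_kmer(read, host_kmers, k):
            bins["host"].append(read)  # filtered out as host DNA
        elif shares_kmer(read, virus_kmers, k):
            bins["known_virus"].append(read)  # matches a known virus
        else:
            bins["unknown"].append(read)  # candidates for de novo assembly
    return bins


# Toy data (hypothetical sequences)
host = ["ACGTACGTACGTACGT"]
known_viruses = ["TTGACCGGTTAACCGG"]
sample_reads = ["GTACGTACGT", "GACCGGTTAA", "CCCCCAAAAA"]
bins = triage_reads(sample_reads, host, known_viruses, k=8)
print(bins)
```

Only the "unknown" bin would go on to de novo assembly and the search against "everything", which is where the haystack gets expensive.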

Public data
We realized there is a lot of data online, although scattered around.
Can use the raw or assembled sequences from these studies as part of our own
studies.
Also by combining different data sets and their metadata, we may get clues about
what the unknown things are.
Problems:
1) Sequence comparisons take a long time – need more efficient algorithms.
2) Publicly available data is scattered and disorganized, and much that could be
public isn’t.

Wishlist
- Everybody who is doing metagenomics is finding a lot of unknown stuff!
- Make as much sequence data as possible available
- Build infrastructure to make all the sequences findable and queryable so that we can identify commonalities between data sets
- String matching algorithms better adapted for “big-data” use cases in
genomics:
  - Real-time (streaming) matching, for diagnostics and environmental monitoring
  - More efficient matching of sequences to huge reference indexes (“every known sequence”)
- Develop more reference-free methods for discovering new organisms and
genes
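As an illustration of the real-time (streaming) matching item, here is a minimal sketch that classifies reads one at a time and reports a reference as soon as enough reads support it, without waiting for the sequencing run to finish. The k-mer signature lookup, the `min_hits` threshold and the names `build_signatures` and `stream_classify` are assumptions made for this example.

```python
from collections import Counter


def build_signatures(references, k):
    """Map each k-mer to the set of reference names that contain it."""
    signatures = {}
    for name, seq in references.items():
        for i in range(len(seq) - k + 1):
            signatures.setdefault(seq[i:i + k], set()).add(name)
    return signatures


def stream_classify(read_stream, signatures, k, min_hits=3):
    """Yield (reads_seen, name) when a reference first reaches min_hits supporting reads."""
    counts = Counter()
    reported = set()
    for n, read in enumerate(read_stream, start=1):
        matched = set()
        for i in range(len(read) - k + 1):
            matched |= signatures.get(read[i:i + k], set())
        for name in sorted(matched):
            counts[name] += 1  # each read counts at most once per reference
            if counts[name] >= min_hits and name not in reported:
                reported.add(name)
                yield n, name


# Toy data (hypothetical references and reads)
references = {
    "pathogen_A": "ACGTTGCAAGGCTTAACCGG",
    "pathogen_B": "TTTTGGGGCCCCAAAATTTT",
}
signatures = build_signatures(references, k=6)
incoming = iter(["ACGTTGCA", "NNNNNNNN", "GGCTTAAC", "TGCAAGGC"])
detections = list(stream_classify(incoming, signatures, k=6, min_hits=3))
print(detections)
```

Because `stream_classify` is a generator, a caller can act on the first detection while reads are still arriving from the instrument.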

Efforts towards these goals
“we want to support automated data exploration in ways that are simply not possible today”
C Titus Brown (http://ivory.idyll.org/blog/2014-moore-ddd-round2-final.html)
Jeff Jonas:
“Data finds data”
“The data is the query”
Using the dataset itself, or a statistical
description of it, as a query
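One possible reading of "a statistical description of the dataset as a query" is a MinHash-style sketch of its k-mer content. The example below is an assumption about how such a query could work, not a description of any system mentioned in the talk; `minhash_sketch` and `sketch_similarity` are hypothetical names. It keeps only the smallest hash values per dataset and estimates Jaccard similarity from them.

```python
import hashlib


def minhash_sketch(seqs, k=8, size=64):
    """Bottom-`size` sketch: the smallest hash values over all k-mers in the dataset."""
    hashes = set()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            digest = hashlib.sha1(seq[i:i + k].encode()).digest()
            hashes.add(int.from_bytes(digest[:8], "big"))
    return sorted(hashes)[:size]


def sketch_similarity(a, b, size=64):
    """Estimate Jaccard similarity of two datasets from their bottom-`size` sketches."""
    merged = sorted(set(a) | set(b))[:size]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


# Toy data (hypothetical): two small "datasets" that share part of their content
dataset_a = ["ACGTACGGTTCA"]
dataset_b = ["ACGTACGGAAAA"]
sketch_a = minhash_sketch(dataset_a, k=4)
sketch_b = minhash_sketch(dataset_b, k=4)
similarity = sketch_similarity(sketch_a, sketch_b)
print(round(similarity, 3))
```

A sketch is tiny compared with the raw reads, so a catalog of sketches could be compared against a new dataset's sketch without moving the underlying data.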

Efforts towards these goals
Competition hosted by Innocentive on behalf of the US Defense Threat Reduction Agency
Helix.io
Genetic classification startup

Competitions as a way to drive innovation
Sage Bionetworks’ competition platform
Can build directly on each other’s code!
SAGE/DREAM breast cancer challenge
Winner of the Innocentive challenge
http://www.newton.ac.uk/programmes/MTG/seminars/2014032415301.html
CLARITY challenge
Identifying possible disease-causing genetic variants in three children

Summary
DNA sequencing has great potential for improved diagnostics and pathogen
discovery
We need more efforts in real-time sequence analysis for diagnostics and
monitoring
We need better ways to publish and connect data sets online to enable more
efficient and unbiased discovery
Online collaboration can help both through open data and online competitions
@mikaelhuss
http://followthedata.wordpress.com

Acknowledgements
Research environment
Thomas Svensson + the rest of the WABI group
Joakim Lundeberg + his group members
Helpful comments
Petter Holme
Stefania Giacomello
Mattias Andersson
Metagenomics discussions
Anders Andersson + group
Hilja Strid
Joakim Larsson
Johan Bengtsson-Palme
+ the readers of my blog and all the data enthusiasts in Stockholm and elsewhere!

Why hasn’t Hadoop caught on in genomics?
Hadoop is almost synonymous with big data in the corporate world
Ideas:
– Existing computing infrastructure is sufficient
– Or the field has focused on supercomputing solutions rather than commodity servers
– The programming skills and training are not there
– Many problems not parallelisable
– Not enough flexibility for exploratory analysis
