3. SciLifeLab – an infrastructure for massive biology
Science 328, 805 (14 May 2010)
Inaugurated mid-2010
Hosted by three universities in Stockholm:
Karolinska Institutet (medical faculty), Royal Institute
of Technology (technical) and Stockholm University
(natural science). SciLifeLab node in Uppsala.
Approximately 700 researchers
More than 100 researchers in bioinformatics and
systems biology
Co-directors Prof Mathias Uhlén (KTH), Jan
Andersson (KI), Gunnar von Heijne (SU)
5. … but some people are willing to go out on a limb
“Where is the cut-off? The line in the sand is 5TB of unstructured data or 7.5-10TB of structured data, which cannot be reduced any further”
(OLRAC SPS)
http://www.itweb.co.za/index.php?option=com_content&view=article&id=111815
“There is no such thing as biomedical big data”
(Will Bush, Vanderbilt University Center for Human Genetic Research)
http://gettinggeneticsdone.blogspot.se/2014/02/no-such-thing-biomedical-bigdata.html
6. Genomics big data in context: Throughput
Data processed per day (terabytes)
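[Figure: data processed per day in TB, on a log-scale axis with ticks at 1e-01, 1e+01, 1e+03 and 1e+05. Entities shown: Twitter, SciLifeLab, King, NYSE, Sanger, Spotify, BGI, Facebook, Baidu, Google, Ebay, Internet (world).]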
7. Genomics big data in context: Storage
Data stored (petabytes)
[Figure: data stored in PB, on a log-scale axis with ticks at 1e-01, 1e+01 and 1e+03. Entities shown: Twitter, SciLifeLab, Spotify, Sanger, Ebay, Facebook, Baidu, NSA, Google.]
8. Genomics big data in context: Heterogeneity
“The size of the data is not the whole story.
If the data are uniform, they can almost always
be compressed and filtered with traditional
methods.
You do not get a ‘big data’ processing challenge until other factors, such as variety, non-uniformity and continuous growth, are added to a large data set.”
(adapted from Aleksi Kallio)
9. DNA sequencing with and without a reference
Reference-based
Analogous to matching words and sentence fragments to a book. Called “alignment” or “mapping”.
Algorithmically: Matching
strings to an index.
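As a rough illustration of "matching strings to an index", the sketch below builds a k-mer index over a toy reference and uses the first k-mer of a read as a seed. The function names, the toy sequences and the exact-match rule are assumptions made for this example; real aligners use compressed indexes and tolerate mismatches.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


def map_read(read: str, index: dict[str, list[int]], reference: str, k: int) -> list[int]:
    """Return the reference positions where the whole read matches exactly.

    The first k-mer of the read acts as a seed: the index proposes candidate
    positions, and each candidate is then verified against the reference.
    """
    if len(read) < k:
        return []
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


# Toy data (hypothetical, for illustration only).
reference = "ACGTACGTTAGC"
index = build_kmer_index(reference, k=4)
positions = map_read("ACGTT", index, reference, k=4)
```

The seed "ACGT" occurs twice in the toy reference, but only one candidate survives verification, which is the basic seed-and-verify pattern.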
Reference-free
Analogous to reconstructing a book from scratch based on only the words and sentence fragments. Called (de novo) “assembly”.
Algorithmically: finding the
best path through a very
complicated graph.
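To make the graph view of assembly concrete, here is a minimal sketch that builds a de Bruijn graph from toy reads and follows it while the path is unambiguous. It assumes error-free reads and no repeats; the function names and reads are hypothetical, and real assemblers must deal with branches, sequencing errors and uneven coverage.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer in a read adds one edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph):
    """Start at a node with no incoming edge and extend the contig
    for as long as the current node has exactly one successor."""
    targets = {t for successors in graph.values() for t in successors}
    starts = sorted(node for node in graph if node not in targets)
    if not starts:
        return ""
    node = starts[0]
    contig = node
    visited = {node}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:  # stop rather than loop around a cycle
            break
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


# Three overlapping toy reads (hypothetical) drawn from one short sequence.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
contig = walk_unambiguous_path(graph)
```

In this toy case every node has a single successor, so the walk recovers the full sequence; the "very complicated graph" of the slide arises once repeats and errors create branches.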
13. Diagnostics
“NGS saves a young life”, http://omicsomics.blogspot.se/2014/02/ngs-saves-young-life.html
Storified tweets about this story: http://nextgenseek.com/2014/02/ngs-in-critical-care-a-feel-good-story/
Joe DeRisi (UCSF)
14-year-old boy came in with various symptoms for which the
underlying problem was hard to diagnose
In the end, a 1 cubic centimeter brain biopsy was taken and
sequenced on a MiSeq instrument, which identified a pathogen
(Leptospira)
Sequencing took ~1 day including lab work, analysis took 1.5 h
Image from Charles Chiu, UCSF
Real-time/streaming bioinformatics needed!
15. The unknown
In a recent paper on soil metagenomics, Titus Brown and colleagues report that:
80% of the 398 billion sequences could not be assembled into putative
genes
Of the cases where sequences could be assembled into putative genes which
would create putative proteins, 60% of these proteins could not be matched
to anything in the databases!
16. Hunting viral pathogens
Many academic groups and companies are trying to identify viruses that might
be involved in a variety of diseases in humans and animals.
“Needle in a haystack” problem. Real-life example:
- Sequence human or animal tissue samples (~30-40 million sequences).
- Filter out host DNA in the computer.
- Try to match rest of sequences to databases of known viruses.
- For whatever is left, assemble sequences de novo and match the assembled
“genes” to “everything” out there (=NCBI’s NT and other databases).
- End up with ~20,000 putative genes that don’t resemble anything in the
databases.
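The filtering steps above can be sketched as a small triage function. This is a toy under stated assumptions: shared k-mers stand in for real alignment tools, the de novo assembly step is left out, and all names and sequences are hypothetical.

```python
def kmers(seq, k):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_kmer(read, kmer_set, k):
    """True if the read has at least one k-mer in common with kmer_set."""
    return not kmers(read, k).isdisjoint(kmer_set)


def triage_reads(reads, host_genome, known_viruses, k=5):
    """Drop host reads, label reads that hit a known virus, keep the rest as unknown."""
    host_kmers = kmers(host_genome, k)
    virus_kmers = {name: kmers(seq, k) for name, seq in known_viruses.items()}
    known_hits, unknown = {}, []
    for read in reads:
        if shares_kmer(read, host_kmers, k):
            continue  # filter out host DNA
        matches = [name for name, ks in virus_kmers.items() if shares_kmer(read, ks, k)]
        if matches:
            known_hits[read] = matches  # matches a database of known viruses
        else:
            unknown.append(read)  # candidates for de novo assembly
    return known_hits, unknown


# Toy inputs (hypothetical).
host = "AAAAACCCCCGGGGGTTTTT"
viruses = {"virus_x": "ACGTACGATCGATTGCA"}
reads = ["AACCCCCGG", "CGATCGATT", "TCTCTGAGAGTC"]
known_hits, unknown = triage_reads(reads, host, viruses)
```

Whatever ends up in the unknown list corresponds to the putative genes that resemble nothing in the databases.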
17. Public data
We realized there is a lot of data online, although scattered around.
Can use the raw or assembled sequences from these studies as part of our own
studies.
Also, by combining different data sets and their metadata, we may get clues about
what the unknown things are.
Problems:
1) Sequence comparisons take a long time – need more efficient algorithms.
2) Publicly available data is scattered and disorganized, and much that could be
public isn’t.
18. Wishlist
- Everybody who is doing metagenomics is finding a lot of unknown stuff!
- Make as much sequence data as possible available
- Build infrastructure to make all the sequences findable and queryable so that we can
identify commonalities between data sets
- String matching algorithms better adapted for “big-data” use cases in
genomics:
- Real-time (streaming) matching, for diagnostics and environmental
monitoring
- More efficient matching of sequences to huge reference indexes (“every
known sequence”)
- Develop more reference-free methods for discovering new organisms and
genes
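One way to picture the real-time (streaming) matching item is a generator that classifies reads as they arrive and stops as soon as enough evidence has accumulated. The sketch below is only an illustration: the shared-k-mer rule, the min_hits threshold and all names are assumptions, not a description of any existing tool.

```python
from collections import Counter
from typing import Iterable, Iterator


def kmers(seq: str, k: int) -> set[str]:
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def stream_classify(reads: Iterable[str], references: dict[str, str],
                    k: int = 5) -> Iterator[tuple[str, str]]:
    """Yield (read, reference_name) for each incoming read that shares a k-mer
    with a reference. Reads are consumed one at a time, never all at once."""
    ref_kmers = {name: kmers(seq, k) for name, seq in references.items()}
    for read in reads:
        read_kmers = kmers(read, k)
        for name, ks in ref_kmers.items():
            if not read_kmers.isdisjoint(ks):
                yield read, name
                break


def first_confident_call(hits: Iterable[tuple[str, str]], min_hits: int = 2):
    """Return the first reference that reaches min_hits, or None if none does."""
    counts = Counter()
    for _, name in hits:
        counts[name] += 1
        if counts[name] >= min_hits:
            return name
    return None


# Toy stream (hypothetical): an iterator stands in for a running sequencer.
references = {"pathogen_x": "ACGTACGATCGATTGCA"}
incoming = iter(["TTTTTTTT", "CGATCGATT", "GGGGGGG", "TACGATCG", "ACGTACG"])
call = first_confident_call(stream_classify(incoming, references))
unread = list(incoming)  # reads that were never needed
```

Because the call is made after the second matching read, the last read in the stream is never examined, which is the point of streaming analysis for diagnostics.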
19. Efforts towards these goals
“we want to support automated data exploration in ways that are simply not possible today”
C Titus Brown (http://ivory.idyll.org/blog/2014-moore-ddd-round2-final.html)
Jeff Jonas:
“Data finds data”
“The data is the query”
Using the dataset itself, or a statistical
description of it, as a query
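A possible reading of "a statistical description of it, as a query" is a compact sketch of each data set that can be compared against sketches of other data sets. The example below uses a bottom-s MinHash sketch of k-mers as one such description; this technique is chosen here for illustration and is not attributed to the speakers quoted above. All names and data are hypothetical.

```python
import hashlib


def kmer_hashes(seq, k):
    """Hash every k-mer of a sequence to a 64-bit integer."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.sha1(seq[i:i + k].encode()).digest()
        hashes.add(int.from_bytes(digest[:8], "big"))
    return hashes


def sketch(seqs, k=4, size=64):
    """Bottom-s sketch: the `size` smallest k-mer hashes of a data set."""
    all_hashes = set()
    for seq in seqs:
        all_hashes |= kmer_hashes(seq, k)
    return sorted(all_hashes)[:size]


def similarity(sketch_a, sketch_b, size=64):
    """Estimate the Jaccard similarity of two data sets from their sketches."""
    a, b = set(sketch_a), set(sketch_b)
    smallest_union = sorted(a | b)[:size]
    if not smallest_union:
        return 0.0
    shared = sum(1 for h in smallest_union if h in a and h in b)
    return shared / len(smallest_union)


def rank_by_similarity(query_sketch, catalogue):
    """Order catalogue entries from most to least similar to the query."""
    return sorted(catalogue,
                  key=lambda name: similarity(query_sketch, catalogue[name]),
                  reverse=True)


# Toy catalogue of public data sets (hypothetical names and sequences).
catalogue = {
    "soil_a": sketch(["ACGTACGTGGA", "TTGACCA"]),
    "gut_b": sketch(["GGGCCCAAATTT"]),
}
query = sketch(["ACGTACGTGGA"])  # the data set itself is the query
ranking = rank_by_similarity(query, catalogue)
```

The sketches are small and fixed in size, so in principle they could be published alongside data sets and compared without moving the raw sequences.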
20. Efforts towards these goals
Competition hosted by Innocentive on behalf of the US Defense Threat Reduction Agency
Helix.io
Genetic classification
startup
23. Competitions as a way to drive innovation
Sage Bionetworks’ competition platform
Can build directly on each other’s code!
SAGE/DREAM breast cancer challenge
Winner of the Innocentive challenge
http://www.newton.ac.uk/programmes/MTG/seminars/2014032415301.html
CLARITY challenge
Identifying possible disease causal
genetic variants in three children
24. Summary
DNA sequencing has great potential for improved diagnostics and pathogen
discovery
We need more efforts in real-time sequence analysis for diagnostics and
monitoring
We need better ways to publish and connect data sets online to enable more
efficient and unbiased discovery
Online collaboration can help both through open data and online competitions
@mikaelhuss
http://followthedata.wordpress.com
25. Acknowledgements
Research environment
Thomas Svensson + the rest of the WABI group
Joakim Lundeberg + his group members
Helpful comments
Petter Holme
Stefania Giacomello
Mattias Andersson
Metagenomics discussions
Anders Andersson + group
Hilja Strid
Joakim Larsson
Johan Bengtsson-Palme
+ the readers of my blog and all the data enthusiasts in Stockholm and elsewhere!
27. Why hasn’t Hadoop caught on in genomics?
Hadoop is almost synonymous with big data in the corporate world
Ideas:
– Existing computing infrastructure is sufficient
– Or, focused on supercomputing solutions rather than commodity
servers
– The programming skills and training are not there
– Many problems not parallelisable
– Not enough flexibility for exploratory analysis