Integrating large, fast-moving, and
heterogeneous data sets in biology.


              C. Titus Brown
            Asst Prof, CSE and
               Microbiology;
           BEACON NSF STC
         Michigan State University
              ctb@msu.edu
Introduction
 Background:
   Modeling & data analysis undergrad =>
   Open source software development + software
    engineering +
   developmental biology + genomics PhD =>
   Bio + computer science faculty =>
   Data-driven biology


 Currently working with next-gen sequencing data
  (mRNAseq, metagenomics, difficult genomes).
 Thinking hard about how to do data-driven
  modeling & model-driven data analysis.
Goal & outline
      Address challenges and opportunities of
   heterogeneous data integration: the 1,000-foot view.

Outline:
 What types of analysis and discovery do we want
  to enable?
 What are the technical challenges, common
  solutions, and common failure points?
 Where might we look for success stories, and
  what lessons can we port to biology?
 My conclusions.
Specific types of questions
 “I have a known chemical/gene interaction; do I see it
  in this other data set?”
 “I have a known chemical/gene interaction; what other
  gene expression is affected?”
 “What does chemical X do to overall phenotype, gene
  expression, protein localization, and patterns of
  histone modification?”
 More complex/combinatorial interactions:
   What does this chemical do in this genetic background?
    What additional gene expression changes are
     generated by the combination of these two chemicals?
   What are common effects of this class of chemicals?
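
To make the first question above concrete: against a tidy differential-expression table, it can reduce to a few lines of pandas. Everything here -- the file name, the column names, and the chemical/gene pair -- is a hypothetical stand-in:

```python
import pandas as pd

# a known chemical/gene interaction; BPA/ESR1 is only an illustrative pair
chemical, gene = "bisphenol A", "ESR1"

# hypothetical differential-expression table:
# columns chemical, gene, log2fc, padj
expr = pd.read_csv("other_dataset.csv")

hit = expr[(expr.chemical == chemical) &
           (expr.gene == gene) &
           (expr.padj < 0.05)]
print("seen in this data set" if len(hit) else "not seen in this data set")
```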
What general behavior do we want to
enable?
 Reuse of data by groups that did not/could not
  produce it.

 Publication of reusable/“fork”able data analysis
  pipelines and models.

 Integration of data and models.


 Serendipitous uses and cross-referencing of data sets
  (“mashups”).

 Rapid scientific exploration and hypothesis generation
  in data space.
(Executable papers & data reuse)
 ENCODE
   All data is available; all processing scripts for
   papers are available on a virtual machine.

 QIIME (microbial ecology)
  An Amazon virtual machine containing the software and
  data for:
  “Collaborative cloud-enabled tools allow rapid,
  reproducible biological insights.” (PMID 23096404)

 Digital normalization paper
  Amazon virtual machine, again:
  http://arxiv.org/abs/1203.4802
Executable papers can support easy
replication & reuse of code and data.

[Screenshot: the digital normalization paper as an IPython
Notebook; also see RStudio.]

http://ged.msu.edu/papers/2012-diginorm/notebook/
What general behavior do we want to
enable?
 Reuse of data by groups that did not/could not
  produce it.

 Publication of reusable/“fork”able data analysis
  pipelines and models.

 Integration of data and models.


 Serendipitous uses and cross-referencing of data sets
  (“mashups”).

 Rapid scientific exploration and hypothesis generation
  in data space.
An entertaining digression --
A mashup of Facebook “top 10 books by college” and per-college SAT rank:

http://booksthatmakeyoudumb.virgil.gr/
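
Technically, a mashup like this is just a join between two independently collected tables. A minimal sketch in pandas, with invented file and column names:

```python
import pandas as pd

books = pd.read_csv("facebook_top_books.csv")  # columns: college, book
sat = pd.read_csv("college_sat_ranks.csv")     # columns: college, sat_rank

# the whole "integration" step is a single join on a shared key...
merged = books.merge(sat, on="college")

# ...after which ranking books by mean SAT rank is one more line
print(merged.groupby("book")["sat_rank"].mean().sort_values())
```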
Technical obstacles
 Syntactic incompatibility
   The first 90% of bioinformatics: your IDs are different
    from my IDs.
 Semantic incompatibility
   The second 90% of bioinformatics: what does “gene”
    mean in your database?
 Impedance mismatch
   SQL is notoriously bad at representing intervals and
    hierarchies
   Genomes consist of intervals; ontologies consist of
    hierarchies!
   …yet SQL databases dominate (vs graph or object DBs);
    see the interval sketch after this list.
 Data volume & velocity
   Large & expanding data sets just make everything
    harder.
 Unstructured data
   aka “publications” – most scientific knowledge is “locked
    up” in free text.
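
To illustrate the interval half of the impedance mismatch: the natural genomic question “which features overlap this region?” is a pairwise inequality test that plain SQL expresses awkwardly, but a few lines of Python state directly (real tools use interval trees rather than this linear scan). The coordinates below are invented:

```python
# (name, start, end) tuples standing in for genome annotations
features = [("geneA", 1000, 5000), ("geneB", 4500, 7000), ("geneC", 9000, 9500)]

def overlapping(features, start, end):
    # two intervals overlap iff each one starts before the other ends
    return [name for name, s, e in features if s < end and e > start]

print(overlapping(features, 4800, 9200))  # -> ['geneA', 'geneB', 'geneC']
```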
Typical solutions
 “Entity resolution”
    Accession numbers or other common identifiers
  …requires a global naming system OR translators (a toy
  translator is sketched after this list).

 Top-down imposition of structure
   Centralized DB;
   “Here is the schema you will all use”;
  …limits flexibility, prevents use of unstructured data, heavyweight.

 Ontologies to enable “correct” communication
   Centrally coordinated vocabulary
  …slow, hard to get right, doesn’t solve the unstructured-data
  problem. Balancing theoretical rigor with practical applicability
  is particularly hard.

 Ad hoc entity resolution (“winging it”)
   Common solution
  …doesn’t work that well.
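
A toy version of the translator approach to entity resolution; all the mappings below are invented placeholders, and a real translator would be backed by a curated cross-reference table rather than a hard-coded dict:

```python
# (source_db, accession) -> {target_db: accession}
TRANSLATIONS = {
    ("my_db", "GENE0001"): {"ncbi": "LOC101927"},
    ("my_db", "GENE0002"): {"ensembl": "ENSG00000139618"},
}

def translate(source_db, accession, target_db):
    targets = TRANSLATIONS.get((source_db, accession), {})
    if target_db not in targets:
        raise KeyError(f"no {target_db} mapping for {source_db}:{accession}")
    return targets[target_db]

print(translate("my_db", "GENE0002", "ensembl"))  # -> ENSG00000139618
```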
Are better standards the solution?

[xkcd #927, “Standards”: proposing one universal standard
just yields n+1 competing standards.]

http://xkcd.com/927/
Rephrasing technical goals
How can we best provide a platform or platforms to
support flexible data integration and data
investigation across a wide range of data sets and
data types in biology?


My interests:
 Avoid master data manager and centralization
 Support federated roll-out of new data and
  functionality
 Provide flexible extensibility of ontologies and
  hierarchies
 Support a diverse “ecology” of databases.
Success stories outside of
biology?
 Look for domains:
   with really large amounts of heterogeneous data,
   that are continually increasing in size,
   are being effectively mined on an ongoing basis,
   have widely used programmatic interfaces that
    support “mashups” and other cross-database stuff,
   and are intentional, with principles that we can
    steal or adapt.
Success stories outside of
biology?
 Look for domains:
   with really large amounts of heterogeneous data,
   that are continually increasing in size,
   are being effectively mined on an ongoing basis,
   have widely used programmatic interfaces that
    support “mashups” and other cross-database stuff,
   and are intentional, with principles that we can
    steal or adapt.


                        Amazon.
Amazon:
 > 50 million users, > 1 million product partners,
    billions of reviews, dozens of compute services …
   Continually changing/updating data sets.
   Explicitly adopted a service-oriented architecture
    that enables both internal and external use of this
    data.
   For example, the amazon.com Web site is itself
    built from over 150 independent services…
   Amazon routinely deploys new services and
    functionality.
Sources:
The Platform Rant (Steve Yegge) -- in which he
compares the Google and Amazon approaches:
https://plus.google.com/112678702228711889851/posts/eVeouesvaVX

A summary at HighScalability.com:
http://highscalability.com/amazon-architecture

 (Note: both are long and tech-y, but the first
            is especially entertaining.)
A brief summary of core
principles
Mandates from the CEO:

1. All teams must expose data and functionality
   solely through a service interface.
2. All communication between teams happens
   through that service interface.
3. All service interfaces must be designed so that
   they can be exposed to the outside world.
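
A minimal sketch of what mandate #1 can look like in practice, here using Flask; the route, gene IDs, and values are hypothetical stand-ins for a team’s real database:

```python
from flask import Flask, abort, jsonify

app = Flask(__name__)

_EXPRESSION = {"geneA": 12.4, "geneB": 0.7}  # stand-in for the team's database

@app.route("/v1/expression/<gene_id>")
def expression(gene_id):
    # every consumer -- other teams, the public site, outside users --
    # goes through this interface; nobody reads the database directly
    if gene_id not in _EXPRESSION:
        abort(404)
    return jsonify(gene=gene_id, tpm=_EXPRESSION[gene_id])

if __name__ == "__main__":
    app.run(port=5000)
```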
More colloquially:
         “You should eat your own dogfood.”

  Design and implement the database and database
functionality to meet your own needs; and only use the
    functionality you’ve explicitly made available to
                       everyone.

To adapt this to research: database functionality should be
designed in tight integration with the researchers who are
      using it, both at the user interface level and
                     programmatically.

(Genome databases have done a really good job of this,
      albeit generally in a centralized model.)
If the “customers” aren’t integrated
into the development loop:
A platform view?

[Diagram: four expression data sets (tiling array, microarray,
and two mRNAseq sets) feed shared services -- expression
normalization, a gene ID translator, isoform
resolution/comparison, and chemical relationships -- which in
turn support higher-level uses: a metabolic model, differential
gene expression queries, and web-based data exploration.]
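
A hedged sketch of how a client might chain two of the services in this diagram -- ID translation, then a differential-expression query -- over HTTP; the hosts, routes, and parameters are all hypothetical:

```python
import requests

def diff_expression(gene_id, source_db="my_db"):
    # step 1: normalize the identifier via the gene ID translator service
    r = requests.get("http://ids.example.org/v1/translate",
                     params={"db": source_db, "id": gene_id, "to": "ensembl"})
    r.raise_for_status()
    ensembl_id = r.json()["id"]

    # step 2: query differential expression using the shared identifier
    r = requests.get(f"http://expr.example.org/v1/diff/{ensembl_id}")
    r.raise_for_status()
    return r.json()

print(diff_expression("GENE0002"))
```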
A few points
 Open source and agile software development
 approaches can be surprisingly effective and
 inexpensive.

 Developing services in small groups that include
 “customer-facing developers” helps ensure utility.

 Implementing services in the “cloud” (e.g. on virtual
 machines, or on top of “infrastructure as a service”
 offerings) gives developers flexibility in tools,
 approaches, and implementation; it also enables
 scaling and reusability.
Combining modelling with data
 Data-driven modeling: connections and parameters
  can be, to some extent, determined from data.

 Model-driven data investigation: data that doesn’t fit
  the “known” model is particularly interesting.

The second approach is essentially how particle
physicists work with accelerator data: build a model &
then interpret the data using the model.

(In biology, models are less constraining, though; more
                      unknowns.)
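
“Model-driven data investigation” in miniature: flag the measurements the model fails to predict, since those are the interesting ones. The model, data, and threshold below are all invented for illustration:

```python
import numpy as np

def model_prediction(x):
    return 2.0 * x + 1.0  # stand-in for a real mechanistic model

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
observed = np.array([1.1, 2.9, 5.2, 11.8, 9.1])  # one anomalous point

residuals = np.abs(observed - model_prediction(x))
print(x[residuals > 3.0])  # -> [3.]  the point that doesn't fit the model
```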
Using developmental models

[Figure: the sea urchin endomesoderm gene regulatory network.]

Davidson et al., http://sugp.caltech.edu/endomes
Using developmental models


    Models can contain useful abstractions of
specific processes; here, the direct effects of
 blocking nuclearization of β-catenin can be
   predicted by following the connections.

Models provide a common language for (dis)agreement
                within a community.
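
The “follow the connections” idea, sketched with networkx: in a directed gene regulatory network, the predicted direct effects of blocking a node are simply its out-neighbors. The edges below are a made-up fragment, not the actual endomesoderm network:

```python
import networkx as nx

grn = nx.DiGraph()
grn.add_edges_from([
    ("nuclear_beta_catenin", "blimp1"),
    ("nuclear_beta_catenin", "wnt8"),
    ("blimp1", "otx"),
    ("wnt8", "nuclear_beta_catenin"),  # feedback loop
])

blocked = "nuclear_beta_catenin"
print(list(grn.successors(blocked)))  # predicted direct effects of the block
```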
Using developmental models

[The endomesoderm network figure, shown again.]

Davidson et al., http://sugp.caltech.edu/endomes
Social obstacles
 Training of biologically aware software developers
 is lacking.

 Molecular biologists still largely have a
 computationally naïve mindset: “give me the
 answer so I can do the real work.”

 Incentives for data sharing -- much less for useful
 data sharing -- are not yet very strong.
   Pubs, grants, respect...


 Patterns for useful data sharing are still not well
 understood, in general.
Other places to look
 NEON and other NSF centers (e.g. NCEAS) are
 collecting vast heterogeneous data sets, and are
 explicitly tackling the data
 management/use/integration/reuse problem.

 SBML (“Systems Biology Markup Language”) is a
 model-description language that enables
 interoperability of modeling software; see the
 sketch after this list.

 Software Carpentry runs free workshops on
 effective use of computation for science.
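
What SBML’s interoperability buys, in a minimal sketch: any SBML-aware tool can load the same model file. This uses the python-libsbml bindings; the file name is a placeholder:

```python
import libsbml

doc = libsbml.readSBML("model.xml")
if doc.getNumErrors() > 0:
    doc.printErrors()

model = doc.getModel()
print(model.getId(),
      model.getNumSpecies(), "species,",
      model.getNumReactions(), "reactions")
```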
My conclusions…
 We need a “platform” mentality to make the best use
  of our data, even if we don’t completely embrace
  loose coupling and distribution.

 Agile and end-user focused software development
  methodologies have worked well in other areas; much
  of the hard technical space has already been
  explored in Internet companies (and probably social
  networking companies, too).

 Data is most useful in the context of an explicit model;
  models can be generated from data, and models can
  feed back into data gathering.
Things I didn’t discuss
 Database maintenance and active curation is
  incredibly important.

 Most data only makes sense in the context of other
  data (think: controls; wild type vs knockout; other
  backgrounds; etc.) – so we will need lots more data to
  interpret the data we already have.

 “Deep learning” is a promising field for extracting
  correlations from multiple large data sets.

 All of these technical problems are easier to solve
  than the social problems (incentives; training).
Thanks --

This talk and ancillary notes will be available on my
                     blog ~soon:
                 http://ivory.idyll.org/blog/

Please do contact me at ctb@msu.edu if you have
            questions or comments.

Editor's Notes

  • #22 (the “A few points” slide): Separation of concerns; multiple
    implementations possible; when you publish, you don’t have to talk
    to anybody to get “your method” integrated; recognition that
    everything is changing. Embrace chaos.