Integrating large, fast-moving, and
heterogeneous data sets in biology.


              C. Titus Brown
            Asst Prof, CSE and
               Microbiology;
           BEACON NSF STC
         Michigan State University
              ctb@msu.edu
Introduction
 Background:
   Modeling & data analysis undergrad =>
   Open source software development + software
    engineering +
   developmental biology + genomics PhD =>
   Bio + computer science faculty =>
   Data-driven biology


 Currently working with next-gen sequencing data
  (mRNAseq, metagenomics, difficult genomes).
 Thinking hard about how to do data-driven
  modeling & model-driven data analysis.
Goal & outline
      Address challenges and opportunities of
   heterogeneous data integration: the 1,000-foot view.

Outline:
 What types of analysis and discovery do we want
  to enable?
 What are the technical challenges, common
  solutions, and common failure points?
 Where might we look for success stories, and
  what lessons can we port to biology?
 My conclusions.
Specific types of questions
 “I have a known chemical/gene interaction; do I see it
  in this other data set?”
 “I have a known chemical/gene interaction; what other
  gene expression is affected?”
 “What does chemical X do to overall phenotype, gene
  expression, protein localization, and patterns of
  histone modification?”
 More complex/combinatorial interactions:
   What does this chemical do in this genetic background?
    What additional gene expression changes are
     generated by the combination of these two chemicals?
   What are common effects of this class of chemicals?
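
To make the first question above concrete: against a tidy differential-expression table, it can reduce to a few lines of pandas. Everything here -- the file name, the column names, and the chemical/gene pair -- is a hypothetical stand-in:

```python
import pandas as pd

# a known chemical/gene interaction; BPA/ESR1 is only an illustrative pair
chemical, gene = "bisphenol A", "ESR1"

# hypothetical differential-expression table:
# columns chemical, gene, log2fc, padj
expr = pd.read_csv("other_dataset.csv")

hit = expr[(expr.chemical == chemical) &
           (expr.gene == gene) &
           (expr.padj < 0.05)]
print("seen in this data set" if len(hit) else "not seen in this data set")
```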
What general behavior do we want to
enable?
 Reuse of data by groups that did not/could not
  produce it.

 Publication of reusable/“fork”able data analysis
  pipelines and models.

 Integration of data and models.


 Serendipitous uses and cross-referencing of data sets
  (“mashups”).

 Rapid scientific exploration and hypothesis generation
  in data space.
(Executable papers & data reuse)
 ENCODE
   All data is available; all processing scripts for
   papers are available on a virtual machine.

 QIIME (microbial ecology)
  An Amazon virtual machine containing the software and
  data for:
  “Collaborative cloud-enabled tools allow rapid,
  reproducible biological insights.” (PMID 23096404)

 Digital normalization paper
  Amazon virtual machine, again:
  http://arxiv.org/abs/1203.4802
Executable papers can support easy
replication & reuse of code and data.

[Screenshot: the digital normalization paper as an IPython
Notebook; also see RStudio.]

http://ged.msu.edu/papers/2012-diginorm/notebook/
What general behavior do we want to
enable?
 Reuse of data by groups that did not/could not
  produce it.

 Publication of reusable/“fork”able data analysis
  pipelines and models.

 Integration of data and models.


 Serendipitous uses and cross-referencing of data sets
  (“mashups”).

 Rapid scientific exploration and hypothesis generation
  in data space.
An entertaining digression --
A mashup of Facebook “top 10 books by college” and per-college SAT rank:

http://booksthatmakeyoudumb.virgil.gr/
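
Technically, a mashup like this is just a join between two independently collected tables. A minimal sketch in pandas, with invented file and column names:

```python
import pandas as pd

books = pd.read_csv("facebook_top_books.csv")  # columns: college, book
sat = pd.read_csv("college_sat_ranks.csv")     # columns: college, sat_rank

# the whole "integration" step is a single join on a shared key...
merged = books.merge(sat, on="college")

# ...after which ranking books by mean SAT rank is one more line
print(merged.groupby("book")["sat_rank"].mean().sort_values())
```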
Technical obstacles
 Syntactic incompatibility
   The first 90% of bioinformatics: your IDs are different
    from my IDs.
 Semantic incompatibility
   The second 90% of bioinformatics: what does “gene”
    mean in your database?
 Impedance mismatch
   SQL is notoriously bad at representing intervals and
    hierarchies
   Genomes consist of intervals; ontologies consist of
    hierarchies!
   …yet SQL databases dominate (vs graph or object DBs);
    see the interval sketch after this list.
 Data volume & velocity
   Large & expanding data sets just make everything
    harder.
 Unstructured data
   aka “publications” – most scientific knowledge is “locked
    up” in free text.
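
To illustrate the interval half of the impedance mismatch: the natural genomic question “which features overlap this region?” is a pairwise inequality test that plain SQL expresses awkwardly, but a few lines of Python state directly (real tools use interval trees rather than this linear scan). The coordinates below are invented:

```python
# (name, start, end) tuples standing in for genome annotations
features = [("geneA", 1000, 5000), ("geneB", 4500, 7000), ("geneC", 9000, 9500)]

def overlapping(features, start, end):
    # two intervals overlap iff each one starts before the other ends
    return [name for name, s, e in features if s < end and e > start]

print(overlapping(features, 4800, 9200))  # -> ['geneA', 'geneB', 'geneC']
```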
Typical solutions
 “Entity resolution”
    Accession numbers or other common identifiers
  …requires a global naming system OR translators (a toy
  translator is sketched after this list).

 Top-down imposition of structure
   Centralized DB;
   “Here is the schema you will all use”;
  …limits flexibility, prevents use of unstructured data, heavyweight.

 Ontologies to enable “correct” communication
   Centrally coordinated vocabulary
  …slow, hard to get right, doesn’t solve the unstructured-data
  problem. Balancing theoretical rigor with practical applicability
  is particularly hard.

 Ad hoc entity resolution (“winging it”)
   Common solution
  …doesn’t work that well.
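
A toy version of the translator approach to entity resolution; all the mappings below are invented placeholders, and a real translator would be backed by a curated cross-reference table rather than a hard-coded dict:

```python
# (source_db, accession) -> {target_db: accession}
TRANSLATIONS = {
    ("my_db", "GENE0001"): {"ncbi": "LOC101927"},
    ("my_db", "GENE0002"): {"ensembl": "ENSG00000139618"},
}

def translate(source_db, accession, target_db):
    targets = TRANSLATIONS.get((source_db, accession), {})
    if target_db not in targets:
        raise KeyError(f"no {target_db} mapping for {source_db}:{accession}")
    return targets[target_db]

print(translate("my_db", "GENE0002", "ensembl"))  # -> ENSG00000139618
```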
Are better standards the solution?

[xkcd #927, “Standards”: proposing one universal standard
just yields n+1 competing standards.]

http://xkcd.com/927/
Rephrasing technical goals
How can we best provide a platform or platforms to
support flexible data integration and data
investigation across a wide range of data sets and
data types in biology?


My interests:
 Avoid master data manager and centralization
 Support federated roll-out of new data and
  functionality
 Provide flexible extensibility of ontologies and
  hierarchies
 Support a diverse “ecology” of databases.
Success stories outside of
biology?
 Look for domains:
   with really large amounts of heterogeneous data,
   that are continually increasing in size,
   are being effectively mined on an ongoing basis,
   have widely used programmatic interfaces that
    support “mashups” and other cross-database stuff,
   and are intentional, with principles that we can
    steal or adapt.
Success stories outside of
biology?
 Look for domains:
   with really large amounts of heterogeneous data,
   that are continually increasing in size,
   are being effectively mined on an ongoing basis,
   have widely used programmatic interfaces that
    support “mashups” and other cross-database stuff,
   and are intentional, with principles that we can
    steal or adapt.


                        Amazon.
Amazon:
 > 50 million users, > 1 million product partners,
    billions of reviews, dozens of compute services …
   Continually changing/updating data sets.
   Explicitly adopted a service-oriented architecture
    that enables both internal and external use of this
    data.
   For example, the amazon.com Web site is itself
    built from over 150 independent services…
   Amazon routinely deploys new services and
    functionality.
Sources:
The Platform Rant (Steve Yegge) -- in which he
compares the Google and Amazon approaches:
https://plus.google.com/112678702228711889851/posts/eVeouesvaVX

A summary at HighScalability.com:
http://highscalability.com/amazon-architecture

 (Note: both are long and tech-y, but the first
            is especially entertaining.)
A brief summary of core
principles
Mandates from the CEO:

1. All teams must expose data and functionality
   solely through a service interface.
2. All communication between teams happens
   through that service interface.
3. All service interfaces must be designed so that
   they can be exposed to the outside world.
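
A minimal sketch of what mandate #1 can look like in practice, here using Flask; the route, gene IDs, and values are hypothetical stand-ins for a team’s real database:

```python
from flask import Flask, abort, jsonify

app = Flask(__name__)

_EXPRESSION = {"geneA": 12.4, "geneB": 0.7}  # stand-in for the team's database

@app.route("/v1/expression/<gene_id>")
def expression(gene_id):
    # every consumer -- other teams, the public site, outside users --
    # goes through this interface; nobody reads the database directly
    if gene_id not in _EXPRESSION:
        abort(404)
    return jsonify(gene=gene_id, tpm=_EXPRESSION[gene_id])

if __name__ == "__main__":
    app.run(port=5000)
```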
More colloquially:
         “You should eat your own dogfood.”

  Design and implement the database and database
functionality to meet your own needs; and only use the
    functionality you’ve explicitly made available to
                       everyone.

To adapt this to research: database functionality should be
designed in tight integration with the researchers who are
      using it, both at the user interface level and
                     programmatically.

(Genome databases have done a really good job of this,
      albeit generally in a centralized model.)
If the “customers” aren’t integrated
into the development loop:
A platform view?

[Diagram: four expression data sets (tiling array, microarray,
and two mRNAseq sets) feed shared services -- expression
normalization, a gene ID translator, isoform
resolution/comparison, and chemical relationships -- which in
turn support higher-level uses: a metabolic model, differential
gene expression queries, and web-based data exploration.]
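
A hedged sketch of how a client might chain two of the services in this diagram -- ID translation, then a differential-expression query -- over HTTP; the hosts, routes, and parameters are all hypothetical:

```python
import requests

def diff_expression(gene_id, source_db="my_db"):
    # step 1: normalize the identifier via the gene ID translator service
    r = requests.get("http://ids.example.org/v1/translate",
                     params={"db": source_db, "id": gene_id, "to": "ensembl"})
    r.raise_for_status()
    ensembl_id = r.json()["id"]

    # step 2: query differential expression using the shared identifier
    r = requests.get(f"http://expr.example.org/v1/diff/{ensembl_id}")
    r.raise_for_status()
    return r.json()

print(diff_expression("GENE0002"))
```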
A few points
 Open source and agile software development
 approaches can be surprisingly effective and
 inexpensive.

 Developing services in small groups that include
 “customer-facing developers” helps ensure utility.

 Implementing services in the “cloud” (e.g. on virtual
 machines, or on top of “infrastructure as a service”
 offerings) gives developers flexibility in tools,
 approaches, and implementation; it also enables
 scaling and reusability.
Combining modelling with data
 Data-driven modeling: connections and parameters
  can be, to some extent, determined from data.

 Model-driven data investigation: data that doesn’t fit
  the “known” model is particularly interesting.

The second approach is essentially how particle
physicists work with accelerator data: build a model &
then interpret the data using the model.

(In biology, models are less constraining, though; more
                      unknowns.)
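
“Model-driven data investigation” in miniature: flag the measurements the model fails to predict, since those are the interesting ones. The model, data, and threshold below are all invented for illustration:

```python
import numpy as np

def model_prediction(x):
    return 2.0 * x + 1.0  # stand-in for a real mechanistic model

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
observed = np.array([1.1, 2.9, 5.2, 11.8, 9.1])  # one anomalous point

residuals = np.abs(observed - model_prediction(x))
print(x[residuals > 3.0])  # -> [3.]  the point that doesn't fit the model
```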
Using developmental models

[Figure: the sea urchin endomesoderm gene regulatory network.]

Davidson et al., http://sugp.caltech.edu/endomes
Using developmental models


    Models can contain useful abstractions of
specific processes; here, the direct effects of
 blocking nuclearization of β-catenin can be
   predicted by following the connections.

Models provide a common language for (dis)agreement
                within a community.
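
The “follow the connections” idea, sketched with networkx: in a directed gene regulatory network, the predicted direct effects of blocking a node are simply its out-neighbors. The edges below are a made-up fragment, not the actual endomesoderm network:

```python
import networkx as nx

grn = nx.DiGraph()
grn.add_edges_from([
    ("nuclear_beta_catenin", "blimp1"),
    ("nuclear_beta_catenin", "wnt8"),
    ("blimp1", "otx"),
    ("wnt8", "nuclear_beta_catenin"),  # feedback loop
])

blocked = "nuclear_beta_catenin"
print(list(grn.successors(blocked)))  # predicted direct effects of the block
```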
Using developmental models

[The endomesoderm network figure, shown again.]

Davidson et al., http://sugp.caltech.edu/endomes
Social obstacles
 Training of biologically aware software developers
 is lacking.

 Molecular biologists still largely have a
 computationally naïve mindset: “give me the
 answer so I can do the real work.”

 Incentives for data sharing -- much less for useful
 data sharing -- are not yet very strong.
   Pubs, grants, respect...


 Patterns for useful data sharing are still not well
 understood, in general.
Other places to look
 NEON and other NSF centers (e.g. NCEAS) are
 collecting vast heterogeneous data sets, and are
 explicitly tackling the data
 management/use/integration/reuse problem.

 SBML (“Systems Biology Markup Language”) is a
 model-description language that enables
 interoperability of modeling software; see the
 sketch after this list.

 Software Carpentry runs free workshops on
 effective use of computation for science.
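
What SBML’s interoperability buys, in a minimal sketch: any SBML-aware tool can load the same model file. This uses the python-libsbml bindings; the file name is a placeholder:

```python
import libsbml

doc = libsbml.readSBML("model.xml")
if doc.getNumErrors() > 0:
    doc.printErrors()

model = doc.getModel()
print(model.getId(),
      model.getNumSpecies(), "species,",
      model.getNumReactions(), "reactions")
```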
My conclusions…
 We need a “platform” mentality to make the best use
  of our data, even if we don’t completely embrace
  loose coupling and distribution.

 Agile and end-user focused software development
  methodologies have worked well in other areas; much
  of the hard technical space has already been
  explored in Internet companies (and probably social
  networking companies, too).

 Data is most useful in the context of an explicit model;
  models can be generated from data, and models can
  feed back into data gathering.
Things I didn’t discuss
 Database maintenance and active curation is
  incredibly important.

 Most data only makes sense in the context of other
  data (think: controls; wild type vs knockout; other
  backgrounds; etc.) – so we will need lots more data to
  interpret the data we already have.

 “Deep learning” is a promising field for extracting
  correlations from multiple large data sets.

 All of these technical problems are easier to solve
  than the social problems (incentives; training).
Thanks --

This talk and ancillary notes will be available on my
                     blog ~soon:
                 http://ivory.idyll.org/blog/

Please do contact me at ctb@msu.edu if you have
            questions or comments.

Editor's Notes

  • #22 (the “A few points” slide): Separation of concerns; multiple
    implementations possible; when you publish, you don’t have to talk
    to anybody to get “your method” integrated; recognition that
    everything is changing. Embrace chaos.