Separation of concerns; multiple implementations possible; when you publish, you don’t have to talk to anybody to get “your method” integrated; recognition that everything is changing. Embrace chaos.
Integrating large, fast-moving, and heterogeneous data sets in biology. C. Titus Brown Asst Prof, CSE and Microbiology; BEACON NSF STC Michigan State University email@example.com
Introduction Background: Modeling & data analysis undergrad => Open source software development + software engineering + developmental biology + genomics PhD => Bio + computer science faculty => Data driven biology Currently working with next-gen sequencing data (mRNAseq, metagenomics, difficult genomes). Thinking hard about how to do data-driven modeling & model-driven data analysis.
Goal & outline Address challenges and opportunities of heterogeneous data integration: 1000 ft view. Outline: What types of analysis and discovery do we want to enable? What are the technical challenges, common solutions, and common failure points? Where might we look for success stories, and what lessons can we port to biology? My conclusions.
Specific types of questions “I have a known chemical/gene interaction; do I see it in this other data set?” “I have a known chemical/gene interaction; what other gene expression is affected?” “What does chemical X do to overall phenotype, effect on gene expression, altered protein localization, and patterns of histone modification?” More complex/combinatorial interactions: What does this chemical do in this genetic background? What kind of additional gene expression changes are generated by the combination of these two chemicals? What are common effects of this class of chemicals?
What general behavior do we want to enable? Reuse of data by groups that did not/could not produce it. Publication of reusable/“fork”able data analysis pipelines and models. Integration of data and models. Serendipitous uses and cross-referencing of data sets (“mashups”). Rapid scientific exploration and hypothesis generation in data space.
(Executable papers & data reuse) ENCODE All data is available; all processing scripts for papers are available on a virtual machine. QIIME (microbial ecology) Amazon virtual machine containing software and data for: “Collaborative cloud-enabled tools allow rapid, reproducible biological insights.” (pmid 23096404) Digital normalization paper Amazon virtual machine, again: http://arxiv.org/abs/1203.4802
Executable papers can support easy replication & reuse of code, data. (IPython Notebook; also see RStudio) http://ged.msu.edu/papers/2012-diginorm/notebook/
An entertaining digression -- A mashup of Facebook “top 10 books by college” and per-college SAT rank http://booksthatmakeyoudumb.virgil.gr/
Technical obstacles Syntactic incompatibility The first 90% of bioinformatics: your IDs are different from my IDs. Semantic incompatibility The second 90% of bioinformatics: what does “gene” mean in your database? Impedance mismatch SQL is notoriously bad at representing intervals and hierarchies Genomes consist of intervals; ontologies consist of hierarchies! …SQL databases dominate (vs graph or object DBs). Data volume & velocity Large & expanding data sets just make everything harder. Unstructured data aka “publications” – most scientific knowledge is “locked
Typical solutions “Entity resolution” Accession numbers or other common identifiers …requires global naming system OR translators. Top down imposition of structure Centralized DB; “Here is the schema you will all use”; …limits flexibility, prevents use of unstructured data, heavyweight. Ontologies to enable “correct” communication Centrally coordinated vocabulary …slow, hard to get right, doesn’t solve unstructured data problem. Balancing theoretical rigor with practical applicability is particularly hard. Ad hoc entity resolution (“winging it”) Common solution …doesn’t work that well.
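The “translators” route to entity resolution can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `IdTranslator` class, the namespace names (`lab_a`, `lab_b`), and every identifier below are hypothetical and do not come from any real database or tool.

```python
from collections import defaultdict


class IdTranslator:
    """Translate identifiers between naming schemes (hypothetical sketch)."""

    def __init__(self):
        # (namespace, identifier) -> canonical key shared by all synonyms
        self._canonical = {}
        # canonical key -> {namespace: identifier}
        self._synonyms = defaultdict(dict)

    def add_equivalence(self, canonical, **ids):
        """Record that the given per-namespace IDs name the same entity."""
        for namespace, identifier in ids.items():
            self._canonical[(namespace, identifier)] = canonical
            self._synonyms[canonical][namespace] = identifier

    def translate(self, namespace, identifier, target):
        """Return the ID in the target namespace, or None if unresolved."""
        canonical = self._canonical.get((namespace, identifier))
        if canonical is None:
            return None
        return self._synonyms[canonical].get(target)


# Made-up identifiers, for illustration only.
translator = IdTranslator()
translator.add_equivalence("gene-1", lab_a="A0001", lab_b="geneX")

resolved = translator.translate("lab_a", "A0001", "lab_b")
unresolved = translator.translate("lab_a", "A9999", "lab_b")
```

Returning `None` for an unknown ID keeps failed resolutions visible to the caller, instead of silently guessing a match.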
Are better standards the solution? http://xkcd.com/927/
Rephrasing technical goals How can we best provide a platform or platforms to support flexible data integration and data investigation across a wide range of data sets and data types in biology? My interests: Avoid master data manager and centralization Support federated roll-out of new data and functionality Provide flexible extensibility of ontologies and hierarchies Support diverse “ecology” of databases,
Success stories outside of biology? Look for domains: with really large amounts of heterogeneous data, that are continually increasing in size, are being effectively mined on an ongoing basis, have widely used programmatic interfaces that support “mashups” and other cross-database stuff, and are intentional, with principles that we can steal or adapt.
Amazon.
Amazon: > 50 million users, > 1 million product partners, billions of reviews, dozens of compute services … Continually changing/updating data sets. Explicitly adopted a service-oriented architecture that enables both internal and external use of this data. For example, the amazon.com Web site is itself built from over 150 independent services… Amazon routinely deploys new services and functionality.
Sources: The Platform Rant (Steve Yegge) -- in which he compares the Google and Amazon approaches: https://plus.google.com/112678702228711889851/posts/eVeouesvaVX A summary at HighScalability.com: http://highscalability.com/amazon-architecture (They are both long and tech-y, note, but the first is especially entertaining.)
A brief summary of core principles Mandates from the CEO: 1. All teams must expose data and functionality solely through a service interface. 2. All communication between teams happens through that service interface. 3. All service interfaces must be designed so that they can be exposed to the outside world.
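To make the three mandates concrete, here is a minimal Python sketch of what “solely through a service interface” could look like for an expression data set. Everything in it is an assumption made for illustration: the `ExpressionService` interface, the `InMemoryExpressionService` class, the `fold_change` consumer, and the numbers are hypothetical, not Amazon's design or any existing biology API.

```python
from typing import Protocol


class ExpressionService(Protocol):
    """The only route to the expression data (hypothetical interface)."""

    def expression(self, gene_id: str) -> dict[str, float]:
        """Return expression level per condition for one gene."""
        ...


class InMemoryExpressionService:
    """One possible implementation; its storage stays private."""

    def __init__(self, table):
        self._table = {gene: dict(levels) for gene, levels in table.items()}

    def expression(self, gene_id):
        # Hand back a copy so callers cannot reach into internal storage.
        return dict(self._table.get(gene_id, {}))


def fold_change(service: ExpressionService, gene_id, treated, control):
    """A consumer that uses only the published interface."""
    levels = service.expression(gene_id)
    return levels[treated] / levels[control]


# Made-up values, for illustration only.
service = InMemoryExpressionService({"gene-1": {"control": 2.0, "treated": 8.0}})
ratio = fold_change(service, "gene-1", "treated", "control")
```

Because `fold_change` depends only on the interface, the in-memory class could be swapped for a remote service without changing the consumer, which is the point of the third mandate.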
More colloquially: “You should eat your own dogfood.” Design and implement the database and database functionality to meet your own needs; and only use the functionality you’ve explicitly made available to everyone. To adapt to research: database functionality should be designed in tight integration with the researchers who are using it, both at a user interface level and programmatically. (Genome databases have done a really good job of this, albeit generally in a centralized model.)
If the “customers” aren’t integrated into the development loop:
A platform view? [Figure: platform diagram with boxes labeled: differential gene expression query; data exploration (WWW); metabolic model; gene ID translator; isoform resolution/comparison; chemical relationships; expression normalization; expression data (tiling); expression data (microarray); expression data (mRNAseq); expression data II (mRNAseq).]
A few points Open source and agile software development approaches can be surprisingly effective and inexpensive. Developing services in small groups that include “customer-facing developers” helps ensure utility. Implementing services in the “cloud” (e.g. virtual machines, or on top of “infrastructure as a service” services) gives developers flexibility in tools, approaches, implementation; also enables scaling and reusability.
Combining modeling with data Data-driven modeling: connections and parameters can be, to some extent, determined from data. Model-driven data investigation: data that doesn’t fit the “known” model is particularly interesting. The second approach is essentially how particle physicists work with accelerator data: build a model & then interpret the data using the model. (In biology, models are less constraining, though; more unknowns.)
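Model-driven data investigation can be sketched as a residual check: predict each observation from the “known” model and flag whatever deviates by more than a chosen tolerance. The Python below is a toy under stated assumptions: `flag_unexpected`, the linear dose model, the tolerance, and all values are hypothetical.

```python
def flag_unexpected(observations, model, tolerance):
    """Return (condition, measured, predicted) for points the model misses."""
    flagged = []
    for condition, measured in observations.items():
        predicted = model(condition)
        if abs(measured - predicted) > tolerance:
            flagged.append((condition, measured, predicted))
    return flagged


def linear_dose_model(dose):
    """Hypothetical 'known' model: expression rises linearly with dose."""
    return 1.0 + 0.5 * dose


# Made-up measurements keyed by dose, for illustration only.
observations = {0: 1.1, 2: 2.0, 4: 2.9, 8: 9.0}
surprises = flag_unexpected(observations, linear_dose_model, tolerance=0.5)
```

With these made-up numbers only the dose-8 measurement is flagged; in practice the flagged points are the ones worth a closer look, since they are either noise or something the model does not yet capture.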
Using developmental models Davidson et al., http://sugp.caltech.edu/endomes
Using developmental models Models can contain useful abstractions of specific processes; here, the direct effects of blocking nuclearization of β-catenin can be predicted by following the connections. Models provide a common language for (dis)agreement within a community.
Social obstacles Training of biologically aware software developers is lacking. Molecular biologists still largely have a computationally naïve mindset: “give me the answer so I can do the real work” Incentives for data sharing, much less useful data sharing, are not yet very strong. Pubs, grants, respect... Patterns for useful data sharing are still not well understood, in general.
Other places to look NEON and other NSF centers (e.g. NCEAS) are collecting vast heterogeneous data sets, and are explicitly tackling the data management/use/integration/reuse problem. SBML (“Systems Biology Markup Language”) is a model description language that enables interoperability of modeling software. Software Carpentry runs free workshops on effective use of computation for science.
My conclusions… We need a “platform” mentality to make the most use of our data, even if we don’t completely embrace loose coupling and distribution. Agile and end-user focused software development methodologies have worked well in other areas; much of the hard technical space has already been explored in Internet companies (and probably social networking companies, too). Data is most useful in the context of an explicit model; models can be generated from data, and models can feed back into data gathering.
Things I didn’t discuss Database maintenance and active curation are incredibly important. Most data only makes sense in the context of other data (think: controls; wild type vs knockout; other backgrounds; etc.) – so we will need lots more data to interpret the data we already have. “Deep learning” is a promising field for extracting correlations from multiple large data sets. All of these technical problems are easier to solve than the social problems (incentives; training).
Thanks -- This talk and ancillary notes will be available on my blog ~soon: http://ivory.idyll.org/blog/ Please do contact me at firstname.lastname@example.org if you have questions or comments.