This document discusses challenges and opportunities for integrating large, heterogeneous biological data sets. It outlines the types of analysis and discovery that could be enabled, such as comparing data across studies. Technical challenges include incompatible identifiers and schemas between data sources. Common solutions attempt standardization but have limitations. The document examines Amazon's approach as a model, with principles like exposing all data through programmatic interfaces. It argues for a "platform" approach and combining data-driven and model-driven analysis to gain new insights. Developing services with end users in mind could help maximize data reuse.
1. Integrating large, fast-moving, and heterogeneous data sets in biology.
C. Titus Brown
Asst Prof, CSE and Microbiology; BEACON NSF STC
Michigan State University
ctb@msu.edu
2. Introduction
Background: modeling & data analysis undergrad => open source software development + software engineering + developmental biology + genomics PhD => bio + computer science faculty => data-driven biology.
Currently working with next-gen sequencing data (mRNAseq, metagenomics, difficult genomes).
Thinking hard about how to do data-driven modeling & model-driven data analysis.
3. Goal & outline
Address challenges and opportunities of heterogeneous data integration: 1000 ft view.
Outline:
What types of analysis and discovery do we want to enable?
What are the technical challenges, common solutions, and common failure points?
Where might we look for success stories, and what lessons can we port to biology?
My conclusions.
4. Specific types of questions
“I have a known chemical/gene interaction; do I see it in this other data set?”
“I have a known chemical/gene interaction; what other gene expression is affected?”
“What does chemical X do to overall phenotype, effect on gene expression, altered protein localization, and patterns of histone modification?”
More complex/combinatorial interactions:
What does this chemical do in this genetic background?
What kind of additional gene expression changes are generated by the combination of these two chemicals?
What are common effects of this class of chemicals?
5. What general behavior do we want to enable?
Reuse of data by groups that did not/could not produce it.
Publication of reusable/“fork”able data analysis pipelines and models.
Integration of data and models.
Serendipitous uses and cross-referencing of data sets (“mashups”).
Rapid scientific exploration and hypothesis generation in data space.
6. (Executable papers & data reuse)
ENCODE
All data is available; all processing scripts for papers are available on a virtual machine.
QIIME (microbial ecology)
Amazon virtual machine containing software and data for: “Collaborative cloud-enabled tools allow rapid, reproducible biological insights.” (pmid 23096404)
Digital normalization paper
Amazon virtual machine, again: http://arxiv.org/abs/1203.4802
7. Executable papers can support easy replication & reuse of code, data.
(IPython Notebook; also see RStudio)
http://ged.msu.edu/papers/2012-diginorm/notebook/
8. What general behavior do we want to enable?
Reuse of data by groups that did not/could not produce it.
Publication of reusable/“fork”able data analysis pipelines and models.
Integration of data and models.
Serendipitous uses and cross-referencing of data sets (“mashups”).
Rapid scientific exploration and hypothesis generation in data space.
9. An entertaining digression --
A mashup of Facebook “top 10 books by college” and per-college SAT rank:
http://booksthatmakeyoudumb.virgil.gr/
10. Technical obstacles
Syntactic incompatibility
The first 90% of bioinformatics: your IDs are different from my IDs.
Semantic incompatibility
The second 90% of bioinformatics: what does “gene” mean in your database?
Impedance mismatch
SQL is notoriously bad at representing intervals and hierarchies. Genomes consist of intervals; ontologies consist of hierarchies! …SQL databases dominate (vs graph or object DBs).
Data volume & velocity
Large & expanding data sets just make everything harder.
Unstructured data
aka “publications”: most scientific knowledge is “locked up” in free text.
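To make the impedance mismatch concrete, here is a minimal sketch in Python (the schema, feature names, and coordinates are invented for illustration): an interval-overlap query is expressible in SQL, but only as a pair of inequalities that a plain B-tree index serves poorly, since there is no native interval type.

```python
import sqlite3

# Toy schema (hypothetical): genomic features with intervals flattened
# into two integer columns -- SQL has no first-class interval type.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (id TEXT, chrom TEXT, start INTEGER, stop INTEGER)")
conn.executemany(
    "INSERT INTO features VALUES (?, ?, ?, ?)",
    [("geneA", "chr1", 100, 500),
     ("geneB", "chr1", 450, 900),
     ("geneC", "chr2", 10, 80)],
)

# "Which features overlap chr1:400-600?" becomes two inequalities;
# without an R-tree index or a binning scheme, this degenerates into
# a scan on large tables -- the mismatch described above.
rows = conn.execute(
    "SELECT id FROM features WHERE chrom = ? AND start < ? AND stop > ?",
    ("chr1", 600, 400),
).fetchall()
print(rows)  # [('geneA',), ('geneB',)]
```

Hierarchies (ontologies) are similarly awkward: walking an ancestor chain takes recursive queries that graph or object databases express natively.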
11. Typical solutions
“Entity resolution”
Accession numbers or other common identifiers; …requires global naming system OR translators.
Top down imposition of structure
Centralized DB; “Here is the schema you will all use”; …limits flexibility, prevents use of unstructured data, heavyweight.
Ontologies to enable “correct” communication
Centrally coordinated vocabulary; …slow, hard to get right, doesn’t solve unstructured data problem. Balancing theoretical rigor with practical applicability is particularly hard.
Ad hoc entity resolution (“winging it”)
Common solution; …doesn’t work that well.
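As a sketch of the first solution, entity resolution via translators, the toy Python below (all database names, local IDs, accessions, and values are hypothetical) maps each source's local identifiers onto a shared accession before cross-referencing two expression data sets.

```python
# Hypothetical translation table: (source database, local ID) -> shared accession.
ID_MAP = {
    ("dbA", "gA-0001"): "ACC:17",
    ("dbB", "7734"):    "ACC:17",
    ("dbB", "9102"):    "ACC:42",
}

def resolve(db, local_id):
    """Translate a database-local ID to the shared accession (None if unknown)."""
    return ID_MAP.get((db, local_id))

# Two expression data sets, each keyed by its own database's IDs.
expr_a = {"gA-0001": 2.5}
expr_b = {"7734": 2.7, "9102": 0.1}

# Cross-referencing reduces to a join on the shared accession.
merged = {}
for local_id, value in expr_a.items():
    merged.setdefault(resolve("dbA", local_id), []).append(value)
for local_id, value in expr_b.items():
    merged.setdefault(resolve("dbB", local_id), []).append(value)
print(merged)  # {'ACC:17': [2.5, 2.7], 'ACC:42': [0.1]}
```

The hard part is not the join but maintaining ID_MAP, which is exactly why a global naming system or dedicated translators are required.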
13. Rephrasing technical goals
How can we best provide a platform or platforms to support flexible data integration and data investigation across a wide range of data sets and data types in biology?
My interests:
Avoid a master data manager and centralization
Support federated roll-out of new data and functionality
Provide flexible extensibility of ontologies and hierarchies
Support a diverse “ecology” of databases
14. Success stories outside of biology?
Look for domains:
with really large amounts of heterogeneous data,
that are continually increasing in size,
that are being effectively mined on an ongoing basis,
that have widely used programmatic interfaces that support “mashups” and other cross-database stuff,
and that are intentional, with principles that we can steal or adapt.
15. Success stories outside of biology?
Look for domains:
with really large amounts of heterogeneous data,
that are continually increasing in size,
that are being effectively mined on an ongoing basis,
that have widely used programmatic interfaces that support “mashups” and other cross-database stuff,
and that are intentional, with principles that we can steal or adapt.
Amazon.
16. Amazon:
> 50 million users, > 1 million product partners, billions of reviews, dozens of compute services …
Continually changing/updating data sets.
Explicitly adopted a service-oriented architecture that enables both internal and external use of this data.
For example, the amazon.com Web site is itself built from over 150 independent services…
Amazon routinely deploys new services and functionality.
17. Sources:
The Platform Rant (Steve Yegge) -- in which he compares the Google and Amazon approaches:
https://plus.google.com/112678702228711889851/posts/eVeouesvaVX
A summary at HighScalability.com:
http://highscalability.com/amazon-architecture
(They are both long and tech-y, note, but the first is especially entertaining.)
18. A brief summary of core principles
Mandates from the CEO:
1. All teams must expose data and functionality solely through a service interface.
2. All communication between teams happens through that service interface.
3. All service interfaces must be designed so that they can be exposed to the outside world.
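As a minimal sketch of mandate 1 (the service name, data, and port are invented; a real deployment would add authentication, versioning, and a proper framework): the only way to read this hypothetical team's expression values, internally or externally, is through its HTTP service interface.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical internal data store; never exposed directly, only via the service.
EXPRESSION = {"ACC:17": 2.5, "ACC:42": 0.1}

class ExpressionService(BaseHTTPRequestHandler):
    def do_GET(self):
        # Handles e.g. GET /expression/ACC:17
        parts = self.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] == "expression" and parts[1] in EXPRESSION:
            body = json.dumps({"id": parts[1], "value": EXPRESSION[parts[1]]}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # Because the interface is designed to be external-ready (mandate 3),
    # exposing it to the outside world is a deployment decision, not a redesign.
    HTTPServer(("localhost", 8000), ExpressionService).serve_forever()
```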
19. More colloquially:
“You should eat your own dogfood.”
Design and implement the database and database functionality to meet your own needs; and only use the functionality you’ve explicitly made available to everyone.
To adapt to research: database functionality should be designed in tight integration with the researchers who are using it, both at a user interface level and programmatically.
(Genome databases have done a really good job of this, albeit generally in a centralized model.)
21. A platform view?
[Diagram: a layer of small, composable services (gene ID translator; isoform resolution/comparison; chemical relationships; expression normalization; differential gene expression query; metabolic model; WWW data exploration) built on top of four expression data sets: tiling, microarray, and two mRNAseq sets.]
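To suggest how such a platform composes, here is a toy sketch in which three of the diagram's boxes are stand-in functions (in a real deployment each would be an independent service behind its own interface, per the mandates above); a new analysis is just a composition of published interfaces, with no coordination required among the service owners.

```python
# All names and numbers below are invented stand-ins for the diagram's services.

def translate_ids(ids, source_db):
    """Gene ID translator service (toy lookup table)."""
    table = {("dbA", "gA-0001"): "ACC:17", ("dbA", "gA-0002"): "ACC:42"}
    return [table.get((source_db, i)) for i in ids]

def normalize(values):
    """Expression normalization service (toy total-count scaling)."""
    total = sum(values)
    return [v / total for v in values]

def differential(norm_a, norm_b):
    """Differential gene expression query service (toy difference)."""
    return [b - a for a, b in zip(norm_a, norm_b)]

# A "mashup" built purely from the services' interfaces:
accessions = translate_ids(["gA-0001", "gA-0002"], "dbA")
diffs = differential(normalize([2.0, 6.0]), normalize([3.0, 5.0]))
print(dict(zip(accessions, diffs)))  # {'ACC:17': 0.125, 'ACC:42': -0.125}
```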
22. A few points
Open source and agile software development approaches can be surprisingly effective and inexpensive.
Developing services in small groups that include “customer-facing developers” helps ensure utility.
Implementing services in the “cloud” (e.g. virtual machines, or on top of “infrastructure as a service” services) gives developers flexibility in tools, approaches, and implementation; it also enables scaling and reusability.
23. Combining modeling with data
Data-driven modeling: connections and parameters can be, to some extent, determined from data.
Model-driven data investigation: data that doesn’t fit the “known” model is particularly interesting.
The second approach is essentially how particle physicists work with accelerator data: build a model & then interpret the data using the model.
(In biology, models are less constraining, though; more unknowns.)
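A minimal sketch of both directions on synthetic data (the model, noise level, and threshold are invented for illustration): the slope of a simple linear model is determined from the data (data-driven modeling), and observations the fitted model fails to explain are then flagged as the interesting ones (model-driven data investigation).

```python
import random

random.seed(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + random.gauss(0, 0.1) for x in xs]
ys[25] += 1.5  # plant one observation the model should NOT explain

# Data-driven modeling: the parameter of y = a*x is determined from the
# data by least squares.
a = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Model-driven investigation: points whose residuals fall far outside the
# noise are the "data that doesn't fit the known model".
residuals = [y - a * x for x, y in zip(xs, ys)]
sigma = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
interesting = [(x, round(y, 2)) for x, y, r in zip(xs, ys, residuals)
               if abs(r) > 3 * sigma]
print(f"fitted a = {a:.2f}; flagged points: {interesting}")
```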
26. Using developmental models
Models can contain useful abstractions of specific processes; here, the direct effects of blocking nuclearization of β-catenin can be predicted by following the connections.
Models provide a common language for (dis)agreement within a community.
28. Social obstacles
Training of biologically aware software developers is lacking.
Molecular biologists are still very much of a computationally naïve mindset: “give me the answer so I can do the real work.”
Incentives for data sharing, much less useful data sharing, are not yet very strong. Pubs, grants, respect...
Patterns for useful data sharing are still not well understood, in general.
29. Other places to look
NEON and other NSF centers (e.g. NCEAS) are collecting vast heterogeneous data sets, and are explicitly tackling the data management/use/integration/reuse problem.
SBML (“Systems Biology Markup Language”) is a model description language that enables interoperability of modeling software.
Software Carpentry runs free workshops on effective use of computation for science.
30. My conclusions…
We need a “platform” mentality to make the most use of our data, even if we don’t completely embrace loose coupling and distribution.
Agile and end-user focused software development methodologies have worked well in other areas; much of the hard technical space has already been explored in Internet companies (and probably social networking companies, too).
Data is most useful in the context of an explicit model; models can be generated from data, and models can feed back into data gathering.
31. Things I didn’t discuss
Database maintenance and active curation is incredibly important.
Most data only makes sense in the context of other data (think: controls; wild type vs knockout; other backgrounds; etc.), so we will need lots more data to interpret the data we already have.
“Deep learning” is a promising field for extracting correlations from multiple large data sets.
All of these technical problems are easier to solve than the social problems (incentives; training).
32. Thanks --
This talk and ancillary notes will be available on my blog ~soon: http://ivory.idyll.org/blog/
Please do contact me at ctb@msu.edu if you have questions or comments.
Editor's Notes
Separation of concerns; multiple implementations possible; when you publish, you don’t have to talk to anybody to get “your method” integrated; recognition that everything is changing. Embrace chaos.