Is one enough? Data warehousing for biomedical research
1. Is one enough?
Data warehousing for biomedical research
Gregory Landrum1, Matthias Wrobel2, Nicholas Clare2
1 KNIME.com AG
2 Novartis Institutes for BioMedical Research, Basel
2016 Basel Life Sciences Week
21 September 2016
2. Overview
§ Motivation: why is this both important and hard?
§ Three data warehouse case studies
§ Is one enough?
3. Challenges for real-world data management and analysis
§ Lots of heterogeneous data from multiple sources, both
internal and external
§ Source data are frequently messy and unstructured
§ Constant flow of new data into the system
§ Diverse stakeholders and users
§ Highly diverse and complex questions to ask of the data
§ Serious performance requirements
4. Storing and managing real-world data
§ Warehouse vs mart vs federation vs “data lake” vs linked
data/triple store vs …
§ Many, many different approaches, technologies, and
architectures.
§ Most are applicable in some scenarios but there is no
silver bullet.
Stonebraker, Michael, and Uğur Çetintemel. “'One size fits all': an idea whose
time has come and gone.” Proceedings, 21st International Conference on Data
Engineering (ICDE 2005). IEEE, 2005.
https://cs.brown.edu/~ugur/fits_all.pdf
5. Storing the data isn’t the end of the story
You probably want to be able to get the data back out.
http://flickr.com/photos/35703177@N00/1063555182
Extracting insights from a data lake
Jokes aside, allowing the data to be queried
and retrieved efficiently is essential
7. Nature of the data
Shape of the data generated for a project:
§ Hit finding: 10^6 rows, 1-2 columns
§ Hit-to-lead: 10^3 rows, 5-10 columns
§ Lead optimization: 10^2 rows, 10^2 columns
§ Clinic: 1 row, 10^4 columns
“omics” data (can appear at multiple stages) is different still
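The change in shape across stages can be made concrete by multiplying rows by columns. The following is a minimal sketch, reading the slide's row and column counts as powers of ten and taking the upper end of each column range; the dictionary and function names are illustrative, not from the talk.

```python
# Approximate table shapes per project stage, as (rows, columns).
# Assumption: counts are powers of ten; ranges use their upper bound.
STAGE_SHAPES = {
    "Hit finding": (10**6, 2),
    "Hit-to-lead": (10**3, 10),
    "Lead optimization": (10**2, 10**2),
    "Clinic": (1, 10**4),
}


def cell_count(shape):
    """Number of individual values in a table of the given shape."""
    rows, cols = shape
    return rows * cols


def orientation(shape):
    """Classify a table as tall, wide, or square."""
    rows, cols = shape
    if rows > cols:
        return "tall"
    if cols > rows:
        return "wide"
    return "square"


for stage, shape in STAGE_SHAPES.items():
    rows, cols = shape
    print(f"{stage:18s} {rows:>9,d} x {cols:>6,d} "
          f"= {cell_count(shape):>9,d} cells ({orientation(shape)})")
```

Under these assumptions the number of values per table shrinks only modestly after hit finding, while the table flips from tall to wide, which is one reason a single storage layout is unlikely to suit every stage.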
10. It’s more than just standard queries and reports
§ We also want to enable data scientists (informaticians)
§ They are going to generally want to ask more complex and varied
questions
§ Will likely want to retrieve larger data quantities
§ Would be great to help them with their 80% problem (the large share of time that goes into finding and preparing data rather than analyzing it)
12. Real-world case studies
§ Avalon:
• In production, maintained, and in active use for 15 years.
§ MAGMA:
• In production, maintained, and in active use for >5 years.
§ Entity Warehouse (EW):
• In active development
13. Avalon
§ One table/view per “fact_type” (maps roughly to assay)
• Typical table has about 10 columns
• Big table has about 100 columns
§ One row per measurement
• 10s of rows for short-lived assays
• Typically hundreds to thousands of rows
• More than a million rows for HTS
§ ~30K tables/views
§ Additional tables defining structure of the fact tables
§ Little metadata
§ Tightly coupled to a UI
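The layout described above (one table per fact type, one row per measurement, plus tables that describe the fact tables) might be sketched as follows. This is only an illustration using SQLite from the Python standard library; the table names, column names, and the `create_fact_table` helper are hypothetical and do not reflect the actual Avalon schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Catalogue table describing the structure of each fact table
# (hypothetical name and columns).
conn.execute(
    "CREATE TABLE fact_type_columns "
    "(fact_type TEXT, column_name TEXT, data_type TEXT)"
)


def create_fact_table(conn, fact_type, columns):
    """Create one table for a fact type and register its columns."""
    cols_sql = ", ".join(f'"{name}" {dtype}' for name, dtype in columns)
    conn.execute(f'CREATE TABLE "fact_{fact_type}" ({cols_sql})')
    conn.executemany(
        "INSERT INTO fact_type_columns VALUES (?, ?, ?)",
        [(fact_type, name, dtype) for name, dtype in columns],
    )


create_fact_table(conn, "kinase_ic50",
                  [("compound_id", "TEXT"), ("ic50_um", "REAL"), ("run_date", "TEXT")])
create_fact_table(conn, "solubility",
                  [("compound_id", "TEXT"), ("sol_mg_ml", "REAL")])

# One row per measurement.
conn.executemany(
    "INSERT INTO fact_kinase_ic50 VALUES (?, ?, ?)",
    [("CPD-1", 0.12, "2016-01-05"), ("CPD-2", 3.4, "2016-01-05")],
)

# With tens of thousands of fact tables, finding every table that carries
# a given column means consulting the catalogue first.
tables_with_compound = [
    row[0]
    for row in conn.execute(
        "SELECT fact_type FROM fact_type_columns "
        "WHERE column_name = 'compound_id' ORDER BY fact_type"
    )
]
print(tables_with_compound)
```

The trade-off this sketch tries to show: each table is simple to read, but any question that spans assays has to be assembled from many differently shaped tables.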
14. MAGMA
§ Intended to be “the” warehouse
§ Similar type of schema as ChEMBL
§ Results stored in a tall and skinny table
§ Columns for all primitive data types (string, float, int, etc.)
§ ~2 billion rows
§ Tables with metadata
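A tall and skinny results table with one value column per primitive type could look roughly like the sketch below. It uses SQLite from the Python standard library; the table, the column names, and the `add_result` helper are assumptions made for illustration, not the MAGMA schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE result (
        compound_id  TEXT,
        assay_id     TEXT,
        result_type  TEXT,
        value_string TEXT,
        value_float  REAL,
        value_int    INTEGER
    )
""")


def add_result(conn, compound_id, assay_id, result_type, value):
    """Store a value in the column that matches its Python type."""
    if isinstance(value, bool):
        raise TypeError("bool is not handled in this sketch")
    slots = {"value_string": None, "value_float": None, "value_int": None}
    if isinstance(value, int):
        slots["value_int"] = value
    elif isinstance(value, float):
        slots["value_float"] = value
    else:
        slots["value_string"] = str(value)
    conn.execute(
        "INSERT INTO result VALUES (?, ?, ?, ?, ?, ?)",
        (compound_id, assay_id, result_type,
         slots["value_string"], slots["value_float"], slots["value_int"]),
    )


add_result(conn, "CPD-1", "A1", "IC50_uM", 0.12)
add_result(conn, "CPD-1", "A1", "n_replicates", 3)
add_result(conn, "CPD-1", "A1", "comment", "partial curve")
add_result(conn, "CPD-2", "A1", "IC50_uM", 3.4)

# Pivot the tall table back to one row per compound.
wide = conn.execute("""
    SELECT compound_id,
           MAX(CASE WHEN result_type = 'IC50_uM' THEN value_float END),
           MAX(CASE WHEN result_type = 'n_replicates' THEN value_int END)
    FROM result
    GROUP BY compound_id
    ORDER BY compound_id
""").fetchall()
print(wide)
```

New result types need no schema change in this layout, but most analyses start with a pivot like the one at the end, which is where row counts in the billions start to matter.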
15. Entity Warehouse
§ Designed to accommodate both internal and external data
§ Central concepts are the entity and entity-entity linkage
§ “Assays” stored as entities with info about their result types and
associated metadata
§ One table per result type (e.g. Activity-Concentration, Activity-Percent)
• About a dozen result types
§ One row per measurement
Current size:
• 10s of millions of rows for Activity-Concentration
• ~100 million rows for Activity-Percent
§ Links/drilldown to original data/systems.
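The one-table-per-result-type layout, with assays stored as entities and each row keeping a pointer to its source, might be sketched like this. It uses SQLite from the Python standard library; all table names, column names, and source references are hypothetical placeholders rather than the real Entity Warehouse schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (
        entity_id TEXT PRIMARY KEY, entity_type TEXT, name TEXT);
    -- One table per result type, one row per measurement.
    CREATE TABLE activity_concentration (
        assay_id TEXT, compound_id TEXT, value_um REAL, source_ref TEXT);
    CREATE TABLE activity_percent (
        assay_id TEXT, compound_id TEXT, percent REAL, source_ref TEXT);
""")

conn.executemany("INSERT INTO entity VALUES (?, ?, ?)", [
    ("A1", "assay", "Kinase X biochemical"),
    ("A2", "assay", "Kinase X cellular"),
    ("C1", "compound", "CPD-1"),
])
conn.executemany("INSERT INTO activity_concentration VALUES (?, ?, ?, ?)", [
    ("A1", "C1", 0.12, "source-system-1/run/17"),
    ("A2", "C1", 1.5, "source-system-2/run/4"),
])
conn.execute("INSERT INTO activity_percent VALUES (?, ?, ?, ?)",
             ("A1", "C1", 87.0, "source-system-1/run/9"))

# All concentration-type results for one compound, across assays,
# come from a single table; source_ref supports drilldown.
rows = conn.execute("""
    SELECT e.name, r.value_um, r.source_ref
    FROM activity_concentration AS r
    JOIN entity AS e ON e.entity_id = r.assay_id
    WHERE r.compound_id = 'C1'
    ORDER BY r.value_um
""").fetchall()
print(rows)
```

Compared with one table per assay, a cross-assay question here touches about a dozen tables at most; compared with a single tall table, each result type keeps properly typed, named columns.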
16. Entity Warehouse: the entity
§ Used to represent the business objects (scientific or otherwise) of
interest
• Compounds
• Samples
• “Assays”
• Proteins
• Projects
• People
• Assay results
• Documents
• etc…
§ Model for entity type stored in a central location
§ Entities can be linked and grouped
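As a rough illustration of typed entities that can be linked and grouped, consider the in-memory sketch below. The `Entity` and `EntityStore` classes, the link type names, and the list of allowed types (standing in for the centrally stored type model) are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    entity_id: str
    entity_type: str
    name: str


class EntityStore:
    def __init__(self, allowed_types):
        # Stand-in for the centrally stored model of entity types.
        self.allowed_types = set(allowed_types)
        self.entities = {}
        self.links = defaultdict(set)   # source_id -> {(link_type, target_id)}
        self.groups = defaultdict(set)  # group name -> {entity_id}

    def _require(self, entity_id):
        if entity_id not in self.entities:
            raise KeyError(f"unknown entity: {entity_id}")

    def add(self, entity):
        if entity.entity_type not in self.allowed_types:
            raise ValueError(f"unknown entity type: {entity.entity_type}")
        self.entities[entity.entity_id] = entity

    def link(self, source_id, link_type, target_id):
        self._require(source_id)
        self._require(target_id)
        self.links[source_id].add((link_type, target_id))

    def group(self, group_name, entity_ids):
        for entity_id in entity_ids:
            self._require(entity_id)
            self.groups[group_name].add(entity_id)

    def linked(self, source_id, link_type):
        """Sorted ids of entities linked from source_id by link_type."""
        pairs = self.links.get(source_id, ())
        return sorted(target for kind, target in pairs if kind == link_type)


store = EntityStore(["compound", "sample", "assay", "project"])
store.add(Entity("C1", "compound", "CPD-1"))
store.add(Entity("S1", "sample", "CPD-1 batch 1"))
store.add(Entity("S2", "sample", "CPD-1 batch 2"))
store.add(Entity("P1", "project", "Project X"))
store.link("C1", "has_sample", "S1")
store.link("C1", "has_sample", "S2")
store.link("P1", "has_compound", "C1")
store.group("project-x-samples", ["S1", "S2"])
print(store.linked("C1", "has_sample"))
```

Rejecting unknown entity types at insertion time is one simple way a central type model can keep heterogeneous sources consistent; a production system would persist all of this rather than hold it in dictionaries.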
20. Loading the data¹
§ The Entity Warehouse is only one part of a large, multi-year data
integration project.
§ The majority of the thought and effort has gone into how to properly
integrate heterogeneous internal and external data sources
§ Conversion to entities, link resolution, some normalization
§ Preservation of links to original data systems
§ Strong focus on performance/timeliness of the load
§ Once the data are loaded: make them broadly accessible (helping with
that 80% rule for data scientists)
¹ The 80% rule affects us too.
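The load steps named on this slide (conversion to entities, some normalization, link resolution, and preserving links to the original systems) can be sketched as a small pipeline. Everything here is an assumption made for illustration: the record layout, the function names, and the normalization rule are hypothetical and much simpler than a real integration project.

```python
def normalize_id(raw):
    """Normalization step: trim whitespace and upper-case the identifier."""
    return raw.strip().upper()


def to_entity(record, source_system):
    """Convert a source record to an entity, keeping the original reference."""
    return {
        "entity_id": normalize_id(record["id"]),
        "entity_type": record["type"],
        # The original identifier is kept verbatim for drilldown.
        "source": f"{source_system}:{record['id']}",
        "raw_links": list(record.get("links", [])),
    }


def resolve_links(entities):
    """Split raw links into those pointing at known entities and the rest."""
    known = {entity["entity_id"] for entity in entities}
    resolved, unresolved = [], []
    for entity in entities:
        for raw in entity["raw_links"]:
            target = normalize_id(raw)
            bucket = resolved if target in known else unresolved
            bucket.append((entity["entity_id"], target))
    return resolved, unresolved


# Hypothetical records from a registration system.
records = [
    {"id": " cpd-1 ", "type": "compound"},
    {"id": "s-9", "type": "sample", "links": ["CPD-1", "cpd-404"]},
]
entities = [to_entity(record, "registration") for record in records]
resolved, unresolved = resolve_links(entities)
print(resolved)
print(unresolved)
```

Keeping unresolved links in a separate list, rather than dropping them, leaves room for a later curation step to repair them once the missing entity arrives.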
21. CDF architecture
[Figure: layered architecture diagram. Labels recovered from the slide:]
§ Layers: Source Layer, Consolidation Layer, Integration Layer, Access Layer
§ Access Layer: visualization/reporting tools and user interfaces
§ Services and stores: Update Services, Entity Services, Entity Warehouse, Search Indexes, Custom Datamart, …
§ Data categories: Entities, Assays, Facts, Workflow
§ Source systems: registration systems, assay metadata systems, assay data systems, logistics systems
§ Other components: Curation Framework, Entity Instance Reference, Entity & Property Definitions, Fact Instance Reference
22. One size really doesn’t fit all
§ Just as there is no perfect database
technology for all situations, we don't
think that there's a perfect research
data warehouse for all use cases.
§ The Entity Warehouse will contain
most of the data and meet 90% of the
needs,¹ but there are still going to end
up being multiple “warehouses”
§ We will encourage and support the
building and use of data marts by data
scientists and will make it easy to keep
them up to date
§ The warehouse(s) is/are just one piece
of the full data ecosystem
https://pixabay.com/en/dyke-road-hamburg-port-homes-41832/
¹ At least we hope so. When it comes to
enabling broad usage of the various types
of 'omics data we'll need to see.
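One way to keep a project-specific data mart up to date is an incremental refresh that copies only warehouse rows the mart has not yet seen. The sketch below uses SQLite from the Python standard library and a simple high-water mark on a load identifier; the tables, columns, and `refresh_mart` function are hypothetical and are not a description of the CDF tooling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse_result (
        load_id INTEGER PRIMARY KEY, project TEXT, compound_id TEXT, value REAL);
    CREATE TABLE mart_project_x (
        load_id INTEGER PRIMARY KEY, compound_id TEXT, value REAL);
""")


def refresh_mart(conn, project):
    """Copy new warehouse rows for one project; return how many were added."""
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(load_id), 0) FROM mart_project_x"
    ).fetchone()
    before = conn.total_changes
    conn.execute(
        "INSERT INTO mart_project_x "
        "SELECT load_id, compound_id, value FROM warehouse_result "
        "WHERE project = ? AND load_id > ?",
        (project, watermark),
    )
    return conn.total_changes - before


conn.executemany("INSERT INTO warehouse_result VALUES (?, ?, ?, ?)", [
    (1, "X", "C1", 0.1),
    (2, "Y", "C9", 5.0),
    (3, "X", "C2", 0.2),
])
first = refresh_mart(conn, "X")

# New data arrives in the warehouse; only the new row is copied.
conn.execute("INSERT INTO warehouse_result VALUES (?, ?, ?, ?)", (4, "X", "C3", 0.3))
second = refresh_mart(conn, "X")
third = refresh_mart(conn, "X")
print(first, second, third)
```

The high-water mark only works if load identifiers increase monotonically and rows are never updated in place; a warehouse that revises existing rows would need a different change-tracking scheme.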
23. One size really doesn’t fit all
https://pixabay.com/en/dyke-road-hamburg-port-homes-41832/
§ Maybe this is actually a hopeful
message from the point of view of
a possible standardized
warehouse
§ If there’s only one warehouse, it’s
probably going to be *mine*
§ If I’m using more than one
“warehouse”, then I’m much more
willing to talk about using
something standardized for one of
them
24. Acknowledgements
Past and present members of the
Avalon, MAGMA, and CDF teams:
Bernd Rohde
Joe Ringgenberg
Mathias Asp
Andre Zelenkovas
Ryan Muller
Sandra Mueller
Artem Mitrokhin
Recca Chatterjee
Nabil Hachem
Andreas Koeller
Mark Schreiber
Barry Frishberg
Thomas Mueller
Alberto Gobbi
Peter Ertl
Paul Selzer
Werner Braun
and many more
…
Past and present members of
NIBR NX leadership:
Remy Evard
Steve Cleaver
Ken Robbins
Patrick Warren
Andy Palmer