Transparency in the Data Supply Chain


Domains such as drug discovery, data science, and policy studies increasingly rely on combining complex analysis pipelines with integrated data sources to reach conclusions. A key question then arises: what are these conclusions based upon? There is thus a tension between integrating data for analysis and understanding where that data comes from (its provenance). In this talk, I describe recent work that attempts to facilitate transparency by combining provenance tracked within databases with the data integration and analytics pipelines that feed them. I discuss this with respect to use cases from public policy as well as drug discovery.

Published in: Technology

  1. Transparency in the Data Supply Chain. Paul Groth (@pgroth), Web & Media Group, Department of Computer Science, VU University Amsterdam
  2. Outline
     • Data integration for analysis
       – i.e. remixing data
     • The need for transparency
     • Two solutions
     • The future
  3. @Open_PHACTS
  4. Public Domain Drug Discovery Data: Pharma are accessing, processing, storing & re-processing. [Diagram labels: Literature; Genbank; Patents; PubChem; Data Integration; Databases; Data Analysis; Downloads; Firewalled Databases; x Repeat @ each company] Why?
  5. Prioritised Research Questions (each row: Number, sum, Nr of 1, then the Question)
     • 15, 12, 9: All oxido-reductase inhibitors active <100nM in both human and mouse
     • 18, 14, 8: Given compound X, what is its predicted secondary pharmacology? What are the on- and off-target safety concerns for a compound? What is the evidence and how reliable is that evidence (journal impact factor, KOL) for findings associated with a compound?
     • 24, 13, 8: Given a target find me all actives against that target. Find/predict polypharmacology of actives. Determine ADMET profile of actives.
     • 32, 13, 8: For a given interaction profile, give me compounds similar to it.
     • 37, 13, 8: The current Factor Xa lead series is characterised by substructure X. Retrieve all bioactivity data in serine protease assays for molecules that contain substructure X.
     • 38, 13, 8: Retrieve all experimental and clinical data for a given list of compounds defined by their chemical structure (with options to match stereochemistry or not).
     • 41, 13, 8: A project is considering Protein Kinase C Alpha (PRKCA) as a target. What are all the compounds known to modulate the target directly? What are the compounds that may modulate the target directly? i.e. return all compounds active in assays where the resolution is at least at the level of the target family (i.e. PKC) both from structured assay databases and the literature.
     • 44, 13, 8: Give me all active compounds on a given target with the relevant assay data
     • 46, 13, 8: Give me the compound(s) which hit most specifically the multiple targets in a given pathway (disease)
     • 59, 14, 8: Identify all known protein-protein interaction inhibitors
  6. Research question 15: All oxido-reductase inhibitors active < 100nM in both human and mouse. From Mabel Loza - USC team
  7. From Mabel Loza - USC team
  8. From Mabel Loza - USC team
  9. From Mabel Loza - USC team
  10. Research question 15: All oxido-reductase inhibitors active < 100nM in both human and mouse
     • ChEMBL: search target Oxidoreductase: 481 targets from different species
     • Selection of all the oxidoreductases and filtering of bioactivities with the criterion IC50 < 100 (no units could be selected): 11497 data points obtained
     • Table exported to an Excel spreadsheet and manually filtered
     From Mabel Loza - USC team
  11. 5 people working 6 hours
  12. Problem: Data Integration. [Diagram labels: Queries; Data Warehouse (Extract, Transform, Load); Mediator (Query Reformulation); Data Sources]
  13. Using the Power of Open PHACTS, London, 22-23 April 2013. [Diagram labels: Core Platform; Applications; Identity Resolution Service; Identifier Management Service; example identifiers “Adenosine receptor 2a”, P12374, EC2.43.4, CS4532; Linked Data API (RDF/XML, TTL, JSON); Semantic Workflow Engine; Chemistry Registration, Normalisation & Q/C; Data Cache (Virtuoso Triple Store); VoID; Nanopub; Db; Public Ontologies; Domain Specific Services; Public Content; Commercial; User Annotations; index]
  14. Open PHACTS Explorer
  15. Open PHACTS Explorer ?
  16. Credits: Curt Tilmes, Peter Fox. Tilmes, C.; Fox, P.; Ma, X.; McGuinness, D.L.; Privette, A.P.; Smith, A.; Waple, A.; Zednik, S.; Zheng, J.G., "Provenance Representation for the National Climate Assessment in the Global Change Information System," IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 11, pp. 5160-5168, Nov. 2013
  17. Problem: I don’t trust your assessment. What is it based on?
  18. Tension: Integrated & Summarized Data vs. Transparency & Trust
  19. Solution: Integrating and exposing provenance provided by multiple sources
  20.
  21. National Climate Change Assessment Provenance
  22. PROV: the database as a black box Q
  23. Goal: the capability to trace back, for each query result, the complete list of sources and how they were combined to deliver a result.
  24. Implement in a Graph Database at Scale. Marcin Wylot, Philippe Cudré-Mauroux, Exascale Lab, University of Fribourg
  25. TriplePROV [WWW2014]
  26. Provenance Polynomials
  27. Test on large messy data
     • Billion Triple Challenge
       – Crawled from the linked open data cloud
     • Web Data Commons
       – RDFa, Microdata extracted from common crawl
     • 115 million triples (25 GB)
     • 8 Queries defined for BTC
       – T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 627–640. ACM, 2009.
  28. External + Internal Provenance
     • Unified queries over external and database provenance
     • Adapting query results based on provenance
     • Performance improvements
  29. FUTURE
  30. 60% of time is spent on data preparation
  31. Big Data is often lots of small data
  32. Questions?
     • More info:
       – Paul Groth, "Transparency and Reliability in the Data Supply Chain," IEEE Internet Computing, vol. 17, no. 2, pp. 69-71, March-April 2013
       – Paul Groth, "The Knowledge-Remixing Bottleneck," IEEE Intelligent Systems, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013
       – Marcin Wylot, Philippe Cudré-Mauroux and Paul Groth, "TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store," WWW 2014
  33. Backup
  34. Hack SPARQL
  35. What’s the overhead? Setup: source and complete trace (i.e. triple level)
  36. Annotations: Propagate annotations through the query processing pipeline
  37. What’s the overhead?
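
The goal on slide 23 (tracing each query result back to its sources and how they were combined) and the annotation propagation on slide 36 can be illustrated with a small sketch. This is not the TripleProv implementation. It is a toy example that assumes each triple carries one source label, that a join multiplies provenance and that alternative derivations are added, in the spirit of the provenance polynomials named on slide 26. The data, the source names and the helpers `match`, `join` and `show` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, predicate, object, source).
QUADS = [
    ("aspirin", "inhibits", "COX1", "sourceA"),
    ("aspirin", "inhibits", "COX1", "sourceB"),
    ("COX1", "type", "Oxidoreductase", "sourceC"),
    ("ibuprofen", "inhibits", "COX2", "sourceA"),
    ("COX2", "type", "Oxidoreductase", "sourceC"),
]


def match(quads, predicate):
    """Evaluate one triple pattern and annotate each binding.

    Returns {(subject, object): set of monomials}. A monomial is a sorted
    tuple of sources used together; several monomials for one binding are
    alternative derivations (a sum). Using a set drops coefficients, which
    is a simplification made for this sketch.
    """
    out = defaultdict(set)
    for s, p, o, src in quads:
        if p == predicate:
            out[(s, o)].add((src,))
    return out


def join(left, right):
    """Join on left object == right subject, multiplying the polynomials."""
    out = defaultdict(set)
    for (s, mid), left_poly in left.items():
        for (subj, o), right_poly in right.items():
            if mid != subj:
                continue
            for lm in left_poly:
                for rm in right_poly:
                    # Product of two monomials: all sources used together.
                    out[(s, o)].add(tuple(sorted(lm + rm)))
    return out


def show(poly):
    """Render a polynomial with '*' for joint use and '+' for alternatives."""
    return " + ".join(sorted(" * ".join(m) for m in poly))


# "Which compounds inhibit a target that is typed as an oxidoreductase?"
results = join(match(QUADS, "inhibits"), match(QUADS, "type"))
for binding, poly in sorted(results.items()):
    print(binding, "<-", show(poly))
```

Each printed result carries the sources it depends on: a product means the sources were combined by the join, and a sum means the same answer can be derived in more than one way.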