Big Data Europe at eHealth Week 2017: Linking Big Data in Health

Published on

Of the four V's of big data – Volume, Velocity, Variety and Veracity – the most challenging for the health sector is Variety. Health data comes from many sources, formats and standards – how can we bring these together to reap the benefits of big data technologies?

Big Data Europe is tackling this challenge head-on, building a big data infrastructure flexible enough to address all seven Societal Challenges identified by Horizon 2020. Here we demonstrate our pilot implementation of Open PHACTS, which integrates life science data for drug discovery.

12 May 2017

Published in: Health & Medicine

  1. 1. LINKING BIG DATA IN HEALTH Open PHACTS in the Big Data Europe infrastructure Kiera McNeice, Open PHACTS Foundation, 12 May 2017
  2. 2. What is ‘Big Data’?  “extremely large data sets that may be analyzed computationally”  “data sets that are so large or complex that traditional data processing application software is inadequate to deal with them”
  3. 3. Big Data challenges: The four V’s Volume Velocity Veracity Variety
  4. 4. Big Data Europe Objectives  “Big Data Europe will undertake the foundational work for enabling European companies to build innovative multilingual products and services based on semantically interoperable, large-scale, multilingual data assets and knowledge, available under a variety of licenses and business models.”
  5. 5. Actual Big Data Europe Objectives  Build foundational Big Data infrastructure that: o Is open source o Makes it simple to get started with Big Data o Supports a variety of use cases o Embraces emerging Big Data technologies o Enables simple integration with custom components
  6. 6. The data analytics landscape
  7. 7. Key actors
  8. 8. Actual Big Data Europe Objectives  Build foundational Big Data infrastructure that: o Is open source o Makes it simple to get started with Big Data o Supports a variety of use cases o Embraces emerging Big Data technologies o Enables simple integration with custom components
  9. 9. BDE platform architecture  Modular and flexible
  10. 10. BDE platform architecture  Modular and flexible
  11. 11. Applications: The 7 Societal Challenges  Life Sciences and Health  Food and Agriculture  Energy  Transport  Climate  Social Sciences  Security
  12. 12. BDE Pilot Projects
  13. 13. SC1: Life Sciences and Health
  14. 14. SC1: Life Sciences and Health
  15. 15. SC1: Life Sciences and Health
  16. 16. SC2: Food and Agriculture
  17. 17. SC2: Food and Agriculture Partners: FAO, the largest autonomous agency within the United Nations system and one of the main players in the agricultural information community. Big Data Focus area: Large-scale distributed agricultural data integration Selected Key Data assets: INFOODS, AQUASTAT, Green Learning Network (GLN), Agricultural Bibliography Network (ABN), AgroVoc, AquaMaps, Fishbase Semantic Web Company (SWC) is a technology provider headquartered in Vienna (Austria). SWC supports organizations from all industrial sectors worldwide to improve their information management. Their core product is to extract meaning from big data by making use of linked data technologies. Agroknow is a company that captures, organizes and adds value to the rich information available in agricultural and food sciences, in order to make it universally accessible, useful and meaningful.
  18. 18. SC2: Food and Agriculture Pilot focus area: Viticulture (from the Latin word for vine) is the science, production, and study of grapes. It deals with the series of events that occur in the vineyard.
  19. 19. SC2: Food and Agriculture Pilot 2: Support advanced crop data discovery, processing, combining and visualization from distributed and heterogeneous data repositories Reasons:  Vine and Wine sector: emerging market in EU  Sustainability and biodiversity challenges: local varieties are being lost  Exploitation of new grapevine varieties and clones in terms of climate change adaptation  Quality and health status of viticultural products  Contribution to human health (antioxidants, prevention of heart diseases etc.)  Wide variety of heterogeneous (and big) data from various information sources
  20. 20. SC3: Energy
  21. 21. SC3: Energy Partners: Big Data Focus area: Real-time turbine monitoring stream processing and analytics Selected Key Data assets: European Energy Exchange Data, smart meter sensor data, gas/fuels market/price data, consumption statistics, stratigraphic model data (geology, geophysics) A public entity supervised by the Ministry of Environment, Energy and Climate Change in Greece, founded in September 1987, active in the fields of Renewable Energy Sources (RES), Rational Use of Energy (RUE) and Energy Saving (ES). NCSR "Demokritos", the largest multidisciplinary research centre of Greece hosts significant scientific research, technological development and educational activities, coordinated by eight Institutes.
  22. 22. SC3: Energy Pilot focus: System monitoring in energy production units
  23. 23. SC3: Energy Pilot 3: Operation, maintenance and production forecasting for wind turbines on real-time sensor data. Reasons:  Current technology cannot deal with the full amount of available valuable data  Economic benefit of predicting output and preventing damage (if the imminent failure of one part can be predicted, damage to other parts can be prevented)  Large continuous stream of sensor data, perfect to test our platform
  24. 24. SC4: Transport
  25. 25. SC4: Transport Partners: Big Data Focus area: Streaming sensor network & geo-spatial data integration Selected Key Data assets: GTFS data, OSM/LinkedGeoData, MobilityMaps, Transport sensor data, ROSATTE Road safety attributes, European Road Data Infrastructure - EuroRoadS The Fraunhofer Society is a German research organization with 67 institutes spread throughout Germany, each focusing on different fields of applied science. The Centre for Research and Technology-Hellas (CERTH) founded in 2000 is one of the leading research centres in Greece. CERTH includes the Hellenic Institute of Transport (HIT): Land, Sea and Air Transportation as well as Sustainable Mobility services ERTICO - ITS Europe is a partnership of around 100 companies and institutions involved in the production of Intelligent Transport Systems (ITS).
  26. 26. SC4: Transport Pilot focus: Information mobility and traffic planning
  27. 27. SC4: Transport Pilot 4: Multisource data collection for the provision of accurate info-mobility and advanced transport planning services in Thessaloniki, Greece Reasons:  Congestion is a major problem in Europe, especially in urban areas.  Utilising real-time probe data for the provision of accurate info-mobility services and advanced transport planning leads to better decisions  The use of mobility data coming from multiple sources presents significant challenges, especially because the datasets differ in both content and spatio-temporal terms, and because the data should be collected and processed in real time.
  28. 28. SC5: Climate
  29. 29. SC5: Climate Partners: Big Data Focus area: Enormous simulation time. Extremely complicated computing model. Selected Key Data assets: European Grid Infrastructure (EGI). Access to several data centres hosted at CNRS-Lyon, NCSR-D Athens, INFN-Milan, NIKhEF-Amsterdam. A public entity supervised by the Ministry of Environment, Energy and Climate Change in Greece, founded in September 1987, active in the fields of Renewable Energy Sources (RES), Rational Use of Energy (RUE) and Energy Saving (ES). NCSR "Demokritos", the largest multidisciplinary research centre of Greece hosts significant scientific research, technological development and educational activities, coordinated by eight Institutes.
  30. 30. SC5: Climate Pilot focus: Supporting data-intensive climate research
  31. 31. SC5: Climate Pilot 5: Downscaling and retrieval process on (raw) climate data via user-defined parameters (e.g. geographical areas, time period, physical variables, computational grids, time steps) Reasons:  The provision of climate model data satisfies an important objective, that of assessing the potential impacts of climate change on well-being for adaptation, prevention and mitigation measures and supporting other policy-making decisions.  The awareness led to the availability of huge datasets  Downscaling is a computationally intensive process
  32. 32. SC6: Social Sciences
  33. 33. SC6: Social Sciences Partners: Big Data Focus area: Statistical and research data linking & integration Selected Key Data assets: Federated social sciences data catalogs, statistical data from public data portals and statistical offices (e.g. Eurostat, UNESCO, World Bank) CESSDA provides large scale, integrated and sustainable data services to the social sciences. CESSDA is organised as a limited company under Norwegian law owned and financed by the individual EU member states’ ministry of research or a delegated institution. NCSR "Demokritos", the largest multidisciplinary research centre of Greece hosts significant scientific research, technological development and educational activities, coordinated by eight Institutes.
  34. 34. SC6: Social Sciences Pilot focus: Citizens budget spending on the municipal level
  35. 35. SC6: Social Sciences Pilot 6: Citizens budget on the municipal level Reasons:  Budget: the most important document of public policy  Budget execution affects everyday lives  Citizens are more involved at city level  Having a platform that integrates heterogeneous budget data (many municipalities have their own data formats) and calculates infographics would benefit the citizens, the research community and policy makers
  36. 36. SC7: Security
  37. 37. SC7: Security Partners: Big Data Focus area: Image data analysis Selected Key Data assets: Earth Observation data (e.g. Very High Resolution Satellite Imagery acquired from commercial providers and governmental systems) and collateral data for supporting CFSP/CSDP missions and operations NCSR "Demokritos", the largest multidisciplinary research centre of Greece hosts significant scientific research, technological development and educational activities, coordinated by eight Institutes. The Centre supports the decision making of the European Union in the field of the Common Foreign and Security Policy (CFSP), by providing products and services resulting from the exploitation of relevant space assets and collateral data, including satellite imagery and aerial imagery, and related services.
  38. 38. SC7: Security Pilot focus: Getting insight into man-made surface changes triggered by automatic detection, news, or social media information
  39. 39. SC7: Security Pilot 7: Ingestion of remote sensing images and social sensing data to detect and verify man-made changes on the Earth’s surface for security applications Reasons:  Evacuation route planning  Monitoring of critical infrastructures  Border security  Satellite image data is HUGE and computationally intensive to compare  Smart ‘focus’ algorithms are needed to prioritize the analysis jobs
  40. 40. Back to SC1…
  41. 41. The Open PHACTS Project
  42. 42. Drug discovery using public data Literature PubChem Genbank Patents Databases Downloads Data Integration Data Analysis Firewalled Databases
  43. 43. The situation in 2010… GSK Pfizer AstraZeneca Roche Novartis Merck-Serono Janssen
  44. 44. The Open PHACTS mission “Integrate multiple research biomedical data resources into a single, open and sustainable access point”
  45. 45. An IMI project (2011-2016)
  46. 46. Focus on researcher needs ChEMBL DrugBank Gene Ontology Wikipathways UniProt ChemSpider UMLS ConceptWiki ChEBI TrialTrove GVKBio GeneGo TR Integrity “Find me compounds that inhibit targets in NFkB pathway assayed in only functional assays with a potency <1 μM” “What is the selectivity profile of known p38 inhibitors?” “Let me compare MW, logP and PSA for known oxidoreductase inhibitors” DisGeNet neXtProt ChEMBL Target Class ENZYME FDA adverse events SureChEMBL
  47. 47. Ranked research questions
      Number | Sum | Nr of 1 | Question
      15 | 12 | 9 | All oxidoreductase inhibitors active <100nM in both human and mouse
      18 | 14 | 8 | Given compound X, what is its predicted secondary pharmacology? What are the on- and off-target safety concerns for a compound? What is the evidence and how reliable is that evidence (journal impact factor, KOL) for findings associated with a compound?
      24 | 13 | 8 | Given a target find me all actives against that target. Find/predict polypharmacology of actives. Determine ADMET profile of actives.
      32 | 13 | 8 | For a given interaction profile, give me compounds similar to it.
      37 | 13 | 8 | The current Factor Xa lead series is characterised by substructure X. Retrieve all bioactivity data in serine protease assays for molecules that contain substructure X.
      38 | 13 | 8 | Retrieve all experimental and clinical data for a given list of compounds defined by their chemical structure (with options to match stereochemistry or not).
      41 | 13 | 8 | A project is considering Protein Kinase C Alpha (PRKCA) as a target. What are all the compounds known to modulate the target directly? What are the compounds that may modulate the target directly? i.e. return all cmpds active in assays where the resolution is at least at the level of the target family (i.e. PKC) both from structured assay databases and the literature.
      44 | 13 | 8 | Give me all active compounds on a given target with the relevant assay data
      46 | 13 | 8 | Give me the compound(s) which hit most specifically the multiple targets in a
  48. 48. Connecting the Data
  49. 49. Challenges: Identifiers Andy Law’s third law:  The number of unique identifiers assigned to an individual is never less than the number of institutions involved in the study P12047 X31045 GB:29384 http://bioinformatics.roslin.ac.uk/lawslaws/
  50. 50. Challenges: Similarity  Q: Are these records the same? DrugBank ChemSpider PubChem  A: It depends on your task!
  51. 51. Everyone loves standards… …that’s why we have so many of them! https://xkcd.com/927/
  52. 52. Semantic linking (RDF) Link and store data as semantic “triples”: [Compound] acts on [Target] (Subject, Predicate, Object)
  53. 53. Semantic mappings Raw mappings: 25,087,328
  54. 54. Semantic mappings Computed mappings: 200,000,000+
  55. 55. The Open PHACTS Discovery Platform
  56. 56. Open PHACTS architecture Nanopub Db VoID Data Cache (Virtuoso Triple Store) Semantic Workflow Engine Linked Data API (RDF/XML, TTL, JSON) Domain Specific Services Identity Resolution Service Chemistry Registration Normalisation & Q/C Identifier Management Service Indexing Core Platform P12374 EC2.43.4 CS4532 “Adenosine receptor 2a” VoID Db Nanopub Db VoID Db VoID Nanopub VoID Public Content Commercial Public Ontologies User Annotations Apps
  57. 57. Quality Assurance Chemical Validation and Standardisation Platform (CVSP) developed by the Royal Society of Chemistry
  58. 58. The Platform
  59. 59. Using Open PHACTS
  60. 60. Accessing the data: API https://dev.openphacts.org/
  61. 61. Accessing the data: Workflow tools
  62. 62. Example workflow  Q10: For a given compound, summarise all similar compounds and their activities  CC1=C(C(C(=C(N1)C)C(=O)OC)C2=CC=CC=C2[N+](=O)[O-])C(=O)OC
  63. 63. Example workflow: KNIME
  64. 64. Example workflow: Heatmap
  65. 65. Benefits of Open PHACTS  Efficiency: Queries that once took days can now be done in less than an hour  Novelty: Semantically integrated databases allow for completely new ways of analysing the data  Cost: Sharing cost and effort in a precompetitive project saved “millions” “Integration of different databases is difficult, costly, and time consuming, and probably would not have been done at this level of quality without Open PHACTS.”
  66. 66. Open PHACTS in Big Data Europe
  67. 67. …so why rebuild it with BDE?  Integration into a wider platform  Flexibility, scalability, extensibility  Local installation of the entire Open PHACTS infrastructure!
  68. 68. Requirements Hardware:  150GB of disk space (ideal: 250GB)  16GB of RAM (ideal: 128GB)  4 CPU cores (ideal: 8 cores) Prerequisites:  Recent x64 Linux (Ubuntu 14.04 LTS, CentOS 7)  Docker and Docker Compose  Fast Internet connection https://github.com/openphacts/ops-docker https://data.openphacts.org/
  69. 69. Conclusions
  70. 70. Successes of Open PHACTS  Integrated a large variety of data sources using semantic web linking (RDF triples)  Project focussed on solving real, practical use cases (and succeeded!)  Re-building within the BDE Docker infrastructure allows for greater flexibility, local installation
  71. 71. What’s next?  Refresh of all data sources  Identify new data sources o What’s your big data problem in health?  BDE SC1 (Health) Workshop in autumn o Planned for eHealthTallinn 2017, 16-18 October http://sm.ee/en/ehealthtallinn-2017
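
The semantic linking idea on slide 52 (data stored as subject, predicate, object triples) can be illustrated with a minimal sketch. It is a toy in-memory store written for this transcript, not the Virtuoso triple store or the Open PHACTS API. The class names, the `acts_on` and `part_of` predicates, and all compound, target and pathway names are hypothetical placeholders.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


class TripleStore:
    """Tiny in-memory stand-in for a triple store (illustration only)."""

    def __init__(self) -> None:
        self._triples = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add(Triple(subject, predicate, obj))

    def match(self, subject: Optional[str] = None,
              predicate: Optional[str] = None,
              obj: Optional[str] = None) -> List[Triple]:
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        )


# Hypothetical placeholder names, not real database identifiers.
store = TripleStore()
store.add("compound:A", "acts_on", "target:X")
store.add("compound:B", "acts_on", "target:X")
store.add("compound:A", "acts_on", "target:Y")
store.add("target:X", "part_of", "pathway:P")

# "Which compounds act on target X?" expressed as a triple pattern.
compounds_on_x = [
    t.subject for t in store.match(predicate="acts_on", obj="target:X")
]
```

Because every fact has the same three-part shape, data from different sources can sit in one store and be queried with the same pattern matching, which is the property the slides rely on for integration.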

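The identifier challenge on slide 49 (one entity carrying several identifiers) can be sketched as grouping identifiers that are linked by pairwise mappings. This is only an illustration and not the Open PHACTS identity resolution service. It assumes every mapping is symmetric and transitive, which slide 50 warns is task-dependent. The pairing of the three slide-49 identifiers is taken from the slide's example, and the `EXAMPLE:` identifiers are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_identifiers(raw_mappings: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group identifiers connected by pairwise mappings (union-find)."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in raw_mappings:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for ident in list(parent):
        groups[find(ident)].add(ident)
    return sorted(sorted(members) for members in groups.values())


# Assumption: these pairs are asserted as "same entity" for this task.
raw = [
    ("P12047", "X31045"),
    ("X31045", "GB:29384"),
    ("EXAMPLE:1", "EXAMPLE:2"),  # invented identifiers
]

groups = group_identifiers(raw)

# Pairwise links implied by the groups: n * (n - 1) / 2 per group.
implied_pairs = sum(len(g) * (len(g) - 1) // 2 for g in groups)
```

Here three raw mappings imply four pairwise links, since `P12047` and `GB:29384` become connected through `X31045`. The same effect at scale is one plausible reason a set of computed mappings can be much larger than the raw set, though the slides do not state how the computed figure was derived.
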