SC1 - Hangout 2: The Open PHACTS pilot

Big Data Europe is an EU-funded Horizon 2020 project that undertakes the foundational work for enabling European companies to build innovative multilingual products and services based on semantically interoperable, large-scale, multilingual data assets and knowledge, available under a variety of licenses and business models.

The Open PHACTS Discovery Platform is bringing together pharmacological data resources in an integrated, interoperable infrastructure, and has been developed to reduce barriers to drug discovery in industry, academia and for small businesses.

The first round of pilots for the Big Data Europe project is about to enter the evaluation phase. This also holds for the Societal Challenge 1: Health. For this challenge the Open PHACTS foundation, Manchester University and the VU Amsterdam are working on the Open PHACTS docker and its integration with the Big Data Europe infrastructure.
This presentation will give you:

- a general overview of the infrastructure and the status of the generic components that are being developed
- an outline of the Societal Challenge and the rationale for the pilot
- a look into the future pilot options

The intended audience is people who are acquainted with basic development tools like Docker and GitHub and have an interest in Big Data and Drug Discovery.

Published in: Science

  1. 1. BIG DATA EUROPE H2020 CSA (2015-17) SOCIETAL CHALLENGE “HEALTH” Integrating Big Data, Software & Communities for Addressing Europe’s Societal Challenges 06.07.2016
  2. 2. BigDataEurope 6-Jul-16 Today: • Short overview of Big Data Europe (Ronald Siebes) • What is Open PHACTS? (Stian Soiland-Reyes, Bryn Williams-Jones) • The Big Data Europe infrastructure (Erika Pauwels, Aad Versteden) • Pilot 1: The Open PHACTS docker (Stian Soiland-Reyes) • Q&A. Speakers: Stian Soiland-Reyes (BioExcel and University of Manchester); Ronald Siebes (VU Amsterdam); Erika Pauwels (Tenforce); Aad Versteden (Tenforce); Bryn Williams-Jones (Open PHACTS Foundation)
  3. 3. Big Data Europe 6-Jul-16
  4. 4. 6-Jul-16 www.big-data-europe.eu
  5. 5. Partners : 6-Jul-16
  6. 6. Q&A 6-Jul-16 www.big-data-europe.eu
  7. 7. Open PHACTS Architecture and Docker install. Stian Soiland-Reyes, University of Manchester, http://orcid.org/0000-0001-9842-9718, @soilandreyes. Big Data Europe Webinar, 2016-07-06. This work is licensed under a Creative Commons Attribution 4.0 International License. This work has been done as part of the BioExcel CoE (www.bioexcel.eu), a project funded by the EC H2020 program, contract number EINFRA-5-2015 675728. https://slides.com/soilandreyes/2016-07-06-openphacts 1
  8. 8. http://www.openphacts.org/ Bringing together pharmacological data resources in an integrated, interoperable infrastructure. Data sources are integrated and linked together so that you can easily see the relationships between compounds, targets, pathways, diseases and tissues: ChEBI, ChEMBL, ChemSpider, ConceptWiki, DisGeNET, DrugBank, FAERS, Gene Ontology, neXtProt, SureChEMBL, UniProt, WikiPathways 2 . 1
  9. 9. Data integration https://www.openphacts.org/2/sci/data.html 2 . 2
  10. 10. https://dev.openphacts.org/docs/2.1 Re-exposed as public API 2 . 3
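
The public API is called with plain HTTP GET requests in which the concept URI is passed as a percent-encoded query parameter, as the `_about` URL in the sample response on the next slide shows. A minimal sketch of building such a request URL follows. The base URL is copied from that sample; the `app_id`, `app_key` and `_format` parameter names are assumptions and should be checked against the API documentation for the version you call.

```python
from urllib.parse import urlencode

# Base URL copied from the "_about" field of the sample response in the slides.
API_BASE = "https://beta.openphacts.org/1.5"


def compound_url(concept_uri, app_id=None, app_key=None, fmt="json"):
    """Build a compound-information request URL for one concept URI.

    app_id, app_key and _format are assumed parameter names, not confirmed
    by the slides.
    """
    params = {"uri": concept_uri}
    if app_id is not None:
        params["app_id"] = app_id
    if app_key is not None:
        params["app_key"] = app_key
    params["_format"] = fmt
    # urlencode percent-encodes ":" and "/" inside the URI value.
    return f"{API_BASE}/compound?{urlencode(params)}"


# ConceptWiki URI for Sorafenib, as used in the slides.
SORAFENIB = "http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5"
print(compound_url(SORAFENIB))
```
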
  11. 11. { "format": "linked-data-api", "version": "1.5", "result": { "_about": "https://beta.openphacts.org/1.5/compound?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconce "definition": "https://beta.openphacts.org/api-config", "extendedMetadataVersion": "https://beta.openphacts.org/1.5/compound?uri=http%3A%2F%2Fwww.concep "linkPredicate": "http://www.w3.org/2004/02/skos/core#exactMatch", "activeLens": "Default", "primaryTopic": { "_about": "http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5", "inDataset": "http://www.conceptwiki.org", "exactMatch": [ { "_about": "http://bio2rdf.org/drugbank:DB00398", "description_en": "Sorafenib (rINN), marketed as Nexavar by Bayer, is a drug approved for th "description": "Sorafenib (rINN), marketed as Nexavar by Bayer, is a drug approved for the t "drugType_en": [ "investigational", "approved" ], "drugType": [ "investigational", "approved" ], "genericName_en": "Sorafenib", "genericName": "Sorafenib", "metabolism_en": "Sorafenib is metabolized primarily in the liver, undergoing oxidative meta "metabolism": "Sorafenib is metabolized primarily in the liver, undergoing oxidative metabol "proteinBinding_en": "99.5% bound to plasma proteins.", "proteinBinding": "99.5% bound to plasma proteins.", "toxicity_en": "The highest dose of sorafenib studied clinically is 800 mg twice daily. The "toxicity": "The highest dose of sorafenib studied clinically is 800 mg twice daily. The adv "inDataset": "http://www.openphacts.org/bio2rdf/drugbank", 2 . 4
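
The JSON form nests the per-dataset records under `result.primaryTopic.exactMatch`. The sketch below shows one way a client might index those records by source dataset. The sample document is hand-trimmed from the slide (long text fields omitted), and the handling of a single non-list match is an assumption rather than documented API behaviour.

```python
import json

# Hand-trimmed from the response on the slide; long text fields are omitted.
SAMPLE = """
{
  "format": "linked-data-api",
  "version": "1.5",
  "result": {
    "activeLens": "Default",
    "primaryTopic": {
      "_about": "http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5",
      "inDataset": "http://www.conceptwiki.org",
      "exactMatch": [
        {
          "_about": "http://bio2rdf.org/drugbank:DB00398",
          "genericName": "Sorafenib",
          "drugType": ["investigational", "approved"],
          "proteinBinding": "99.5% bound to plasma proteins.",
          "inDataset": "http://www.openphacts.org/bio2rdf/drugbank"
        }
      ]
    }
  }
}
"""


def matches_by_dataset(response):
    """Map each dataset URI to the exactMatch record that came from it."""
    topic = response["result"]["primaryTopic"]
    matches = topic.get("exactMatch", [])
    # Assumption: a lone match may arrive as a bare object or URI string.
    if isinstance(matches, (dict, str)):
        matches = [matches]
    return {
        m["inDataset"]: m
        for m in matches
        if isinstance(m, dict) and "inDataset" in m
    }


found = matches_by_dataset(json.loads(SAMPLE))
for dataset, record in found.items():
    print(dataset, record.get("genericName"))
```
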
  12. 12. <?xml version="1.0" encoding="utf-8"?> <result format="linked-data-api" version="1.5" href="https://beta.openphacts.org/1.5/compound?uri= <primaryTopic href="http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5"> <prefLabel xml:lang="en">Sorafenib</prefLabel> <exactMatch> <item href="http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL1336"> <type href="http://rdf.ebi.ac.uk/terms/chembl#SmallMolecule"/> <inDataset href="http://www.ebi.ac.uk/chembl"/> <mw_freebase datatype="double">464.82</mw_freebase> </item> <item href="http://ops.rsc.org/OPS379634"> <smiles>CNC(=O)C1=NC=CC(=C1)OC2=CC=C(C=C2)NC(=O)NC3=CC(=C(C=C3)Cl)C(F)(F)F</smiles> <rtb datatype="double">5.0</rtb> <ro5_violations datatype="double">1.0</ro5_violations> <psa datatype="double">92.35</psa> <molweight datatype="double">464.825</molweight> <molformula>C21H16ClF3N4O3</molformula> <logp datatype="double">5.158</logp> <inchikey>MLDQJTXFUGDVEO-UHFFFAOYSA-N</inchikey> <inchi>InChI=1S/C21H16ClF3N4O3/c1-26-19(30)18-11-15(8-9-27-18)32-14-5-2-12(3-6-14)28-20(31 <hbd datatype="double">3.0</hbd> <hba datatype="double">7.0</hba> <inDataset href="http://ops.rsc.org"/> </item> <item href="http://aers.data2semantics.org/resource/drug/NEXAVAR"> <prefLabel>NEXAVAR</prefLabel> <reportedAdverseEvent> <item href="http://aers.data2semantics.org/resource/diagnosis/HEAD_INJURY"> <prefLabel>HEAD INJURY</prefLabel> <inDataset href="http://aers.data2semantics.org/"/> </item> <item href="http://aers.data2semantics.org/resource/diagnosis/SUPRAVENTRICULAR_TACHYCARD <prefLabel>SUPRAVENTRICULAR TACHYCARDIA</prefLabel> <inDataset href="http://aers.data2semantics.org/"/> 2 . 5
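
The XML form carries the same data, with numeric values marked by a `datatype="double"` attribute. As a rough illustration, the sketch below collects those numeric properties with Python's standard `xml.etree.ElementTree`. The sample is hand-trimmed from the slide to a well-formed fragment, and the XML declaration is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed, well-formed fragment of the response shown on the slide.
SAMPLE = """
<result format="linked-data-api" version="1.5">
  <primaryTopic href="http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5">
    <prefLabel xml:lang="en">Sorafenib</prefLabel>
    <exactMatch>
      <item href="http://ops.rsc.org/OPS379634">
        <molweight datatype="double">464.825</molweight>
        <logp datatype="double">5.158</logp>
        <hbd datatype="double">3.0</hbd>
        <hba datatype="double">7.0</hba>
        <molformula>C21H16ClF3N4O3</molformula>
        <inDataset href="http://ops.rsc.org"/>
      </item>
    </exactMatch>
  </primaryTopic>
</result>
"""


def numeric_properties(xml_text):
    """Collect every child of an <item> that is typed as a double."""
    root = ET.fromstring(xml_text)
    props = {}
    for item in root.iter("item"):
        for child in item:
            if child.get("datatype") == "double":
                props[child.tag] = float(child.text)
    return props


PROPS = numeric_properties(SAMPLE)
print(PROPS)
```
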
  13. 13. @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix skos: <http://www.w3.org/2004/02/skos/core#> . @prefix void: <http://rdfs.org/ns/void#> . @prefix foaf: <http://xmlns.com/foaf/0.1/> . @prefix ns0: <http://www.openphacts.org/api#> . @prefix ns1: <http://bio2rdf.org/> . @prefix ns2: <http://rdf.ebi.ac.uk/terms/chembl#> . @prefix chembl1336: <http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL1336#> . @prefix linked-data: <http://purl.org/linked-data/api/vocab#> . @prefix msg0: <http://www.openphacts.org/api/> . <http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5> skos:exactMatch <http://aers.data2semantics.org/resource/drug/NEXAVAR> ; skos:exactMatch <http://aers.data2semantics.org/resource/drug/SORAFENIB> ; skos:exactMatch <http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7b skos:exactMatch <http://bio2rdf.org/drugbank:DB00398> ; skos:exactMatch <http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL1336> ; skos:exactMatch <http://ops.rsc.org/OPS379634> ; skos:prefLabel "Sorafenib"@en ; void:inDataset <http://www.conceptwiki.org> ; foaf:isPrimaryTopicOf <https://beta.openphacts.org/1.5/compound?uri=http%3A%2F%2Fwww.conc <https://beta.openphacts.org/1.5/compound?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconcept%2F3893255 foaf:primaryTopic <http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7 linked-data:definition <https://beta.openphacts.org/api-config> ; msg0:activeLens "Default" ; void:linkPredicate skos:exactMatch ; linked-data:extendedMetadataVersion <https://beta.openphacts.org/1.5/compound?uri=http%3A <http://ops.rsc.org/OPS379634> void:inDataset <http://ops.rsc.org> ; ns0:smiles "CNC(=O)C1=NC=CC(=C1)OC2=CC=C(C=C2)NC(=O)NC3=CC(=C(C=C3)Cl)C(F)(F)F" ; ns0:inchi "InChI=1S/C21H16ClF3N4O3/c1-26-19(30)18-11-15(8-9-27-18)32-14-5-2-12(3 ns0:inchikey "MLDQJTXFUGDVEO-UHFFFAOYSA-N" ; 2 . 6
  14. 14. explorer.openphacts.org 3
  15. 15. Architecture 4 . 1
  16. 16. API architecture 4 . 2
  17. 17. Chemical Structure Search RDF/SPARQL (Virtuoso) Identity Mapping Service Identity Resolution Service (ConceptWiki) Chembl, Uniprot, ... Data loading 4 . 3
  18. 18. SC1 Health Webinar Technical overview 6 July 2016
  19. 19. Platform goals ◎ Low total cost of ownership ◎ Simple to get started with Big Data ◎ Cater for widely varying use cases ◎ Embrace emerging Big Data technologies ◎ Simple integration with custom components
  20. 20. Key actors
  21. 21. Big Data is ◎ Volume o Quantity of data ◎ Velocity o Speed at which data is provided ◎ Variety o Different formats/models in which data is provided ◎ Veracity o Accuracy/truthfulness of the data Why did we need all this?
  22. 22. Platform architecture
  23. 23. Platform architecture
  24. 24. Platform architecture
  25. 25. Semantic Big Data ongoing research! ◎ Semantic Data Lake o from data swamp to data lake o query contents in the data lake ◎ SANSA stack o Big Data analytics on semantic graph
  26. 26. Support layer ◎ Swarm UI o Launch, install and manage pipelines ◎ Pipeline daemon & monitor o Determine the order in which steps are executed o e.g. upload files before running computations ◎ Integrator UI o Present dashboards in a unified interface
  27. 27. Platform architecture
  28. 28. Key actors
  29. 29. Platform installation
  30. 30. Platform installation ◎ Manual installation guide ◎ Using Docker Machine o On local machine (VirtualBox) o In the cloud (AWS, DigitalOcean, Azure) o Bare metal
  31. 31. Platform development
  32. 32. ◎ High level picture o docker-compose.yml describes pipeline topology ◎ Common components o extend template image with your code ◎ New components o build a Docker image for your component o this is your own little Virtual Machine for your component ◎ Sharing o publish topology as git repository o publish new components on docker hub Platform development
  33. 33. Platform development
  34. 34. Deployment
  35. 35. Swarm UI
  36. 36. Swarm UI
  37. 37. Deployment
  38. 38. Swarm UI
  39. 39. Swarm UI
  40. 40. Integrator UI
  41. 41. Workflow UI
  42. 42. More monitoring This topic is ongoing, many interesting options ◎ Visualise logs with Kibana? ◎ Combine logs for large overview? ◎ Monitor node load? ◎ Provide autoscheduling?
  43. 43. Concluding remarks ◎ Used in practice ◎ Easy to get started ◎ Improving as we speak
  44. 44. You can talk to us! ◎ Aad Versteden aad.versteden@tenforce.com ◎ Erika Pauwels erika.pauwels@tenforce.com
  45. 45. Linux Container technology: a light-weight "virtual" virtual machine. A container is started from an image. Images are downloaded from Docker Hub. Dockerfile: layer-based recipe. Philosophy: one service, one image → microservices. Cloud's best friend: scalable, reproducible, customizable. https://www.docker.com/ 5 . 1
  46. 46. https://hub.docker.com/r/openphacts/ 5 . 2
  47. 47. [Diagram of the ops-docker containers] Services: ops-explorer (:3001), ops-api (:3002), ops-virtuoso (:3003), ops-ims (:3004), ops-memcached, ops-mysql. Data containers: ops-virtuosodata, ops-mysqldata. Staging containers: ops-virtuosostaging, ops-mysqlstaging (data from https://data.openphacts.org/). Images: https://hub.docker.com/. Source: ops-docker, https://github.com/openphacts/ops-docker/ 5 . 3
  48. 48. Docker Compose https://www.docker.com/products/docker-compose Which images to download Which data volumes to use Which network ports are exposed How are containers linked How to start/stop the containers $ docker-compose up -d 5 . 4
  49. 49. docker-compose.yml

      # Open PHACTS platform
      # Docker Compose configuration
      explorer:
        image: openphacts/explorer2
        ports:
          - "3001:3000"
        links:
          - api
        environment:
          - API_URL=http://localhost:3002
        #restart: always
      api:
        image: openphacts/ops-linkeddataapi
        ports:
          - "3002:80"
        links:
          - ims
          - memcached
          - virtuoso:sparql
      # SPARQL server
      virtuoso:
        build: virtuoso-ops
        ports:
          - "3003:8890"
        volumes_from:
          - virtuosodata
      virtuosodata:
        image: busybox
        volumes:
          - /virtuoso
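
The `links` and `volumes_from` entries in the compose file define which containers depend on which others. Docker Compose resolves this itself; the sketch below only illustrates the idea by computing a valid start order from the dependencies read off the excerpt. The dictionary is transcribed by hand, and `ims` and `memcached` appear only as link targets because their definitions are not part of the excerpt.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each service maps to the services it names under "links" or "volumes_from"
# in the docker-compose.yml excerpt (transcribed by hand).
DEPENDS_ON = {
    "explorer": ["api"],
    "api": ["ims", "memcached", "virtuoso"],
    "virtuoso": ["virtuosodata"],
}


def start_order(depends_on):
    """Return service names so that every dependency precedes its dependants."""
    return list(TopologicalSorter(depends_on).static_order())


ORDER = start_order(DEPENDS_ON)
print(ORDER)
```

Any order that places `virtuosodata` before `virtuoso`, and `virtuoso`, `ims` and `memcached` before `api`, is acceptable, so the exact output can vary between equally valid orderings.
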
  50. 50. Data staging 6 . 1
  51. 51. Docker and data? Docker Hub maximum image size: 10 GB. Open PHACTS data (compressed): ~30 GB. Open PHACTS data (installed): ~200 GB. Solution: added staging Docker containers that download the data from https://data.openphacts.org/, verify its consistency, and import it into Virtuoso and MySQL. 6 . 2
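
The slides do not say how the staging containers verify consistency. One common approach, shown here purely as an assumed illustration, is to compare a checksum of the downloaded file against a published value. The file name and payload below are made up for the demonstration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so that multi-gigabyte dumps fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_consistent(path, expected_hex):
    """True when the file's SHA-256 matches the expected hex digest."""
    return sha256_of(path) == expected_hex.lower()


# Demonstration on a throwaway file (hypothetical name and content).
with tempfile.TemporaryDirectory() as tmp:
    dump = Path(tmp) / "dataset.bin"
    dump.write_bytes(b"example payload")
    expected = hashlib.sha256(b"example payload").hexdigest()
    DEMO_OK = is_consistent(dump, expected)
    DEMO_BAD = is_consistent(dump, "0" * 64)

print(DEMO_OK, DEMO_BAD)
```
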
  52. 52. https://data.openphacts.org/ 6 . 3
  53. 53. https://data.openphacts.org/ data.openphacts.org RDF datasets RDF linksets VoID metadata/provenance mySQL-imported linksets Virtuoso-imported datasets → Maven repository release data as software →Research Objects propagate metadata 6 . 4
  54. 54. Try it! 7 . 1
  55. 55. What do I need? https://github.com/openphacts/ops-docker Hardware requirements: 150 GB of disk space (ideal: 250 GB), 16 GB of RAM (ideal: 128 GB), 4 CPU cores (ideal: 8 cores). Prerequisites: recent x64 Linux (Ubuntu 14.04 LTS, CentOS 7), fast Internet connection, Docker, Docker Compose. 7 . 2
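
Before starting a download of this size it can help to check the host against the minimum figures above. The sketch below is a hypothetical pre-flight check, not part of ops-docker. It probes free disk space and CPU count with the standard library, and leaves RAM to be supplied by the caller because the standard library has no portable way to read it.

```python
import os
import shutil

# Minimum figures from the hardware requirements on the slide.
MINIMUM = {"disk_gb": 150, "ram_gb": 16, "cpu_cores": 4}


def shortfalls(available, required=MINIMUM):
    """List every measured resource that falls below its minimum.

    Resources missing from `available` are treated as unknown and skipped.
    """
    problems = []
    for name, need in required.items():
        if name in available and available[name] < need:
            problems.append(f"{name}: have {available[name]:g}, need {need}")
    return problems


def probe(path="."):
    """Measure free disk space and CPU cores; RAM is not probed here."""
    return {
        "disk_gb": shutil.disk_usage(path).free / 1e9,
        "cpu_cores": os.cpu_count() or 0,
    }


print(shortfalls(probe()) or "minimum requirements met (RAM not checked)")
```
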
  56. 56. https://github.com/openphacts/ops-docker Follow the GitHub tutorial exactly, customize later Install latest Docker and Docker Compose Just testing on Windows or OS X? .. modify Docker's Linux VM to have enough disk and memory Firewall? Different settings depend on your firewall details. Don't worry - Docker is containerized! ..you won't break your machine Don't jump ahead.. 7 . 3
  57. 57. https://github.com/openphacts/ops-docker Get the software

      curl -L https://github.com/openphacts/ops-docker/archive/master.tar.gz | tar xzv
      cd ops-docker-master
      sudo docker-compose pull
  58. 58. https://github.com/openphacts/ops-docker Get the data

      $ sudo docker-compose up --no-recreate -d mysqlstaging virtuosostaging
      $ sudo docker-compose logs mysqlstaging virtuosostaging
      ops-mysqlstaging | mySQL staging finished
      ops-mysqlstaging exited with code 0
      ops-virtuosostaging | 09:13:35 --> Backup file # 675 [0x3F02-0x74-0x8A]
      ops-virtuosostaging | 09:13:36 --> Backup file # 676 [0x3F02-0x74-0x8A]
      ops-virtuosostaging | 09:13:37 End of restoring from backup, 6751701 pages
      ops-virtuosostaging | 09:13:37 Server exiting
      ops-virtuosostaging | Loading completed
      ops-virtuosostaging exited with code 0
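
Staging is complete once both staging containers report `exited with code 0` in the log. A small sketch of checking for that automatically follows. It assumes only the log line format visible on the slide; the sample log is abridged from it.

```python
import re

# Matches lines such as "ops-mysqlstaging exited with code 0".
EXIT_LINE = re.compile(r"^(?P<name>\S+) exited with code (?P<code>-?\d+)$")

# Abridged from the log output shown on the slide.
LOG = """\
ops-mysqlstaging | mySQL staging finished
ops-mysqlstaging exited with code 0
ops-virtuosostaging | 09:13:37 End of restoring from backup, 6751701 pages
ops-virtuosostaging | 09:13:37 Server exiting
ops-virtuosostaging | Loading completed
ops-virtuosostaging exited with code 0
"""


def exit_codes(log_text):
    """Map container name to the exit code reported in the log."""
    codes = {}
    for line in log_text.splitlines():
        match = EXIT_LINE.match(line.strip())
        if match:
            codes[match["name"]] = int(match["code"])
    return codes


def staging_done(log_text, containers=("ops-mysqlstaging", "ops-virtuosostaging")):
    """True only when every staging container has exited with code 0."""
    codes = exit_codes(log_text)
    return all(codes.get(name) == 0 for name in containers)


print(exit_codes(LOG), staging_done(LOG))
```
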
  59. 59. https://github.com/openphacts/ops-docker Start the services

      $ sudo docker-compose up --no-recreate -d
      $ sudo docker-compose logs --tail=5
  60. 60. Using the services 8 . 1
  61. 61. http://localhost:3001/ Explorer 8 . 2
  62. 62. http://localhost:3002/ API 8 . 3
  63. 63. http://localhost:3003/ SPARQL 8 . 4
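
With the services running, the local Virtuoso instance can be queried over the standard SPARQL protocol, which passes the query text in a `query` parameter. The sketch below only builds the request URL. The `/sparql` path is Virtuoso's usual endpoint and is assumed here rather than taken from the slides, and the query itself is a generic placeholder.

```python
from urllib.parse import urlencode

# Port 3003 is the Virtuoso port mapped in the slides; the /sparql path is
# Virtuoso's usual default and is an assumption here.
SPARQL_ENDPOINT = "http://localhost:3003/sparql"


def sparql_url(query, endpoint=SPARQL_ENDPOINT):
    """Build a SPARQL protocol GET URL with the query percent-encoded."""
    return f"{endpoint}?{urlencode({'query': query})}"


# Generic placeholder query: fetch any five triples.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"
URL = sparql_url(QUERY)
print(URL)
```
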
  64. 64. http://localhost:3004/QueryExpander Identity Mapping 8 . 5
  65. 65. What's next? 9 . 1
  66. 66. Custom data staging Different Open PHACTS 2.1 licensing options: Non-Commercial users: Everything Commercial users: No DrugBank, partial SureChEMBL Open PHACTS members: Full SureChEMBL 9 . 2
  67. 67. Microservices per dataset Most queries have separate fragments per dataset ..which could be executed on separate microservices Better cloud scalability Easier to test upgrades of individual datasets But still need "API" layer to do Identity Mapping and selecting datasets to query 9 . 3
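
The proposed split could look roughly like the sketch below: an API layer resolves the identity mapping, selects the datasets, sends one query fragment to each per-dataset service in parallel, and collects the results. Everything here is hypothetical. `run_fragment` is a stand-in for a real service call, and the mapping simply reuses the Sorafenib `skos:exactMatch` URIs from the earlier slides.

```python
from concurrent.futures import ThreadPoolExecutor

# Identity mapping for Sorafenib, copied from the skos:exactMatch links
# shown earlier in the slides. The short dataset keys are made up.
IDENTITY_MAP = {
    "drugbank": "http://bio2rdf.org/drugbank:DB00398",
    "chembl": "http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL1336",
    "ops.rsc.org": "http://ops.rsc.org/OPS379634",
}


def run_fragment(dataset, uri):
    """Hypothetical stand-in for a call to one per-dataset microservice."""
    return {"dataset": dataset, "uri": uri}


def query_all(identity_map, datasets):
    """Fan one query out to the selected datasets and gather the results."""
    selected = [(d, identity_map[d]) for d in datasets if d in identity_map]
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_fragment, d, u) for d, u in selected]
        # Results are collected in submission order, not completion order.
        return [future.result() for future in futures]


RESULTS = query_all(IDENTITY_MAP, ["chembl", "drugbank", "unknown"])
print(RESULTS)
```
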
  68. 68. BioExcel Workflow blocks BioExcel approach: Spin up virtual machine when an Open PHACTS workflow is started Workflow bound dynamically to VM instance(s) Scalability (exclusive access) Reproducibility (independent/fixed OPS install) Tool descriptions - exposed in bio.tools 9 . 4
  69. 69. Customization Make it easier to add third-party data: datasets, linksets, queries, API calls ..so pharma industry can mix in their in-house data .. so academics can upgrade and expand datasets More tooling, more documentation, or more training? 9 . 5
  70. 70. Feedback https://github.com/openphacts/ops-docker/issues http://support.openphacts.org/ http://ask.bioexcel.eu/ https://data.openphacts.org/ 10
