Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

BDE-SC1 Webinar: OpenPHACTS Re-engineered with Big Data Europe


Published on

Watch this webinar on YouTube:

Slides for the latest update on our Big Data Europe pilot in Societal Challenge 1: Health, Demographic Change and Wellbeing.

Last year we successfully completed the first phase of this pilot, replicating the functionality of the Open PHACTS Discovery Platform on the BDE infrastructure. The Open PHACTS Discovery Platform brings together pharmacological data resources in an integrated, interoperable infrastructure, and has been developed to reduce barriers to drug discovery for industry, academia, and small businesses.

Learn more about the progress we’ve made, and what’s coming next.

1. General overview of the Big Data Europe project and Societal Challenges it addresses (Ronald Siebes, VU Amsterdam)

2. The Big Data Europe infrastructure, generic components that are being developed, and their flexibility for different applications (Hajira Jabeen, University of Bonn)

3. Latest details of the current state of the Open PHACTS architecture in BDE, and ongoing work (Nick Lynch, CTO, Open PHACTS Foundation)

Published in: Health & Medicine
  • Be the first to comment

  • Be the first to like this

BDE-SC1 Webinar: OpenPHACTS Re-engineered with Big Data Europe

  1. 1. BIG DATA EUROPE H2020 CSA (2015-17) SC1 – HEALTH CHALLENGE WEBINAR Integrating Big Data, Software & Communities for Addressing Europe’s Societal ChallengesApril 4th 2017 Kiera McNeice, Ronald Siebes, Hajira Jabeen and Nick Lynch
  2. 2. BigDataEurope The 7 Societal Challenges and their first pilots
  3. 3. SC1: Life Sciences & Health SC1: Life Sciences & Health
  4. 4. SC1: Life Sciences & Health
  5. 5. SC1: Life Sciences & Health
  6. 6. SC2: Food & Agriculture SC2: Food & Agriculture
  7. 7. SC2: Food & Agriculture Partners: FAO, the largest autonomous agency within the United Nations system and one of the main players in the agricultural information community. Big Data Focus area: Large-scale distributed agricultural data integration Selected Key Data assets: INFOODS, AQUASTAT Green Learning Network (GLN), Agricultural Bibliography Network (ABN), AgroVoc, AquaMaps, Fishbase Semantic Web Company (SWC) is a technology provider headquartered in Vienna (Austria). SWC supports organizations from all industrial sectors worldwide to improve their information management. Their core product is to extract meaning from big data by making use of linked data technologies. Agroknow is a company that captures, organizes and adds value to the rich information available in agricultural and food sciences, in order to make it universally accessible, useful and meaningful.
  8. 8. SC2: Food & Agriculture Pilot focus area: Viticulture (from the Latin word for vine) is the science, production, and study of grapes. It deals with the series of events that occur in the vineyard.
  9. 9. SC2: Food & Agriculture Pilot 2: Support advanced crop data discovery, processing, combining and visualization from distributed and heterogeneous data repositories Vine and Wine sector: emerging market in EU Sustainability and biodiversity challenges: local varieties are being lost Exploitation of new grapevine varieties and clones in terms of climate change adaptation Quality and health status of viticultural products Contribution to human health (antioxidants, prevention of heart diseases etc.) Wide variety of heterogeneous (and big) data from various information sources Reasons:
  10. 10. SC3: Energy SC3: Energy
  11. 11. SC3: Energy Partners: A public entity supervised by the Ministry of Environment, Energy and Climate Change in Greece, founded in September 1987, active in the fields of Renewable Energy Sources (RES), Rational Use of Energy (RUE) and Energy Saving (ES). Big Data Focus area: Real-time turbine monitoring stream processing and analytics Selected Key Data assets: European Energy Exchange Data, smart meter sensor data, gas/fuels market/price data, consumption statistics, stratigraphic model data (geology, geophysics) NCSR "Demokritos", the largest multidisciplinary research centre of Greece hosts significant scientific research, technological development and educational activities, coordinated by eight Institutes.
  12. 12. SC3: Energy Pilot focus area: System monitoring in energy production units.
  13. 13. SC3: Energy Pilot 3: Operation, maintenance and production forecasting for wind turbines on real-time sensor data. Current technology is not able to deal with full amount of available valuable data Economic benefit of predicting output and prevention of damage (if one can predict one part about to fail it can be prevented that other parts get damaged) Large continuous stream of sensor data, perfect to test our platform Reasons:
  14. 14. SC4: Transport SC4: Transport
  15. 15. SC4: Transport Partners: The Fraunhofer Society is a German research organization with 67 institutes spread throughout Germany, each focusing on different fields of applied science. Big Data Focus area: Streaming sensor network & geo-spatial data integration Selected Key Data assets: GTFS data, OSM/LinkedGeoData, MobilityMaps, Transport sensor data, ROSATTE Road safety attributes, European Road Data Infrastructure - EuroRoadS The Centre for Research and Technology-Hellas (CERTH) founded in 2000 is one of the leading research centres in Greece. CERTH includes the Hellenic Institute of Transport (HIT): Land, Sea and Air Transportation as well as Sustainable Mobility services ERTICO - ITS Europe is a partnership of around 100 companies and institutions involved in the production of Intelligent Transport Systems (ITS).
  16. 16. SC4: Transport Pilot focus area: Info mobility and traffic planning
  17. 17. SC4: Transport Pilot 4: Multisource data collection for the provision of accurate info- mobility and advanced transport planning service in Thessaloniki, Greece Congestion is a major problem in Europe, especially in urban areas. utilizing real-time probe data for the provision of accurate info-mobility services and advanced transport planning, leads to better decisions The use of mobility data coming from multiple sources presents significant challenges, especially due to the different nature of the datasets both in content and spatio-temporal terms as well as due to the fact that the data should be collected and processed in real time. Reasons:
  18. 18. SC5: Climate SC5: Climate
  19. 19. SC5: Climate Partners: A public entity supervised by the Ministry of Environment, Energy and Climate Change in Greece, founded in September 1987, active in the fields of Renewable Energy Sources (RES), Rational Use of Energy (RUE) and Energy Saving (ES). Big Data Focus area: Enormous simulation time. Extremely complicated computing model. Selected Key Data assets: European Grid Infrastructure (EGI). Access to several data centres hosted at CNRS-Lyon, NCSR-D Athens, INFN-Milan, NIKhEF-Amsterdam. NCSR "Demokritos", the largest multidisciplinary research centre of Greece hosts significant scientific research, technological development and educational activities, coordinated by eight Institutes.
  20. 20. SC5: Climate Pilot focus area: Supporting data-intensive climate research
  21. 21. SC5: Climate Pilot 5: Downscaling, and retrieval process on (raw) climate data via User-defined parameters (e.g. geographical areas, time period, physical variables, computational grids, time steps) The provision of Climate model data satisfies an important objective, that of assessing the potential impacts of climate change on well being for adaptation, prevention and mitigation measures and supporting other policy making decisions. The awareness led to the availability of huge datasets Downscaling is a computational intensive process Reasons:
  22. 22. SC6: Social Sciences SC6: Social Sciences
  23. 23. SC6: Social Sciences Partners: CESSDA provides large scale, integrated and sustainable data services to the social sciences. CESSDA is organised as a limited company under Norwegian law owned and financed by the individual EU member states’ ministry of research or a delegated institution. Big Data Focus area: Statistical and research data linking & integration Selected Key Data assets: Federated social sciences data catalogs, statistical data from public data portals and statistical offices (e.g. EuroStats, UNESCO, WorldBank) NCSR "Demokritos", the largest multidisciplinary research centre of Greece hosts significant scientific research, technological development and educational activities, coordinated by eight Institutes.
  24. 24. SC6: Social Sciences Pilot focus area: Citizens budget spending on municipal level
  25. 25. SC6: Social Sciences Pilot 6: Citizens budget in municipal level Budget: the most important document of public policy Budget execution affects everyday lives Citizens are more involved in city level Having a platform that integrates heterogeneous budget data (many municipality have their own data formats) and calculates infographics would benefit the citizens, the research community and policy makers Reasons:
  26. 26. SC7: Security SC7: Security
  27. 27. SC7: Security Partners: The Centre supports the decision making of the European Union in the field of the Common Foreign and Security Policy (CFSP), by providing products and services resulting from the exploitation of relevant space assets and collateral data, including satellite imagery and aerial imagery, and related services. NCSR "Demokritos", the largest multidisciplinary research centre of Greece hosts significant scientific research, technological development and educational activities, coordinated by eight Institutes.
  28. 28. SC7: Security Big Data Focus area: Image data analysis Selected Key Data assets: Earth Observation data (e.g. Very High Resolution Satellite Imagery acquired from commercial providers and governmental systems) and collateral data for supporting CFSP/CSDP missions and operations
  29. 29. SC7: Security Pilot focus area: Getting insight in man-made surface changes triggered by automatic detection, news, or social media information
  30. 30. SC7: Security Pilot 7: Ingestion of remote sensing images and social sensing data to detect and verify man-made changes on the Earth surface for security applications Evacuation route planning Monitoring of critical infrastructures Border security Satellite image data is HUGE and computational intensive to compare Smart ‘focus’ algorithms are needed to prioritize the analysis jobs Reasons:
  31. 31. Big Data Europe Integrator Platform Dr Hajira Jabeen, University of Bonn SC1 Webinar
  32. 32. Platform Goals ◎Opensource ◎Simple to get started with Big Data ◎Support a variety of use cases ◎Embrace emerging Big Data technologies ◎Simple integration with custom components
  33. 33. Key actors
  34. 34. Platform Architecture 4
  35. 35. 5 Platform Architecture
  36. 36. Platform Architecture 6
  37. 37. Platform Architecture Support Layer Init Daemon GUIs Monitor App Layer Traffic Forecast Satellite Image Analysis Platform Layer Spark Flink Semantic Layer Ontario SANSA Semagrow Kafka Real-time Stream Monitoring ... ... Resource Management Layer (Swarm) Hardware Layer Premises Cloud (AWS, GCE, MS Azure, …) Data Layer Hadoop NOSQL Store CassandraElasticsearch ...RDF Store
  38. 38. Supported Frameworks Search/indexing Data processing Apache Solr Apache Spark Data acquisition Apache Flink Apache Flume Semantic Components Message passing Strabon Apache Kafka Sextant Data storage GeoTriples Hue Silk Apache Cassandra SEMAGROW ScyllaDB LIMES Apache Hive 4Store Postgis OpenLink Virtuoso 8
  39. 39. BDI Stack Lifecycle
  40. 40. BDI Stack Lifecycle Deploy BDE Platform/Stack to the Cluster
  41. 41. BDI Stack Lifecycle Stack/Cluster Monitor
  42. 42. BDI Stack Lifecycle Developing Custom Applications
  43. 43. BDI Stack Lifecycle Docker Images
  44. 44. BDI Stack Lifecycle BDI Stack (workflow) builder
  45. 45. BDI Stack Lifecycle Custom Components *Init Daemon *Integrator UI
  46. 46. ◎ High level picture o docker-compose.yml describes pipeline topology ◎ BDE provided components o extend template image with your code ◎ New components o build a Docker image for your component o this is your own little Virtual Machine for your component ◎ Sharing o publish topology as git repository o publish new components on docker hub Platform development
  47. 47. Actors ◎Cluster Setup ◎Developer ◎Packaging ◎Stack Composition / Integration ◎Deployment ◎Monitoring 17
  48. 48. Platform installation ◎Manual installation guide ◎Using Docker Machine o On local machine (VirtualBox) o In cloud (AWS, DigitalOcean, Azure) o Bare metal ◎Screencasts 18
  49. 49. Development ◎Base Docker images o Serve as a template for a (Big Data) technology o Easily extendable custom algorithm/data ◎Published components o Image repositories on GitHub o Automated builds on DockerHub o Documentation on BDE Wiki 19
  50. 50. Deploying a Big Data Stack ◎ Stack o collection of communicating components o to solve a specific problem ◎ Described in Docker Compose o Component configuration o Application topology 20
  51. 51. Enhancing the Component ◎ Orchestrator required for initialization process (init_daemon) o Components may depend on each other o Components may require manual intervention ◎ User Interface Integration o Standard Interfaces from components o Combine and align the interfaces 21
  52. 52. User Interfaces ◎Target: Facilitate use of the platform o User Interface Adaption ◎Available interfaces o Workflow UIs ❖ Workflow Builder ❖ Workflow Monitor o Swarm UI o Integrator UI 22
  53. 53. BDE Workflow Builder 23
  54. 54. BDE Workflow Monitor 24
  55. 55. Swarm UI
  56. 56. Swarm UI 26
  57. 57. Integrator UI 27
  58. 58. Beyond the state of the art ... Smart Big Data Increase the value of Big Data by adding meaning to it! 28
  59. 59. Semantic Data Lake (Ontario) ◎Data Swamp o Repository of data in its raw format o Structured, semi-structured, unstructured o Schema-less ◎Data Lake o Add a Semantic layer on top of the source datasets o The data is semantically lifted using existing 29
  60. 60. 31 SANSA Stack
  61. 61. Thank you 32
  62. 62. 33
  63. 63. BDE vs Hadoop distributions Hortonworks Cloudera MapR Bigtop BDE File System HDFS HDFS NFS HDFS HDFS Installation Native Native Native Native lightweight virtualization Plug & play components (no rigid schema) no no no no yes High Availability Single failure recovery (yarn) Single failure recovery (yarn) Self healing, mult. failure rec. Single failure recovery (yarn) Multiple Failure recovery Cost Commercial Commercial Commercial Free Free Scaling Freemium Freemium Freemium Free Free Addition of custom components Not easy No No No Yes Integration testing yes yes yes yes -- Operating systems Linux Linux Linux Linux All Management tool Ambari Cloudera manager MapR Control system - Docker swarm UI+ Custom 34
  64. 64. BDE vs Hadoop distributions ◎BDE is not built on top of existing distributions ◎Targets o Communities o Research institutions ◎Bridges scientists and open data ◎Multi Tier research efforts towards Smart Data 35
  65. 65. Stian Soiland-Reyes, University of Manchester Nick Lynch, CTO Open PHACTS Foundation 4 Apr 2017 Stian Soiland-Reyes, University of Manchester Nick Lynch, CTO Open PHACTS Foundation 4 Apr 2017
  66. 66. Summary 3 • Update on Docker and Open PHACTS • Learnings & transition to AWS • Next Steps & Future Releases
  67. 67. Open PHACTS @dockerhub 14
  68. 68. Open PHACTS Next Steps 34 • Data Refresh planned API 2.2: –Phase 1: ChEMBL, WikiPathways, Uniprot + Chemistry Refreshed (RDF and linksets) –Phases 2 & 3: Remaining data sources –Build data refresh processes • Wider Architecture Review • Science and Open PHACTS Webinar –Science and Open PHACTS: Workflow tools for Life Science Research – 3420450817
  69. 69. Open PHACTS 35 • Custom Data Staging: –Different licensing options to cover Annotated SureChEMBL for members/non members • MicroServices? –Part of Architecture review to discuss future services/API –Interested in experiences of this • Workflow –BioExcel Workflow blocks in development –See