Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

BDE SC3.3 Workshop - BDE Platform: Technical overview


Published on

BDE Platform: Technical overview (Mr. Ivan Ermilov, University of Leipzig) at the BigDataEurope Workshop, Amsterdam, Novermber 2017

Published in: Technology
  • Be the first to comment

  • Be the first to like this

BDE SC3.3 Workshop - BDE Platform: Technical overview

  1. 1. BIG DATA EUROPE PLATFORM 3rd BDE Workshop for Energy in WindEurope Conference in Amsterdam28 November 2017 Architecture, Components, How to Use, and Use Cases Ivan Ermilov University of Leipzig/InfAI
  2. 2. Talk outline  The Big Data Integrator platform o Technical Architecture o Components & Interfaces  How To: Installation and Usage  Use Cases 2
  3. 3. BigDataEurope: Summary Horizon2020 project 17 partners 7 pilots in various domains > 30 Big Data components > 350 stars on github 3
  4. 4. Big Data Integrator
  5. 5. Platform Goals  Open source  Simple to get started with Big Data  Support a variety of use cases  Embrace emerging Big Data technologies  Simple integration with custom components 5
  6. 6. 6
  7. 7. Architecture  Big Data Integrator (BDI): o The prototype developed by BDE  Main points of the architecture o Dockerization o Support layer, including integrated UI o Semantification layer 7
  8. 8. Docker containers  Docker offers lightweight virtualization o Docker containers can be shared to be provisioned on different Linux variations and versions  Identical base sys not required  All BDI components: Docker containers 8
  9. 9. Platform Architecture 9
  10. 10. Platform Architecture 10
  11. 11. Platform Architecture 11
  12. 12. Example: Distributed Triplestore 09 October 2017  docker-compose up  docker stack deploy 13
  13. 13. Example: Distributed Triplestore 09 October 2017  Demo
  14. 14. BDI components  Processing and storage components o Re-used existing Docker containers where available o Dockerized by BDE where not o Ensured all can be provisioned through Docker Swarm  Components by BDE: o Support Layer o Semantic Layer 14
  15. 15. BDE Docker Containers 15
  16. 16. Extraction & Processing 16 Apache Flume Collection and aggregation of data streams Apache Kafka Distributed streaming platform (i.e. messaging) Apache Spark MapReduce framework Apache Flink MapReduce framework Apache Hadoop MapReduce framework Fox RDF extraction from text Silk Link discovery LIMES Link discovery
  17. 17. Data Storage 17 Apache Hadoop Distributed raw data storage Apache Hive SQL over Hadoop Apache HBase Distributed SQL store Postgresql SQL database MongoDB NoSQL Document database Apache Cassandra Distributed database Virtuoso Triplestore Strabon Spatiotemporal triplestore Ontario Semantic Data Lake 4store Triplestore
  18. 18. Data Indexing & Coordination 18 Apache Solr Data index based on Apache Lucene Elasticsearch Distributed search and analytics engine Apache Zookeeper Distributed key/value store Semagrow SPARQL federation engine
  19. 19. Visualization & User Interfaces 19 Apache Zeppelin Interactive notebooks for Spark Spark Notebook Interactive notebooks for Spark Cloudera Hue (FB) Web based filebrowser for HDFS BDI IDE BDE stack of interfaces for Swarm UI, application interfaces etc. Kibana Visualization for Elasticsearch
  20. 20. Big Data Stack  Stack o Collection of components o Solving specific problem  One (or several) docker-compose files o Include all component configuration (12 factor apps) o Communications and application topology 09 October 2017
  21. 21. Example: Interactive Spark  Demo 09 October 2017
  22. 22. BDI Stack Lifecycle App templates Ready-made components, best practices BDI Stack Builder BDI Support Components, Instructions Swarm UI, docker-compose Pipeline/Logging Monitor 22
  23. 23. Platform installation  Manual installation guide Using Docker Machine o On local machine (VirtualBox) o In cloud (AWS, DigitalOcean, Azure) o Bare metal Screencasts 23
  24. 24. BigDataEurope Use Cases24
  25. 25. SC1: Pharmacology research Life Sciences & Health • Extensive toolset developed by OPF and others • Query a large number of datasets, some large • Existing elaborate ingestion and homogenization by the OpenPHACTS Foundation 25
  26. 26. SC1 Pilot: Points Demonstrated Life Sciences & Health • Existing distributed, scalable solution • Based on Virtuoso • Proprietary distributed triple store • Porting to BDI gives flexibility • Using Virtuoso or open source alternatives (in BDE, 4store) without development effort for the superstructure and tools around it 26
  27. 27. SC2: Viticulture resources Food and Agriculture • AgInfra is a major infrastructure for agriculture researchers, serving cross-linked bibliography, data, and processing services • Pilot automates ingestion and thematic classification of publication full texts 27
  28. 28. SC2 Pilot: Points Demonstrated Food and Agriculture • AgInfra: Existing infrastructure for data, metadata, and related services • BDI is deployed as an external infrastructure for processing text (viticulture publications) • Allows storing and processing text at a larger scale than AgInfra can currently manage • The bibliographic metadata is added to AgInfra 28
  29. 29. SC3: Predictive maintenance Energy • Wind turbine monitoring applies computational models to sensor data streams • Models are weekly re- parameterized using week’s data from multiple turbines 29
  30. 30. SC3 Pilot: Points Demonstrated Energy • Existing in-house non-scalable solution for model parameterization • Reliable Fortran software for data analysis • Efficient, but not scalable to data volume • Developing a BDI orchestrator • Re-uses existing software unmodified • Makes it easy to apply in parallel to many datasets and manage the outputs 30
  31. 31. SC4: Traffic conditions estimation Transport • Estimation of real-time traffic conditions in Thessaloniki • Combines: • Traffic modelling from historical data • Current measurements from a taxi fleet of 1200 vehicles 31
  32. 32. SC4 Pilot: Points Demonstrated Transport • New Flink implementations of map matching and traffic prediction algorithms • BDI provides access to varied data sources • PostGIS database with city map • ElasticSearch database of historical data • Kafka stream of real-time data 32
  33. 33. SC4: Architecture 09 October 2017 23
  34. 34. SC5: Climate modelling Climate • Discovering and re-using previously computed derivatives • Lineage annotation: datasets and model parameters used to compute derivative datasets • Finding appropriate past runs avoids repeating weeks-long modelling runs • Preparing modelling experiments • Slicing, transforming, combining datasets into new datasets • Submission to and retrieval from modelling infrastructure 34
  35. 35. SC5 Pilot: Points Demonstrated Climate • Existing infrastructure and stable, reliable software for parallel computation of models • BDI is deployed as an external infrastructure for preparing and managing datasets • BDI offers: • Hive for managing data in a way that can be retrieved and manipulated, rather than file blocks • Cassandra stores structured and textual metadata for searching headers and lineage 35
  36. 36. SC6: Municipality budgets Social Sciences • Ingestion of budget and budget execution data • Multiple municipalities in varied formats and data models • Homogenized data made available for analysis and comparison 36
  37. 37. SC6 Pilot: Points Demonstrated Social Sciences • Existing analytics and visualization tools • Use SPARQL queries to retrieve only the relevant slices of the overall data • BDI is deployed as an ingestion and storage infrastructure for external tools • Ingests and homogenizes a constant flow of JSON, CSV, XML, and other formats following various data models • Exposes data as SPARQL endpoint serving homogenized data, stored in 4store, a scalable, distributed RDF store 37
  38. 38. SC6: Architecture 09 October 2017 25
  39. 39. SC7: Change detection & verification Secure Societies • Events are extracted from text published by news agencies and on social networking sites • Events are geo-located and relevant changes are detected by comparing current and previous satellite images 39
  40. 40. SC7 Pilot: Points Demonstrated Secure Societies • Re-implementation of change detection algorithms for Spark • Parallel orchestrator for text analytics • Re-uses existing software • Scales to many input streams • BDI provides: • Cassandra for text content and metadata • Strabon GIS store for detected change location • Homogeneous access to both for analysis and visualization 40
  41. 41. SC7: Architecture 09 October 2017 27
  42. 42. Closing Remarks42
  43. 43. BDI use cases demonstrated  Flexibility to existing workflows o Drug discovery platform ported to new data storage infrastructure  Scaling out o Sensor data analysis software trivially distributed to multiple streams o Image analysis and traffic pattern algorithms in Spark and Flink  Preprocessing and ingestion o Distributed data architectures pre-process and maintain big data o Not needed simultaneously or reduced to small data 43
  44. 44. Semantic Web and Big Data  Adding triple stores to the Apache ecosystem o Scalability for Semantic Web technologies o Semantics stores for the Apache ecosystem  Advocating RDF technologies for describing processing and data services o The Semantic Data Lake vision builds an RDF layer over Swagger/OAI  Advocating RDF and SPARQL as the lingua franca for Big Data retrieval o Semagrow federations of heterogeneous internal and external data stores 44
  45. 45. Thank you for your attention! Questions?  BigDataEurope action Web site:  Big Data Integrator:  Semagrow:  SANSA: 45