ICWE2017 BigDataEurope


A paper about the BigDataEurope Project and Platform has been accepted for publication as a full-paper at ICWE2017 in Technical/Vision track.

  1. BigDataEurope - Supporting the Variety Dimension of Big Data. Mohamed Nadjib MAMI - Fraunhofer IAIS. ICWE17 - 06.06.2017
  2. Big Data Europe - the Project ◎ Funded by the EU Horizon 2020 programme ◎ Coordination & Support Action (CSA) project o Show the societal value of Big Data to 7 domains o Lower the barrier to using Big Data technologies => BigDataEurope Platform
  3. Consortium Partners ◎ Consortium of 17 partners o Industry, SMEs, universities, research institutes, etc.
  4. BDE - The Platform ◎ Integrator of Big Data technologies o Easy to use and get started with (plug-and-play) o Flexible, customisable ◎ Bundles only open-source solutions for: o Data Storage o Message Passing o Data Processing o Data Searching & Publishing ◎ Publicly released in May 2017
  5. BDE Platform - Components (some) ◎ Search/Indexing: Apache Solr, Elasticsearch ◎ Data Processing: Apache Spark, Apache Flink ◎ Data Acquisition: Apache Flume ◎ Message Passing: Apache Kafka ◎ Data Storage: Apache Hadoop, Apache Cassandra, Apache Hive, PostGIS ◎ Semantic Components: Strabon, Sextant, GeoTriples, Silk, SEMAGROW, LIMES, 4Store, OpenLink Virtuoso
  6. BDE Platform - Architecture ◎ Support Layer: Init Daemon, GUIs, Base Setup ◎ App Layer: Traffic Forecast, Satellite Image Analysis, Real-time Stream Monitoring, ... ◎ Platform Layer: Spark, Flink, Kafka, ... ◎ Semantic Layer: Ontario, SANSA, Semagrow ◎ Resource Management Layer (Swarm) ◎ Hardware Layer: Premises, Cloud (AWS, GCE, MS Azure, …) ◎ Data Layer: Hadoop, NoSQL Store, Cassandra, Elasticsearch, ..., RDF Store ◎ Semantic Data Lake (Unified View)
  7. BDE Platform - Hardware & Virtualization ◎ Docker used for packaging and deploying applications ◎ Based on containers: o A lightweight environment that lets a piece of software run in isolation ❖ Shares the host operating system kernel (unlike VMs) ❖ Reduces conflicts, e.g., between versions ◎ Docker Compose: creates multi-container applications
  8. BDE Platform - Resource Management ◎ Swarm (mode) used for managing, scheduling and orchestrating Docker containers in multi-node clusters ◎ It provides: o Scalability and fault tolerance o Container interlinking o Log-based monitoring ◎ Separates hardware management from software management ◎ Based on services o A service is the Swarm execution unit, running a Docker image
  9. BDE Platform - Support Layer ◎ Init Daemon: orchestrates the initialization process of the components (containers of Docker Compose): o Components report their initialization progress o It validates whether a specific component can start o It specifies the dependencies between services o It indicates where human interaction is required ◎ Examples: o Wait for data to load into HDFS before starting a Spark job o Wait for the Spark Master to start successfully before starting a Worker
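The dependency gating that the Init Daemon performs can be illustrated with a minimal sketch. This is not the BDE implementation: the class name `InitDaemon`, its methods, the state names and the step names are all hypothetical, chosen only to mirror the two examples on the slide (data loaded into HDFS before a Spark job, Spark Master before Worker).

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical state names, modelled on the "not started, running or finished"
# statuses mentioned for the pipeline monitor.
VALID_STATES = ("not_started", "running", "finished")


@dataclass
class InitDaemon:
    """Toy stand-in for an init daemon: tracks progress and gates start-up."""

    # step name -> steps that must be finished first (hypothetical format)
    dependencies: Dict[str, List[str]]
    status: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every step, and every step named as a dependency, begins as not_started.
        for step, deps in self.dependencies.items():
            for name in [step, *deps]:
                self.status.setdefault(name, "not_started")

    def can_start(self, step: str) -> bool:
        """A step may start only once all of its dependencies have finished."""
        return all(self.status[dep] == "finished"
                   for dep in self.dependencies.get(step, []))

    def report(self, step: str, state: str) -> None:
        """Components report their own progress; early starts are rejected."""
        if state not in VALID_STATES:
            raise ValueError(f"unknown state: {state}")
        if state == "running" and not self.can_start(step):
            raise RuntimeError(f"{step} is still waiting on its dependencies")
        self.status[step] = state


# Hypothetical pipeline mirroring the slide's examples.
pipeline = InitDaemon({
    "load_data_to_hdfs": [],
    "spark_master": [],
    "spark_worker": ["spark_master"],
    "spark_job": ["load_data_to_hdfs", "spark_worker"],
})
```

In this sketch a worker container would poll `can_start("spark_worker")` and only proceed once the master has reported `finished`; the real daemon presumably exposes this over a network API, which is not modelled here.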
  10. BDE Platform - User Interfaces ◎ Pipeline Builder: creates a step-by-step dependency pipeline (fed to the init daemon) [Figure: Component 1, Component 2, Component 3]
  11. BDE Platform - User Interfaces ◎ Pipeline Monitor: displays the status (not started, running or finished) of components in a running pipeline (retrieved from the init daemon) [Figure: Component 1 finished, Component 2 finished, Component 3 in progress]
  12. BDE Platform - User Interfaces ◎ Swarm UI: clones a Git repository containing a pipeline, then deploys, controls and monitors it on Swarm
  13. BDE Platform - User Interfaces ◎ Integrator UI: displays the dashboard of each running component in a unified interface
  14. BDE Platform - Semantic Layer > Ontario ◎ Data Lake or Swamp? o Repository of data in its original formats o Structured, semi-structured, unstructured o Without unified schema ◎ Semantic Data Lake (Ontario) o Add a Semantic Layer on top of the source datasets ❖ The data is semantically lifted using ontology terms ❖ Provide a uniform view over nonuniform data
  15. BDE Platform - Semantic Layer > Ontario ◎ Query “number of distinct publications and number of distinct deaths due to the disease Tuberculosis in India”:
       SELECT (COUNT(DISTINCT ?publication) AS ?no_of_publications)
              (COUNT(?deaths) AS ?no_of_deaths)
       WHERE {
         ?item a qb:Observation .
         ?item gho:Country ?country .
         ?item gho:Disease ?disease .
         ?item att:unitMeasure gho:Measure .
         ?item eg:incidence ?deaths .
         ?country rdfs:label "India" .
         ?disease rdfs:label "Tuberculosis" .
         ?trial a ct:trials .
         ?trial ct:condition ?condition .
         ?trial ct:location ?location .
         ?trial ct:reference ?publication .
         ?condition owl:sameAs ?disease .
         ?location redd:locatedIn ?country .
         ?publication ct:citation ?citation .
       }
  16. BDE Platform - Semantic Layer > Ontario ◎ Query: 1. Query Parsing & Validation 2. Planning 3. Meta-wrapper invocation ◎ Meta-wrappers: Publications, Trials, Conditions, Observations
  17. BDE Platform - Semantic Layer > Ontario ◎ Meta-wrappers (Publications, Observations, Trials, Conditions) sit on top of wrappers: Wrapper (XML), Wrapper (CSV), Wrapper (RDF) ◎ 4. Wrapper Selection & Query Translation o Mapping rules translate triple patterns, e.g. ?item gho:Country ?country . ?item gho:Disease ?disease . ... into SELECT country, disease, ... FROM Observations o Other targets: ... [XPath] ..., ... [SPARQL] ... ◎ 5. Query Execution
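The translation step on slide 17 (triple patterns to `SELECT country, disease, ... FROM Observations`) can be sketched as follows. This is only an illustration under stated assumptions, not Ontario's code: the function name `translate_to_sql`, the shape of the mapping rules (predicate to column name) and the restriction to star-shaped patterns over a single table are all hypothetical simplifications.

```python
from typing import Dict, List, Tuple

TriplePattern = Tuple[str, str, str]  # (subject, predicate, object)


def translate_to_sql(patterns: List[TriplePattern],
                     mapping: Dict[str, str],
                     table: str) -> str:
    """Translate star-shaped triple patterns into one SQL SELECT.

    Assumes every pattern shares the same subject variable and that
    `mapping` (a hypothetical rule format) maps each predicate to a column.
    """
    columns: List[str] = []
    conditions: List[str] = []
    for _subject, predicate, obj in patterns:
        column = mapping[predicate]
        if obj.startswith("?"):
            # A variable in object position becomes a projected column.
            columns.append(column)
        else:
            # A constant in object position becomes a filter; quotes are
            # doubled as a basic escape (a real system would bind parameters).
            escaped = obj.replace("'", "''")
            conditions.append(f"{column} = '{escaped}'")
    sql = f"SELECT {', '.join(columns) or '*'} FROM {table}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


# Hypothetical mapping rules for the Observations source.
observation_mapping = {"gho:Country": "country", "gho:Disease": "disease"}
observation_patterns = [
    ("?item", "gho:Country", "?country"),
    ("?item", "gho:Disease", "?disease"),
]
print(translate_to_sql(observation_patterns, observation_mapping, "Observations"))
# prints: SELECT country, disease FROM Observations
```

A wrapper for an XML or RDF source would apply the same idea with a different target language (XPath or SPARQL), which is why the mapping rules are kept separate from the translation logic.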
  18. BDE Platform - Semantic Layer > SANSA ◎ SANSA: a framework for distributed RDF data processing ◎ Read/write Layer: Read and write native RDF/OWL data in distributed storage, e.g., Hadoop, Spark (RDD, DataFrames, GraphX), Tensors, following different representations & partitioning schemes, e.g., graphs, tables ◎ Querying Layer: Query distributed RDF using SPARQL (SPARQL-to-SQL approaches, Virtual Views, Intelligent Indexing, ...)
  19. BDE Platform - Semantic Layer > SANSA ◎ Inference Layer: Derive new facts from existing ones, detect inconsistencies, extract new rules to help in reasoning ◎ Machine Learning Layer: Perform ML or analytics on RDF data to gain insights into relevant trends, make predictions or detect anomalies o Tensor Factorization for e.g. KB completion (testing stage) o Graph Clustering (testing stage) o Association rule mining (evaluation stage) o Semantic Decision trees (idea stage) o Inference in Knowledge Graph Embeddings (idea stage)
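What "derive new facts from existing ones" means can be shown with a small single-machine sketch of forward chaining. It applies one standard RDFS entailment rule (an instance of a class is also an instance of its superclasses) until nothing new appears. SANSA itself runs distributed on engines such as Spark; nothing below uses or reflects its API, and the `ex:` resources are made up for illustration.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def infer_types(triples: Set[Triple]) -> Set[Triple]:
    """Forward-chain the rule: (x rdf:type C), (C rdfs:subClassOf D) => (x rdf:type D)."""
    facts = set(triples)
    while True:
        derived = {
            (s, "rdf:type", superclass)
            for (s, p, cls) in facts if p == "rdf:type"
            for (sub, p2, superclass) in facts
            if p2 == "rdfs:subClassOf" and sub == cls
        }
        if derived <= facts:
            # Fixpoint reached: the rule produces nothing new.
            return facts
        facts |= derived


# Hypothetical toy knowledge base.
kb = {
    ("ex:tb", "rdf:type", "ex:InfectiousDisease"),
    ("ex:InfectiousDisease", "rdfs:subClassOf", "ex:Disease"),
    ("ex:Disease", "rdfs:subClassOf", "ex:Condition"),
}
closure = infer_types(kb)
```

Here `closure` additionally contains `ex:tb rdf:type ex:Disease` and `ex:tb rdf:type ex:Condition`. The nested loop is quadratic, which is fine for a toy; a distributed engine would express the same rule as a join over partitioned data.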
  20. BDE Platform - Semantic Layer > Semagrow ◎ Semagrow: a SPARQL query processing system that federates multiple remote endpoints ◎ Original Semagrow o Optimizes queries transparently o Executes sub-queries in the remote endpoints o Integrates results dynamically in heterogeneous data models o Joins the partial results into the final query answer ◎ Next-gen Semagrow o Supports different query languages o Query planner and execution engine adapted, e.g., to translate SPARQL to CQL for Cassandra databases
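The last step of federation, joining partial results from different endpoints into one answer, can be sketched as a hash join over variable bindings. This is an assumption-laden illustration and not Semagrow's engine: the function name `join_bindings`, the representation of a result row as a dictionary from variable to value, and the sample bindings are all hypothetical.

```python
from typing import Dict, List

Binding = Dict[str, str]  # variable name -> bound value


def join_bindings(left: List[Binding], right: List[Binding]) -> List[Binding]:
    """Hash-join two lists of solution bindings on their shared variables.

    Assumes every row within one list binds the same set of variables.
    With no shared variables the result is the cross product.
    """
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))

    # Build phase: index the right-hand rows by their shared-variable values.
    index: Dict[tuple, List[Binding]] = {}
    for row in right:
        index.setdefault(tuple(row[v] for v in shared), []).append(row)

    # Probe phase: merge each left-hand row with every compatible right-hand row.
    joined: List[Binding] = []
    for row in left:
        for match in index.get(tuple(row[v] for v in shared), []):
            joined.append({**row, **match})
    return joined


# Hypothetical partial results from two endpoints.
observations = [
    {"?disease": "ex:tb", "?item": "ex:obs1"},
    {"?disease": "ex:flu", "?item": "ex:obs2"},
]
trials = [
    {"?disease": "ex:tb", "?trial": "ex:trial1"},
    {"?disease": "ex:tb", "?trial": "ex:trial2"},
]
answer = join_bindings(observations, trials)
```

With these inputs `answer` holds two rows, both for `ex:tb`, one per trial. A real federated engine would also decide join order and push filters to the endpoints, which is the optimization work the slide attributes to the query planner.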
  21. BDE Showcases (pilots) ◎ 7 Societal Challenges > 7 pilot implementations o SC1 - Open PHACTS discovery platform relating to biological/medical questions o SC2 - Discovery and linking of viticulture-relevant information o SC3 - System monitoring in energy production units o SC4 - Short-term traffic flow forecasting o SC5 - Supporting data-intensive climate research o SC6 - Citizens & Researchers Budget on Municipal Level o SC7 - Ingestion of remote sensing images and social sensing data to detect and verify changes on the Earth's surface for security applications
  22. Showcase SC1: Health, demographic change and wellbeing ◎ SC1 implements the Open PHACTS Discovery Platform o Integrates and links data from multiple sources: ChEBI, ChEMBL, the Gene Ontology and UniProt (Chemistry, Biological, Medical, etc.) o Explores the relationships between data (compounds, targets, pathways, diseases and tissues) o Data accessed using RESTful API requests ❖ Translated to SPARQL queries ◎ Technologies used: o 4Store, Memcached, MySQL, Puelia, Swagger
  23. Showcase SC7: Secure Societies ◎ Detect changes in land cover in satellite images (e.g., monitoring critical infrastructures) ◎ Display geo-located events in news sites and social media (e.g., news articles, social networks) ◎ Three workflows: o Change detection workflow o Event detection workflow o Activation workflow ◎ Technologies used: Apache Spark, Cassandra, Sextant, Semagrow, Strabon, GeoTriples
  24. Showcase 2 (SC7): Secure Societies ◎ General Architecture of the SC7 Pilot [Figure]
  25. Showcase 2 (SC7): Secure Societies ◎ Change detection workflow [Figure: area and the time interval of interest, Satellite Images, Compare Images]
  26. Showcase 2 (SC7): Secure Societies ◎ Event detection workflow [Figure: Associate names with coordinates, Cluster news into events (associate geo-location)]
  27. Showcase 2 (SC7): Secure Societies ◎ Activation detection workflow [Figure: Areas with changes, Summary of events, Spatiotemporal RDF store]
  28. Showcase 2 (SC7): Secure Societies ◎ Refugee camps located in Zaatari, Jordan [Figure panels: News, Tweets, Selected Area, Detected changes]
  29. Thanks & Questions? For more info... ◎ Project-related: Simon Scerri ◎ Ontario: Mohamed Nadjib Mami ◎ SANSA: Jens Lehmann ◎ Semagrow: Stasinos Konstantopoulos ◎ Pilots (showcases): o SC1: Ronald Siebes o SC7: George Papadakis o All: Ronald Siebes
  30. BDE Platform vs. Hadoop Distributions [Comparison table] Legend: SFR = Single failure recovery, MFR = Multiple failure recovery, SF = Self healing