Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

523 views

Published on

Big Data Europe Platform

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data components : BDI technical overview

  1. 1. Big Data Europe : Empowering Communities with Data Technologies Dr Hajira Jabeen, University of Bonn Apache Big Data Europe - Seville
  2. 2. Structure ◎Evolution of BDE Architecture ◎BDE Stack ◎Beyond the State of the Art o Workflows (Support Layer) o User Interface o Data Lake (Ontario) o Semantics Analytics Stack 2
  3. 3. Architectural design 1
  4. 4. Architectural design 2 4
  5. 5. Architectural design 3 (released) 5
  6. 6. Architecture for SC 7 6 Stack
  7. 7. Supported Components Search/indexing Data processing Apache Solr Apache Spark Data acquisition Apache Flink Apache Flume Semantic Components Message passing Strabon Apache Kafka Sextant Data storage GeoTriples Hue Silk Apache Cassandra SEMAGROW ScyllaDB LIMES Apache Hive 4Store Postgis OpenLink Virtuoso 7
  8. 8. User profiles 8
  9. 9. Platform installation ◎Manual installation guide ◎Using Docker Machine o On local machine (VirtualBox) o In cloud (AWS, DigitalOcean, Azure) o Bare metal ◎Screencasts 9
  10. 10. Developing a component ◎Base Docker images o Serve as a template for a (Big Data) technology o Easily extendable custom algorithm/data ◎Published components o Responsibilities divided b/w partners o Image repositories on GitHub o Automated builds on DockerHub o Documentation on BDE Wiki 10
  11. 11. Deploying a Big Data Stack ◎Stack collection of communicating components to solve a specific problem ◎Described in Docker Compose o Component configuration o Application topology ◎Orchestrator required for initialization process o Components may depend on each other o Components may require manual intervention 11
  12. 12. User Interfaces ◎Target: Facilitate use of the platform o User Interface Adaption ◎Available interfaces o Workflow UIs ❖ Workflow Builder ❖ Workflow Monitor o Swarm UI o Integrator UI 12
  13. 13. BDE Workflow Builder 13
  14. 14. BDE Workflow Monitor 14
  15. 15. Swarm UI 15
  16. 16. Integrator UI 16
  17. 17. Beyond the State of the Art 17
  18. 18. Smart Big Data Increase Big Data value by adding meaning to it! 18
  19. 19. Orchestration ◎ Goal: flexible composition of Big Data pipelines ◎ Microservice cooperation using shared vocabulary ◎ Uses HTTP requests from a web based frontend ◎ Meta information stored in a triple store ◎ Microservice uses triple store as service backend 19
  20. 20. Semantic Data Lake (Ontario) ◎Repository of data in its raw format o Structured, semi-structured, unstructured ◎Schema-less o No schema is defined on write, it is defined only on read ◎Open to any kind of processing 20
  21. 21. Data Lake 21 Source: Big Data @ Microsoft - Raghu Ramakrishnan
  22. 22. Semantic Data Lake (Ontario) ◎Add a Semantic layer on top of the source datasets o Semantic data is handled as-is o Non-Semantic data is semantically lifted using existing ontology terms 22
  23. 23. Ontario: Architecture 23 Translate and execute Query via Source-specific Access Method Decompose to Source-specific Entities Decompose SPARQL Query
  24. 24. Semantic Analytics Stack (SANSA) 24
  25. 25. SANSA: Motivation ◎ Abundant machine readable structured information is available (e.g. in RDF) o Across SCs, e.g. Life Science Data (OpenPhacts) o General: DBpedia, Google knowledge graph o Social graphs: Facebook, Twitter ◎ Need for scalable querying, inference and machine learning o Link prediction o Knowledge base completion o Predictive analytics 25
  26. 26. SANSA Stack 27
  27. 27. SANSA: Read Write Layer ◎Ingest RDF and OWL data in different formats using Jena / OWL API style interfaces ◎Represent data in multiple formats (e.g. RDD, Data Frames, GraphX, Tensors) ◎Allow transformation among these formats ◎Compute dataset statistics and apply functions to URIs, literals, subjects, objects → Distributed LODStats 28
  28. 28. SANSA: Query Layer ◎To make generic queries efficient and fast using: o Intelligent indexing o Splitting strategies o Distributed Storage o Distributed/ Federated Querying ◎Early work in progress: query evaluation (SPARQL-to-SQL approaches, Virtual Views) ◎Provision of W3C SPARQL compliant 29
  29. 29. SANSA: Inference Layer ◎W3C Standards for Modelling: RDFS and OWL ◎Parallel in-memory inference via rule- based forward chaining ◎Beyond state of the art: dynamically build a rule dependency graph for a rule set ◎→ Adjustable performance levels 30
  30. 30. SANSA: ML Layer ◎Distributed Machine Learning (ML) algorithms that work on RDF data and make use of its structure / semantics ◎Work in Progress: o Tensor Factorization for e.g. KB completion (testing stage) o Simple spatiotemporal analytics (idea stage) o Graph Clustering (testing stage) o Association rule mining (evaluation stage) o Semantic Decision trees (idea stage) 31
  31. 31. Thank you 32 jabeen@iai.uni-bonn.de
  32. 32. 33
  33. 33. BDE vs Hadoop distributions Hortonworks Cloudera MapR Bigtop BDE File System HDFS HDFS NFS HDFS HDFS Installation Native Native Native Native lightweight virtualization Plug & play components (no rigid schema) no no no no yes High Availability Single failure recovery (yarn) Single failure recovery (yarn) Self healing, mult. failure rec. Single failure recovery (yarn) Multiple Failure recovery Cost Commercial Commercial Commercial Free Free Scaling Freemium Freemium Freemium Free Free Addition of custom components Not easy No No No Yes Integration testing yes yes yes yes -- Operating systems Linux Linux Linux Linux All Management tool Ambari Cloudera manager MapR Control system - Docker swarm UI+ Custom 34
  34. 34. BDE vs Hadoop distributions ◎BDE is not built on top of existing distributions ◎Targets o Communities o Research institutions ◎Bridges scientists and open data ◎Multi Tier research efforts towards Smart Data 35

×