BIG DATA EUROPE:
PILOTS AND TECHNOLOGIES
BDE SC7 Workshop, Brussels, 18 October 2016
Vangelis Karkaletsis, NCSR Demokritos
BDE Architecture
• Big Data Integrator (BDI):
o The prototype developed by BDE
• Main points of the architecture:
o Dockerization
o Support layer, including integrator UI
o Semantic layer
20-oct.-16www.big-data-europe.eu
BDI components
• Processing and storage components:
o Re-used existing Docker containers where available
o Dockerized by BDE where not
o Ensured all can be provisioned through Docker Swarm
• Components by BDE:
o Support Layer
o Semantic Layer
BDE Docker Containers
• Data serving: HDFS, Cassandra, 4store, PostGIS, Strabon, Elasticsearch, Hive, Semagrow
• Processing: Spark, Flink, SANSA
• Stream ingestion middleware: Flume, Kafka
BigDataEurope Pilots
SC1: Pharmacology research
Life Sciences & Health
• Query a large number of datasets, some of them large
• Existing elaborate ingestion and homogenization by the OpenPHACTS Foundation (OPF)
• Extensive toolset developed by OPF and others
SC1 Pilot: Points Demonstrated
Life Sciences & Health
• Existing distributed, scalable solution
o Based on Virtuoso, a proprietary distributed database
• Porting to BDI gives flexibility
o Use Virtuoso or any of a number of open source alternatives without development effort for the superstructure and tools around it
• Porting to BDI offers new functionalities
o Logging and system health monitoring
SC2: Viticulture resources
Food and Agriculture
• AgInfra is a major infrastructure for agriculture researchers, serving cross-linked bibliography, data, and processing services
• Pilot automates publication ingestion and thematic classification
SC2 Pilot: Points Demonstrated
Food and Agriculture
• AgInfra: existing infrastructure for data and the services that process it
• BDI is deployed as an external infrastructure for processing text (viticulture publications)
o Allows storing and processing text at a larger scale than AgInfra can currently manage
o Extracts (smaller) bibliographic metadata from (larger) full texts to be served by AgInfra
SC3: Predictive maintenance
Energy
• Wind turbine condition monitoring applies computational models to sensor data streams
• Models are re-parameterized weekly using that week’s data from multiple turbines
SC3 Pilot: Points Demonstrated
Energy
• Existing in-house, non-scalable solution for model parameterization
o Reliable Fortran software for data analysis
o Efficient, but does not scale with data volume
• Developing a BDI orchestrator
o Re-uses the existing software unmodified
o Makes it easy to apply the software in parallel to many datasets and manage the outputs
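The orchestrator idea above can be illustrated with a minimal Python sketch. It is not the BDI orchestrator itself: it only assumes that the existing analysis program is a command-line executable that takes a dataset name as its last argument. The names run_unmodified, orchestrate, and STAND_IN are hypothetical, and a small Python one-liner stands in for the Fortran binary.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_unmodified(command, dataset):
    """Run an existing executable on one dataset without changing the executable."""
    result = subprocess.run(command + [dataset], capture_output=True, text=True)
    return dataset, result.returncode, result.stdout.strip()


def orchestrate(command, datasets, max_workers=4):
    """Apply the same command to every dataset in parallel and collect the outputs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda d: run_unmodified(command, d), datasets)
        # Map each dataset name to (exit code, captured output).
        return {name: (code, out) for name, code, out in results}


# Hypothetical stand-in for the legacy analysis binary: it echoes its argument.
STAND_IN = [sys.executable, "-c", "import sys; print('fitted ' + sys.argv[1])"]

if __name__ == "__main__":
    print(orchestrate(STAND_IN, ["turbine_a", "turbine_b"]))
```

Threads are enough here because each worker only waits on an external process; the parallel work happens in the child processes.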
SC4: Traffic conditions estimation
Transport
• Estimation of real-time traffic conditions in Thessaloniki
• Combines:
o Traffic modelling from historical data
o Current measurements from a taxi fleet of 1200 vehicles
SC4 Pilot: Points Demonstrated
Transport
• New Flink implementations of map matching and traffic prediction algorithms
• BDI provides access to varied data sources
o PostGIS database with city map
o Elasticsearch database of historical data
o Kafka stream of real-time data
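Map matching, mentioned above, assigns each vehicle position to a road segment. The pilot's Flink implementation is not shown in these slides; the following plain-Python sketch only illustrates the simplest variant, snapping a point to the nearest segment. It assumes planar coordinates and a hypothetical segments dictionary of the form id -> (start, end).

```python
import math


def project_to_segment(p, a, b):
    """Return the point on segment a-b closest to p, and its distance to p."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        t = 0.0  # degenerate segment: a and b coincide
    else:
        # Clamp the projection parameter so the result stays on the segment.
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    qx, qy = ax + t * dx, ay + t * dy
    return (qx, qy), math.hypot(px - qx, py - qy)


def match_point(p, segments):
    """Snap p to the nearest road segment; returns (segment id, snapped point, distance)."""
    best = None
    for seg_id, (a, b) in segments.items():
        snapped, dist = project_to_segment(p, a, b)
        if best is None or dist < best[2]:
            best = (seg_id, snapped, dist)
    return best


if __name__ == "__main__":
    roads = {"main": ((0, 0), (10, 0)), "side": ((0, 5), (0, 15))}
    print(match_point((5, 1), roads))
```

Production map matchers usually also consider heading, speed, and the sequence of previous positions; nearest-segment snapping is only the starting point.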
SC5: Climate modelling
Climate
• Discovering and re-using previously computed derivatives
o Lineage annotation: datasets and model parameters used to compute derivative datasets
o Finding appropriate past runs avoids repeating weeks-long modelling runs
• Preparing modelling experiments
o Slicing, transforming, and combining datasets into new datasets
o Submission to and retrieval from the modelling infrastructure
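The lineage idea above (look up a past run by the datasets and parameters that produced it) can be sketched as a small in-memory index. This is an illustration only, not the pilot's implementation; LineageIndex, lineage_key, and the example dataset and parameter names are hypothetical.

```python
import hashlib
import json


def lineage_key(input_datasets, parameters):
    """Fingerprint a run by its inputs and parameters, independent of input order."""
    payload = json.dumps(
        {"inputs": sorted(input_datasets), "params": parameters}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class LineageIndex:
    """Maps (inputs, parameters) fingerprints to previously computed derivatives."""

    def __init__(self):
        self._runs = {}

    def record(self, input_datasets, parameters, derivative_id):
        self._runs[lineage_key(input_datasets, parameters)] = derivative_id

    def find(self, input_datasets, parameters):
        """Return the derivative computed from exactly these inputs, or None."""
        return self._runs.get(lineage_key(input_datasets, parameters))


if __name__ == "__main__":
    index = LineageIndex()
    index.record(["sst_2015", "wind_2015"], {"resolution_km": 12}, "run_0042")
    # Same inputs in a different order: the earlier run is found and can be re-used.
    print(index.find(["wind_2015", "sst_2015"], {"resolution_km": 12}))
```

An exact-match lookup is the simplest case; finding "appropriate" past runs may also involve looser matching on parameters, which this sketch does not attempt.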
SC5 Pilot: Points Demonstrated
Climate
• Existing infrastructure and stable, reliable software for parallel computation of models
• BDI is deployed as an external infrastructure for preparing and managing datasets
• BDI offers:
o Hive for managing data in a form that can be queried and manipulated, rather than as file blocks
o Cassandra for storing structured and textual metadata, used for searching headers and lineage
SC6: Municipality budgets
Social Sciences
• Ingestion of budget and budget execution data
• Multiple municipalities in varied formats and data models
• Homogenized data made available for analysis and comparison
SC6 Pilot: Points Demonstrated
Social Sciences
• Existing model, analytics and visualization tools
o Use SPARQL queries to retrieve only the relevant slices of the overall data
• BDI is deployed as an ingestion and storage infrastructure
o Ingests and homogenizes a constant flow of JSON, CSV, XML, and other formats following various data models
o Exposes a SPARQL endpoint serving the homogenized data, stored in 4store, a scalable, distributed RDF store
o Creates an online dashboard on economic data
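Homogenization, as described above, means mapping records that arrive in different formats and data models onto one common shape. The sketch below shows the idea for two sources in plain Python. It is not the pilot's pipeline: the input field names (category/amount in CSV, label/value in JSON) and the common record layout are assumptions made for illustration.

```python
import csv
import io
import json


def from_csv(text, municipality):
    """Map CSV rows with hypothetical 'category' and 'amount' columns to common records."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        {"municipality": municipality, "category": r["category"], "amount": float(r["amount"])}
        for r in rows
    ]


def from_json(text, municipality):
    """Map JSON items with hypothetical 'label' and 'value' keys to common records."""
    items = json.loads(text)
    return [
        {"municipality": municipality, "category": i["label"], "amount": float(i["value"])}
        for i in items
    ]


if __name__ == "__main__":
    records = from_csv("category,amount\nroads,1200.5\n", "A")
    records += from_json('[{"label": "roads", "value": 900}]', "B")
    # Both sources now share one layout and can be compared directly.
    print(records)
```

Once every source is reduced to the same record layout, the records can be converted to RDF and loaded into a store for querying; that step is outside this sketch.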
SC7: Change detection & verification
Secure Societies
• Events are extracted from text published by news agencies and on social networking sites
• Events are geo-located and relevant changes are detected by comparing current and previous satellite images
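Comparing a current and a previous image can be illustrated with simple pixel differencing. The pilot's actual change detection algorithms are not described in these slides, so the sketch below is only a stand-in technique: it assumes two co-registered single-band images given as equal-sized lists of rows, and a hypothetical threshold parameter.

```python
def detect_changes(previous, current, threshold):
    """Return (row, col) positions where the absolute pixel difference exceeds threshold."""
    changed = []
    for r, (prev_row, cur_row) in enumerate(zip(previous, current)):
        for c, (p, q) in enumerate(zip(prev_row, cur_row)):
            if abs(q - p) > threshold:
                changed.append((r, c))
    return changed


if __name__ == "__main__":
    before = [[10, 10], [10, 10]]
    after = [[10, 50], [12, 10]]
    # Small differences (here 2) are treated as noise; large ones count as change.
    print(detect_changes(before, after, threshold=20))
```

Because each pixel is compared independently, this kind of computation partitions naturally across image tiles, which is what makes it a candidate for parallel execution.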
SC7 Pilot: Points Demonstrated
Secure Societies
• Re-implementation of change detection algorithms for Spark
• Parallel orchestrator for text analytics
o Re-uses existing software
o Scales to many input streams
• BDI provides:
o Cassandra for text content and metadata
o Strabon GIS store for detected change location
o Homogeneous access to both for analysis and visualization
Closing Remarks
Questions?
• BigDataEurope Web site: https://www.big-data-europe.eu
• Big Data Integrator: https://github.com/big-data-europe
• Thank you for your attention!

SC7 Workshop 2: Big Data Technologies and Scenarios
