Pilot SC4
BDE Workshop Brussels
14 Sept. 2017
BDE Workshop
Brussels 14.09.2017
Objective of the Pilot SC4
BDE Workshop Brussels
14 Sept. 2017
A scalable, fault-tolerant and
flexible platform based on
open source frameworks that
can process unbounded data
sets.
Microservice Architecture
BDE Workshop Brussels
14 Sept. 2017
Message Broker
BDE Workshop Brussels
14 Sept. 2017
Apache Kafka is a high-throughput
distributed durable messaging
system
Apache Kafka
Kafka Cluster
BDE Workshop Brussels
14 Sept. 2017
Apache Kafka
Stream and Batch
Processor
BDE Workshop Brussels
14 Sept. 2017
Apache Flink is an open source
platform for distributed stream and
batch data processing.
Apache Flink
Flink Cluster
BDE Workshop Brussels
14 Sept. 2017
Apache Flink
Storage and Indexing
BDE Workshop Brussels
14 Sept. 2017
PostGis is a spatial database that
stores the road network data.
Elasticsearch is a distributed open
source document database built on
top of Apache Lucene. It stores the
result of the workflow.
Elasticsearch Cluster
BDE Workshop Brussels
14 Sept. 2017
Pilot Architecture
BDE Workshop Brussels
14 Sept. 2017
BDE Components
BDE Workshop Brussels
14 Sept. 2017
The FCD Pipeline
BDE Workshop Brussels
14 Sept. 2017
Pilot Cluster
BDE Workshop Brussels
14 Sept. 2017
Minimum requirement for fault-
tolerance and scalability
● Cluster of 3 nodes (Docker swarm)
● 4 CPU cores x node
● 1 (Flink) worker x node
● 1 (Flink) slot x CPU core
Max parallelism = 12
Parallelization: map-match subtasks
BDE Workshop Brussels
14 Sept. 2017
1. source()
2. mapMatch()
3. keyBy()/window()/apply()
4. sink()
The subtasks can be distributed in
slots with different parallelism (e.g.
from 1 to 12)
Parallelization: map-match subtasks
BDE Workshop Brussels
14 Sept. 2017
A slot can process all the subtasks in a
pipeline
Parallelization: input and output data
BDE Workshop Brussels
14 Sept. 2017
device_id timestamp lat lon speed orientation transit
The mapMatch subtask keeps the time
order so that the next task
keyBy(road_seg)/window(15’)/apply(avera
ge_speed) will return the correct result
within the time window for each road
segment.
road_seg_id start_date num_vehicles avg_speed
SC4 Pilot Pipeline
BDE Workshop Brussels
14 Sept. 2017
Data Upload
BDE Workshop Brussels
14 Sept. 2017
Producer and Consumer
BDE Workshop Brussels
14 Sept. 2017
Visualization
BDE Workshop Brussels
14 Sept. 2017
The pilot SC4 can process
real-time FCD data for
map-matching and classify
a road segment according
to the traffic level.
Short-term traffic forecast
BDE Workshop Brussels
14 Sept. 2017
Algorithm: Feedforward ANN
Short-term traffic forecast
BDE Workshop Brussels
14 Sept. 2017
Algorithm: Feedforward ANN
Hyperparameters (spatial and
temporal correlation):
Input layer units: (Dd*24*60*Cr)/Tw
Dd = number of days (e.g. working
days, 5 days)
Tw = time window (e.g. 30’)
Cr = connected road segments (e.g. 3)
-> 720 input units
SANSA-Stack: Big Data + Machine
Learning + Semantic Technologies
BDE Workshop Brussels
14 Sept. 2017
SANSA-Stack, part of the BDE project,
and RDF data sets based on semantic
technologies such as LinkedGeoData,
will enable more use cases related to
SC4
Thanks
BDE Workshop Brussels
14 Sept. 2017
BDE project website:
https://www.big-data-europe.eu/
Code repository:
https://github.com/big-data-europe
Contact:
luigi.selmi@iais.fraunhofer.de

BDE_SC4_WS3_6_Luigi Selmi - Pilot SC4

  • 1.
    Pilot SC4 BDE WorkshopBrussels 14 Sept. 2017 BDE Workshop Brussels 14.09.2017
  • 2.
    Objective of thePilot SC4 BDE Workshop Brussels 14 Sept. 2017 A scalable, fault-tolerant and flexible platform based on open source frameworks that can process unbounded data sets.
  • 3.
  • 4.
    Message Broker BDE WorkshopBrussels 14 Sept. 2017 Apache Kafka is a high-throughput distributed durable messaging system Apache Kafka
  • 5.
    Kafka Cluster BDE WorkshopBrussels 14 Sept. 2017 Apache Kafka
  • 6.
    Stream and Batch Processor BDEWorkshop Brussels 14 Sept. 2017 Apache Flink is an open source platform for distributed stream and batch data processing. Apache Flink
  • 7.
    Flink Cluster BDE WorkshopBrussels 14 Sept. 2017 Apache Flink
  • 8.
    Storage and Indexing BDEWorkshop Brussels 14 Sept. 2017 PostGis is a spatial database that stores the road network data. Elasticsearch is a distributed open source document database built on top of Apache Lucene. It stores the result of the workflow.
  • 9.
    Elasticsearch Cluster BDE WorkshopBrussels 14 Sept. 2017
  • 10.
    Pilot Architecture BDE WorkshopBrussels 14 Sept. 2017
  • 11.
    BDE Components BDE WorkshopBrussels 14 Sept. 2017
  • 12.
    The FCD Pipeline BDEWorkshop Brussels 14 Sept. 2017
  • 13.
    Pilot Cluster BDE WorkshopBrussels 14 Sept. 2017 Minimum requirement for fault- tolerance and scalability ● Cluster of 3 nodes (Docker swarm) ● 4 CPU cores x node ● 1 (Flink) worker x node ● 1 (Flink) slot x CPU core Max parallelism = 12
  • 14.
    Parallelization: map-match subtasks BDEWorkshop Brussels 14 Sept. 2017 1. source() 2. mapMatch() 3. keyBy()/window()/apply() 4. sink() The subtasks can be distributed in slots with different parallelism (e.g. from 1 to 12)
  • 15.
    Parallelization: map-match subtasks BDEWorkshop Brussels 14 Sept. 2017 A slot can process all the subtasks in a pipeline
  • 16.
    Parallelization: input andoutput data BDE Workshop Brussels 14 Sept. 2017 device_id timestamp lat lon speed orientation transit The mapMatch subtask keeps the time order so that the next task keyBy(road_seg)/window(15’)/apply(avera ge_speed) will return the correct result within the time window for each road segment. road_seg_id start_date num_vehicles avg_speed
  • 17.
    SC4 Pilot Pipeline BDEWorkshop Brussels 14 Sept. 2017
  • 18.
    Data Upload BDE WorkshopBrussels 14 Sept. 2017
  • 19.
    Producer and Consumer BDEWorkshop Brussels 14 Sept. 2017
  • 20.
    Visualization BDE Workshop Brussels 14Sept. 2017 The pilot SC4 can process real-time FCD data for map-matching and classify a road segment according to the traffic level.
  • 21.
    Short-term traffic forecast BDEWorkshop Brussels 14 Sept. 2017 Algorithm: Feedforward ANN
  • 22.
    Short-term traffic forecast BDEWorkshop Brussels 14 Sept. 2017 Algorithm: Feedforward ANN Hyperparameters (spatial and temporal correlation): Input layer units: (Dd*24*60*Cr)/Tw Dd = number of days (e.g. working days, 5 days) Tw = time window (e.g. 30’) Cr = connected road segments (e.g. 3) -> 720 input units
  • 23.
    SANSA-Stack: Big Data+ Machine Learning + Semantic Technologies BDE Workshop Brussels 14 Sept. 2017 SANSA-Stack, part of the BDE project, and RDF data sets based on semantic technologies such as LinkedGeoData, will enable more use cases related to SC4
  • 24.
    Thanks BDE Workshop Brussels 14Sept. 2017 BDE project website: https://www.big-data-europe.eu/ Code repository: https://github.com/big-data-europe Contact: luigi.selmi@iais.fraunhofer.de