Big Data and Machine Learning with FIWARE: An Architecture
Joaquín Salvachúa (joaquin.salvachua@upm.es)
Andrés Muñoz (joseandres.munoz@upm.es)
Sonsoles López (sonsoles.lopez.pernas@upm.es)
Gabriel Huecas (gabriel.huecas@upm.es)
Universidad Politécnica de Madrid
@jsalvachua, @anmunozx, @sonsoleslp, @ghuecas, @FIWARE
A map of what we may do here
https://twitter.com/Ronald_vanLoon/status/1171741337579937794
Machine Learning Algorithm
● Each application field may require a different algorithm
● Some solutions have high algorithmic complexity
● Some can be parallelized in a cluster (FlinkML)
● Others can use a GPU (TensorFlow, for example)
● Even though each case could be different, we try to set up a generic life cycle.
https://www.datasciencecentral.com/profiles/blogs/a-tour-of-machine-learning-algorithms-1
ML Standard Solution
● Each problem requires an analysis of which ML algorithm suits our data (so there can never be a fully standard solution, only something that covers most cases).
● Next, the training dataset needs to be set up (self-learning is even suitable for some cases).
● Each problem may be slightly different (“same same but different”).
● We can provide solutions for some cases, provided a proper dataset is used (some of the available anonymized datasets are not suitable for ML algorithms).
● The tool to use (Spark, Flink, TensorFlow) depends on the chosen ML algorithm.
● We are considering Apache Beam (Google) as a wrapper over all these technologies, to be added to the FIWARE Cosmos ecosystem.
Simple Smart Solutions: Reference Architecture
[Figure: reference architecture. Components shown: Draco, Kurento, Wirecloud, QuantumLeap, Knowage, Flink, CrateDB.]
Spark Components
[Figure: Spark components. Source: apache.org]
Spark Scheduler
● Dryad-like DAGs
● Pipelines functions within a stage
● Cache-aware work reuse & locality
● Partitioning-aware to avoid shuffles
[Figure: a DAG with nodes A to G connected by map, union, groupBy and join operations and split into Stage 1, Stage 2 and Stage 3; the legend marks cached data partitions.]
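As a rough illustration of how functions are pipelined within a stage, the sketch below splits a linear chain of operations into stages at shuffle boundaries. It is plain Python, not the Spark scheduler: the WIDE set and the split_into_stages helper are assumptions made for this example, and real Spark works on a full DAG and can skip the shuffle for a join when its inputs are already co-partitioned.

```python
# Operations assumed to need a shuffle in this simplified model.
# Everything else (map, filter, union) is treated as narrow and is pipelined.
WIDE = {"groupBy", "join"}


def split_into_stages(ops):
    """Split a linear chain of operation names into stages at shuffle boundaries."""
    stages = [[]]
    for op in ops:
        if op in WIDE and stages[-1]:
            stages.append([])  # a shuffle starts a new stage
        stages[-1].append(op)
    return stages


chain = ["map", "filter", "groupBy", "map", "join", "map"]
stages = split_into_stages(chain)
for number, stage in enumerate(stages, start=1):
    print(f"Stage {number}: {' -> '.join(stage)}")
```

Narrow operations stay in the stage of the operation before them, which is why a run of map and filter calls costs a single pass over each partition.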
Apache Flink
ML Lifecycle
Decision Trees
● A decision tree is just what it says…
● Tree that is used to make decisions
● Kind of like a flow chart
● Each node is a test condition
● Each branch is the outcome of the test represented by the corresponding node
● Leaf nodes contain the final decision
● Simple, simple, simple, …
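A minimal sketch of the idea, assuming a tree stored as nested dictionaries. The features, thresholds and labels are invented for illustration and do not come from a trained model.

```python
# Internal nodes test one feature against a threshold; leaves hold the decision.
# All features, thresholds and labels here are hypothetical.
tree = {
    "feature": "time", "threshold": 8,
    "left": {"leaf": "quiet"},              # time < 8
    "right": {                              # time >= 8
        "feature": "weekDay", "threshold": 5,
        "left": {"leaf": "busy"},           # weekDay < 5
        "right": {"leaf": "moderate"},      # weekDay >= 5
    },
}


def decide(node, sample):
    """Walk from the root to a leaf, following the outcome of each test."""
    while "leaf" not in node:
        branch = "left" if sample[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]


print(decide(tree, {"time": 10, "weekDay": 3}))
```

Reading the decision path back from the leaf is what makes a single tree so easy to explain.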
Random Forest
● Random forest (RF) is a generalization of a decision tree
● A decision tree is really, really simple
● Very intuitive and can be highly useful
● So, why do we need to generalize?
● Decision trees tend to overfit data
● Random forest mitigates this problem by averaging many trees
● But it loses some of the intuitive simplicity
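The averaging idea can be sketched in a few lines of plain Python. This is a simplified stand-in, not a library implementation: each "tree" is a depth-1 regression stump fitted on a bootstrap sample, and the helper names fit_stump and fit_forest are made up for this example. A real random forest also grows deeper trees and picks a random subset of features at each split, which is omitted here because there is only one feature. The purchase counts are the hourly values from the aggregation table in the hands-on part.

```python
import random


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold, one mean per side."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # every x in the sample is equal: predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Average many stumps, each trained on a bootstrap sample of the data."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(tree(x) for tree in trees) / len(trees)


hours = list(range(17))
purchases = [5, 3, 4, 3, 2, 8, 12, 12, 23, 45, 55, 37, 42, 41, 38, 29, 33]
predict = fit_forest(hours, purchases)
print(round(predict(2), 1), round(predict(12), 1))
```

Because every stump sees a different resample, their individual errors partly cancel when averaged, at the price of no longer having one readable tree.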
Random Forest Regression
https://towardsdatascience.com/random-forest-and-its-implementation-71824ced454f
Architecture
Draco: Persisting Context Data to MongoDB
Andrés Muñoz (joseandres.munoz@upm.es)
Universidad Politécnica de Madrid
@jsalvachua, @anmunozx, @sonsoleslp, @ghuecas, @FIWARE
The Draco GE
● The Draco Generic Enabler takes care of data ingestion and persistence. It is an easy-to-use, powerful, and reliable system for processing and distributing data. Internally, Draco is based on Apache NiFi.
● NiFi is a dataflow system based on the concepts of flow-based programming. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. It was built to automate the flow of data between systems.
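To make the flow-based idea concrete, here is a small sketch of a routing step, a transformation step and a sink chained together. It uses plain Python generators and does not call any NiFi or Draco API; the step names, the sample entities and the list used as a store are all assumptions for illustration.

```python
def route_on_type(flowfiles, wanted_type):
    """Routing step: pass on only the entities of the wanted type."""
    return (f for f in flowfiles if f.get("type") == wanted_type)


def flatten_attrs(flowfiles):
    """Transformation step: turn NGSI-style attributes into flat records."""
    for f in flowfiles:
        record = {"id": f["id"], "type": f["type"]}
        record.update({k: v["value"] for k, v in f.items() if isinstance(v, dict)})
        yield record


def persist(flowfiles, store):
    """Sink step: append each record to a store (a list stands in for a database)."""
    for f in flowfiles:
        store.append(f)


incoming = [
    {"id": "Room1", "type": "Room", "temperature": {"value": 21.5, "type": "Float"}},
    {"id": "Car1", "type": "Car", "speed": {"value": 80, "type": "Integer"}},
]
store = []
persist(flatten_attrs(route_on_type(incoming, "Room")), store)
print(store)
```

Each step only knows about its input and output, so steps can be rewired into a different graph without changing their code, which is the property flow-based programming relies on.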
Apache NiFi Architecture
Draco integration in the FIWARE ecosystem
MiNiFi (low-profile version)
Features
● Based on Apache NiFi.
● NGSI v2 support for both ingestion and serialization, for full integration with the Orion Context Broker.
● Several persistence backends:
  ○ MySQL, the well-known relational database manager.
  ○ MongoDB, the NoSQL document-oriented database.
  ○ PostgreSQL, the well-known relational database manager.
  ○ HDFS, the Hadoop distributed file system.
  ○ Cassandra, a distributed database.
  ○ CartoDB, for geospatial data.
● Templates for some common scenarios
● REST API
Basic Example
Demo
Cosmos: Loading Streaming Data using Flink and Spark
Sonsoles López (sonsoles.lopez.pernas@upm.es)
Universidad Politécnica de Madrid
@jsalvachua, @anmunozx, @sonsoleslp, @ghuecas, @FIWARE
The Cosmos GE
The Cosmos Generic Enabler simplifies Big Data analysis of context data and integrates with some of the most popular Big Data platforms.
Features
✔ Batch Processing
✔ Stream Processing (Real-time)
✔ Direct data ingestion
✔ Direct connection with Orion
✔ Multiple Sinks
[Diagram: COSMOS connected to the Orion Context Broker (the interface with the Internet of Things (IoT), robots and third-party systems), alongside DB, HDFS and Web service boxes.]
https://github.com/ging/fiware-cosmos
fiware-cosmos-orion-flink-connector
● https://github.com/ging/fiware-cosmos-orion-flink-connector
● https://github.com/ging/fiware-cosmos-orion-flink-connector-examples
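[Diagram: the ORION Context Broker sends HTTP POST notifications to the OrionSource of a Flink job (JAR) that uses the orion-flink-connector and runs in a Flink cluster; the job's OrionSink sends HTTP POST/PUT/PATCH requests back to Orion.]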
fiware-cosmos-orion-spark-connector
● https://github.com/ging/fiware-cosmos-orion-spark-connector
● https://github.com/ging/fiware-cosmos-orion-spark-connector-examples
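[Diagram: the ORION Context Broker sends HTTP POST notifications to the OrionReceiver of a Spark job (JAR) that uses the orion-spark-connector and runs in a Spark cluster; the job's OrionSink sends HTTP POST/PUT/PATCH requests back to Orion.]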
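Demo: Average temperature for each entity
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.fiware.cosmos.orion.flink.connector.OrionSource

// The object name is chosen for illustration; the slide shows only main.
// `Average` is a user-defined AggregateFunction that is not shown on this slide.
object TemperatureAverage {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Create Orion Source. Receive notifications on port 9001
    val eventStream = env.addSource(new OrionSource(9001))
    // Process event stream
    val processedDataStream = eventStream
      .flatMap(event => event.entities)
      .map(entity => {
        // Read the temperature attribute of each notified entity as a Float
        val temp = entity.attrs("temperature").value.asInstanceOf[Number].floatValue()
        (entity.id, temp)
      })
      .keyBy(0) // key by entity id (first tuple field)
      .timeWindow(Time.seconds(10)) // group each key into 10-second windows
      .aggregate(new Average)
    // print the results with a single thread, rather than in parallel
    processedDataStream.print().setParallelism(1)
    env.execute("Temperature avg example")
  }
}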
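What the keyed window and the aggregation compute can be sketched without a cluster. The snippet below is plain Python, not the Flink API: the windowed_average helper, the sample readings and the explicit timestamps are assumptions for illustration, whereas the Flink job gets its time from the stream itself.

```python
from collections import defaultdict


def windowed_average(readings, window_seconds=10):
    """Average value per (entity id, tumbling window).

    readings: iterable of (timestamp_seconds, entity_id, value) tuples.
    Returns {(entity_id, window_start): average}.
    """
    sums = defaultdict(lambda: [0.0, 0])  # running [total, count] per key and window
    for ts, entity_id, value in readings:
        window_start = int(ts // window_seconds) * window_seconds
        acc = sums[(entity_id, window_start)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


readings = [
    (1, "Room1", 20.0), (4, "Room1", 22.0), (7, "Room2", 30.0),
    (12, "Room1", 25.0),
]
averages = windowed_average(readings)
print(averages)
```

Keeping only a running total and a count per key is the same incremental pattern an aggregate function follows, so no window has to buffer all of its readings.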
Current status
Orion Connector
Orion Source/Receiver + Orion Sink ✔ ✔
RTD Documentation ✔ ✔
Unit Tests ✔ ✔
Examples ✔ ✔
Step-by-step tutorial ✔
NGSI-LD support
Big Data and Machine Learning with FIWARE: Hands-on
Predicting supermarket purchases
A use case
Purchase data
How we model each purchase:
● date
● client_id
● supermarket_id
● product_list
○ description
○ n_items
○ price
Data aggregation (for each store)
time  day  month  year  weekDay  purchases
   0   14      1  2016        3          5
   1   14      1  2016        3          3
   2   14      1  2016        3          4
   3   14      1  2016        3          3
   4   14      1  2016        3          2
   5   14      1  2016        3          8
   6   14      1  2016        3         12
   7   14      1  2016        3         12
   8   14      1  2016        3         23
   9   14      1  2016        3         45
  10   14      1  2016        3         55
  11   14      1  2016        3         37
  12   14      1  2016        3         42
  13   14      1  2016        3         41
  14   14      1  2016        3         38
  15   14      1  2016        3         29
  16   14      1  2016        3         33
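A sketch of how such rows could be derived from raw purchase timestamps for one store. The aggregate_by_hour helper and the sample dates are assumptions for illustration. The weekDay column is assumed to count from Monday = 0, which is consistent with the table (14 January 2016 was a Thursday, shown as 3), but the original pipeline may use a different convention.

```python
from collections import Counter
from datetime import datetime


def aggregate_by_hour(purchase_dates):
    """Count purchases per hour and emit rows in the table's column order."""
    counts = Counter(
        d.replace(minute=0, second=0, microsecond=0) for d in purchase_dates
    )
    return [
        {"time": h.hour, "day": h.day, "month": h.month, "year": h.year,
         "weekDay": h.weekday(), "purchases": n}
        for h, n in sorted(counts.items())
    ]


# Hypothetical purchase timestamps for a single store
dates = [
    datetime(2016, 1, 14, 0, 5),
    datetime(2016, 1, 14, 0, 40),
    datetime(2016, 1, 14, 1, 15),
]
rows = aggregate_by_hour(dates)
for row in rows:
    print(row)
```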
Training our model
[Diagram: the columns time, day, month, year and weekDay, together with the purchases column, feed the Random Forest Regression Algorithm, which outputs the Trained Predictive Model.]
Using our model to predict purchases
[Diagram: time, day, month, year and weekDay go into the Trained Predictive Model, which fills in the unknown purchases value (?): the predicted number of purchases at the given date and time.]
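A sketch of how the model inputs for a given date and time might be built and wrapped as NGSI v2 attributes. The attribute names and types are those of the ReqTicketPrediction entity shown on the "Orion entities and subscriptions" slide; the helper names features_for and request_attrs, the sample identifiers and the Monday = 0 weekDay convention are assumptions.

```python
from datetime import datetime


def features_for(moment):
    """Model inputs for one date and time, named like the table columns."""
    return {"time": moment.hour, "day": moment.day, "month": moment.month,
            "year": moment.year, "weekDay": moment.weekday()}


def request_attrs(moment, prediction_id, socket_id):
    """NGSI v2 attribute payload for a ReqTicketPrediction entity."""
    attrs = {
        "predictionId": {"value": prediction_id, "type": "String"},
        "socketId": {"value": socket_id, "type": "String"},
    }
    for name, value in features_for(moment).items():
        attrs[name] = {"value": value, "type": "Integer"}
    return attrs


# Hypothetical request: how many purchases on 14 January 2016 at 09:00?
payload = request_attrs(datetime(2016, 1, 14, 9), "p-1", "s-1")
print(payload)
```

A payload of this shape is what an NGSI v2 attribute update on the request entity would carry; the trained model only ever sees the five integer features.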
Architecture
Orion entities and subscriptions
{
  "id": "ReqTicketPrediction1",
  "type": "ReqTicketPrediction",
  "predictionId": {
    "value": 0,
    "type": "String"
  },
  "socketId": {
    "value": 0,
    "type": "String"
  },
  "year": {
    "value": 0,
    "type": "Integer"
  },
  "month": {
    "value": 0,
    "type": "Integer"
  },
  "day": {
    "value": 0,
    "type": "Integer"
  },
  "time": {
    "value": 0,
    "type": "Integer"
  },
  "weekDay": {
    "value": 0,
    "type": "Integer"
  }
}
{
  "id": "ResTicketPrediction1",
  "type": "ResTicketPrediction",
  "predictionId": {
    "value": 0,
    "type": "String"
  },
  "socketId": {
    "value": 0,
    "type": "String"
  },
  "predictionValue": {
    "value": 0,
    "type": "Float"
  },
  "year": {
    "value": 0,
    "type": "Integer"
  },
  "month": {
    "value": 0,
    "type": "Integer"
  },
  "day": {
    "value": 0,
    "type": "Integer"
  },
  "time": {
    "value": 0,
    "type": "Integer"
  }
}
[Diagram: Orion Context Broker, Draco, web application (www) and Spark Master, with ports 9001, 3000 and 5050 and the entities ResTicketPrediction1 and ReqTicketPrediction1.]
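The slide shows the two entities but not a subscription. As a sketch, an NGSI v2 subscription body that notifies a listener whenever a ReqTicketPrediction entity changes could be built like this. The build_subscription helper, the description text and the notification URL are assumptions: the host name is hypothetical and port 9001 is taken from the diagram.

```python
import json


def build_subscription(entity_type, attrs, notify_url):
    """Build an NGSI v2 subscription body for all entities of one type."""
    return {
        "description": f"Notify changes of {entity_type}",
        "subject": {
            "entities": [{"idPattern": ".*", "type": entity_type}],
            "condition": {"attrs": attrs},
        },
        "notification": {"http": {"url": notify_url}, "attrs": attrs},
    }


subscription = build_subscription(
    "ReqTicketPrediction",
    ["predictionId", "socketId", "year", "month", "day", "time", "weekDay"],
    "http://spark-master:9001",  # hypothetical host name; port from the diagram
)
body = json.dumps(subscription, indent=2)
print(body)
```

The body would be sent with a POST to the broker's /v2/subscriptions endpoint; a second subscription on ResTicketPrediction would carry the result back in the same way.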
Get the code!
https://github.com/ging/fiware-global-summit-berlin-2019-ml
Open your laptop
git clone https://github.com/ging/fiware-global-summit-berlin-2019-ml
docker-compose up
Open your browser: http://localhost:3000
IT WORKS!!!!
Demo
Thank you!
http://fiware.org
Follow @FIWARE on Twitter

FIWARE Global Summit - Big Data and Machine Learning with FIWARE
