FIWARE Wednesday Webinars - Performing Big Data Analysis Using Cosmos With Spark or Flink

Big Data and Machine Learning with
FIWARE: An Architecture
Joaquín Salvachúa (joaquin.salvachua@upm.es)
Andrés Muñoz (joseandres.munoz@upm.es)
Sonsoles López (sonsoles.lopez.pernas@upm.es)
Gabriel Huecas (gabriel.huecas@upm.es)
Universidad Politécnica de Madrid
@jsalvachua, @anmunozx, @sonsoleslp, @ghuecas, @FIWARE

The data pyramid
Grazzini, Jacopo & Pantisano, Francesco. (2015). Guidelines for scientific evidence provision for policy support based on Big Data and open technologies. 10.2788/329540.

The Data Science Venn Diagram
https://www.kdnuggets.com/2016/03/data-science-puzzle-explained.html

Machine Learning Algorithms
● Each application field may require a different algorithm.
● Some solutions have high algorithmic complexity.
● Some can be parallelized in a cluster (Spark ML).
● Others can use a GPU (TensorFlow, for example).
● Even if each case is different, we try to set up a generic life cycle.
https://www.techleer.com/articles/203-machine-learning-algorithm-backbone-of-emerging-technologies/

ML Standard Solution
● Each problem requires an analysis of which ML algorithm suits our data, so there can never be a fully standard solution, only one that covers most cases.
● Next, the training dataset needs to be set up (for some cases, self-learning is also suitable).
● Each problem may be slightly different (“same same but different”).
● We can provide solutions for some cases, given a proper dataset (some of the available anonymized datasets are not suitable for ML algorithms).
● The tool to use (Spark, Flink, TensorFlow) depends on the chosen ML algorithm.

Simple Smart solutions: Reference
Architecture
Draco
Kurento
Wirecloud
QuantumLeap
Knowage
Flink
CrateDB

Spark Scheduler
● Dryad-like DAGs
● Pipelines functions within a stage
● Cache-aware work reuse & locality
● Partitioning-aware to avoid shuffles
[Figure: example DAG of datasets A to G connected by map, groupBy, union and join operations and grouped into Stages 1 to 3; the legend marks cached data partitions]
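The idea of pipelining functions within a stage can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Spark's actual scheduler: it takes a linear chain of operation names, treats `groupBy` and `join` as always requiring a shuffle (real Spark can avoid the shuffle when the data is already partitioned appropriately), and starts a new stage at each of them. The function name `split_into_stages` is hypothetical.

```python
# Assumption for this sketch: these operations always shuffle data.
WIDE = {"groupBy", "join"}


def split_into_stages(ops):
    """Group consecutive operations into stages, cutting at each shuffle."""
    stages, current = [], []
    for op in ops:
        # A shuffling operation closes the stage that was being pipelined.
        if op in WIDE and current:
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


for number, stage in enumerate(
    split_into_stages(["map", "filter", "groupBy", "map", "join", "map"]), start=1
):
    print(f"Stage {number}: {' -> '.join(stage)}")
```

Operations that end up in the same stage can be applied one after another to each partition without moving data between workers.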

Decision Trees
● A decision tree is just what it says: a tree that is used to make decisions
● Kind of like a flow chart
● Each node is a test condition
● Each branch is an outcome of the test at its node
● Leaf nodes contain the final decision
● Simple, simple, simple, …
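The structure described above (test conditions in nodes, outcomes as branches, decisions in leaves) can be sketched in a few lines. The tree below is hand-written and hypothetical: the thresholds, the "busy"/"quiet" labels and the names `Node`, `Leaf` and `decide` are illustrative assumptions, not something learned from data.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    decision: str  # final decision stored in a leaf node


@dataclass
class Node:
    test: Callable[[dict], bool]   # test condition evaluated at this node
    if_true: Union["Node", Leaf]   # branch followed when the test passes
    if_false: Union["Node", Leaf]  # branch followed when the test fails


def decide(tree, sample):
    """Walk from the root to a leaf, following one branch per test outcome."""
    while isinstance(tree, Node):
        tree = tree.if_true if tree.test(sample) else tree.if_false
    return tree.decision


# Hypothetical tree: is the supermarket likely to be busy?
tree = Node(
    test=lambda s: s["weekDay"] >= 6,
    if_true=Leaf("busy"),
    if_false=Node(
        test=lambda s: 9 <= s["time"] <= 14,
        if_true=Leaf("busy"),
        if_false=Leaf("quiet"),
    ),
)

print(decide(tree, {"weekDay": 3, "time": 10}))
print(decide(tree, {"weekDay": 3, "time": 2}))
```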

Random Forest
● Random forest (RF) is a generalization of a decision tree
● A decision tree is really, really simple
● Very intuitive and can be highly useful
● So, why do we need to generalize?
● Decision trees tend to overfit data
● Random forest reduces this problem
● But it loses some of the intuitive simplicity
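A rough sketch of the generalization, in plain Python rather than Spark ML: many very small trees are trained, each on a bootstrap sample and a randomly chosen feature, and their predictions are averaged. To keep it short, each tree here is a depth-1 regression tree (a "stump"), which is a deliberate simplification of a real random forest. The function names and the `(time, weekDay)` feature layout are assumptions made for illustration.

```python
import random
from statistics import mean


def fit_stump(rows, targets, feature):
    """Depth-1 regression tree on one feature: choose the threshold that
    minimises the squared error around the two leaf means."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
        right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
        if not left or not right:
            continue
        lm, rm = mean(left), mean(right)
        error = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or error < best[0]:
            best = (error, threshold, lm, rm)
    if best is None:  # feature is constant in this sample: predict the mean
        m = mean(targets)
        return (feature, float("inf"), m, m)
    _, threshold, lm, rm = best
    return (feature, threshold, lm, rm)


def fit_forest(rows, targets, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        feature = rng.randrange(len(rows[0]))           # random feature per tree
        forest.append(
            fit_stump([rows[i] for i in idx], [targets[i] for i in idx], feature)
        )
    return forest


def predict(forest, row):
    """Average the predictions of all trees."""
    return mean(lm if row[f] <= th else rm for f, th, lm, rm in forest)


# Purchases per hour taken from the dataset slide (14 January 2016, hours 0 to 16).
purchases = [5, 3, 4, 3, 2, 8, 12, 12, 23, 45, 55, 37, 42, 41, 38, 29, 33]
rows = [(time, 3) for time in range(len(purchases))]  # (time, weekDay)
forest = fit_forest(rows, purchases)
print(predict(forest, (10, 3)))
```

Averaging over many trees trained on different samples is what smooths out the overfitting of a single tree, at the cost of no longer having one flow chart to read.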

Random Forest Regression
https://towardsdatascience.com/random-forest-and-its-implementation-71824ced454f

Draco: Persisting Context Data to MongoDB

The Draco GE
● The Draco Generic Enabler takes care of data ingestion and persistence. It is an easy-to-use, powerful, and reliable system for processing and distributing data. Internally, Draco is based on Apache NiFi.
● NiFi is a dataflow system based on the concepts of flow-based programming. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. It was built to automate the flow of data between systems.

Draco integration in the FIWARE
ecosystem
MiNiFi (low-profile version)

Features
● Based on Apache NiFi.
● NGSI v2 support for both ingestion and serialization, for full integration with the Orion Context Broker.
● Several persistence backends:
  ● MySQL, the well-known relational database manager.
  ● MongoDB, the NoSQL document-oriented database.
  ● PostgreSQL, another renowned relational database manager.
  ● HDFS, the Hadoop distributed file system.
  ● Cassandra, a distributed database.
  ● CartoDB, for geospatial data.
● Templates for some common scenarios.
● REST API.

Cosmos: Loading Streaming Data
using Flink and Spark

The Cosmos GE
The Cosmos Generic Enabler makes Big Data analysis of context data easier by integrating with some of the most popular Big Data platforms.
Features
✔ Batch Processing
✔ Stream Processing (Real-time)
✔ Direct data ingestion
✔ Direct connection with Orion
✔ Multiple Sinks
[Figure: Orion Context Broker (interface with the Internet of Things (IoT), robots and third-party systems) connected to COSMOS, with DB, HDFS and web service as sinks]
https://github.com/ging/fiware-cosmos

fiware-cosmos-orion-flink-connector
● https://github.com/ging/fiware-cosmos-orion-flink-connector
● https://github.com/ging/fiware-cosmos-orion-flink-connector-examples
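[Figure: the ORION Context Broker sends notifications via HTTP POST to the OrionSource of the orion-flink-connector, inside a Flink job (JAR) running on the Flink cluster; the connector's OrionSink writes back to Orion via HTTP POST/PUT/PATCH]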
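What arrives at the source side is an NGSI v2 notification: a JSON body with a `subscriptionId` and a `data` array of entities. The connector itself is a JVM library; the Python sketch below is not its API and only illustrates how such a body could be unpacked into attribute values. The function name `parse_notification`, the subscription id and the attribute values are hypothetical.

```python
import json


def parse_notification(body):
    """Turn an NGSI v2 notification body into a list of simplified entities."""
    payload = json.loads(body)
    entities = []
    for entity in payload.get("data", []):
        # Everything except "id" and "type" is an attribute with a "value".
        attrs = {
            name: attr["value"]
            for name, attr in entity.items()
            if name not in ("id", "type")
        }
        entities.append({"id": entity["id"], "type": entity["type"], "attrs": attrs})
    return entities


# Hypothetical notification for the prediction request entity.
body = json.dumps({
    "subscriptionId": "hypothetical-subscription-id",
    "data": [{
        "id": "ReqTicketPrediction1",
        "type": "ReqTicketPrediction",
        "year": {"type": "Integer", "value": 2016, "metadata": {}},
        "month": {"type": "Integer", "value": 1, "metadata": {}},
        "day": {"type": "Integer", "value": 14, "metadata": {}},
        "time": {"type": "Integer", "value": 10, "metadata": {}},
        "weekDay": {"type": "Integer", "value": 3, "metadata": {}},
    }],
})

for entity in parse_notification(body):
    print(entity["id"], entity["attrs"])
```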

fiware-cosmos-orion-spark-connector
● https://github.com/ging/fiware-cosmos-orion-spark-connector
● https://github.com/ging/fiware-cosmos-orion-spark-connector-examples
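[Figure: the ORION Context Broker sends notifications via HTTP POST to the OrionReceiver of the orion-spark-connector, inside a Spark job (JAR) running on the Spark cluster; the connector's OrionSink writes back to Orion via HTTP POST/PUT/PATCH]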
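On the sink side, results go back to Orion as ordinary NGSI v2 requests. As a hedged illustration (again, not the connector's own API), the sketch below builds a `PATCH /v2/entities/{id}/attrs` request that would update attributes of an existing entity. The request is only constructed, not sent; the Orion address `http://localhost:1026`, the helper name `build_update_request` and the value 42.0 are assumptions.

```python
import json
from urllib.request import Request


def build_update_request(orion_url, entity_id, attrs):
    """Build (without sending) an NGSI v2 request updating entity attributes."""
    return Request(
        url=f"{orion_url}/v2/entities/{entity_id}/attrs",
        data=json.dumps(attrs).encode("utf-8"),
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical prediction result written to the response entity.
request = build_update_request(
    "http://localhost:1026",
    "ResTicketPrediction1",
    {"predictionValue": {"value": 42.0, "type": "Float"}},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would actually send it, which requires a running Orion instance.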

Current status
Orion Connector
Orion Source/Receiver + Orion Sink ✓ ✓
RTD Documentation ✓ ✓
Examples ✓ ✓
Step-by-step tutorial ✓ ✓
Support NGSI v2 ✓ ✓
Support NGSI LD ✓ ✓

Big Data and Machine Learning with
FIWARE: Hands-on
Joaquín Salvachúa (joaquin.salvachua@upm.es)
Gabriel Huecas (gabriel.huecas@upm.es)

Predicting supermarket purchases
A use case

Get the code!
https://github.com/ging/fiware-ml-supermarket

The dataset
time day month year weekDay purchases
0 14 1 2016 3 5
1 14 1 2016 3 3
2 14 1 2016 3 4
3 14 1 2016 3 3
4 14 1 2016 3 2
5 14 1 2016 3 8
6 14 1 2016 3 12
7 14 1 2016 3 12
8 14 1 2016 3 23
9 14 1 2016 3 45
10 14 1 2016 3 55
11 14 1 2016 3 37
12 14 1 2016 3 42
13 14 1 2016 3 41
14 14 1 2016 3 38
15 14 1 2016 3 29
16 14 1 2016 3 33
Dataset
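Before training, each row of the table has to be split into input features and the value to predict. The following is a minimal sketch of that step in plain Python, assuming the data is available as whitespace-separated text like the excerpt above; the helper names `load_rows` and `to_features_and_label` are hypothetical, and the actual project may load the data differently.

```python
SAMPLE = """time day month year weekDay purchases
0 14 1 2016 3 5
1 14 1 2016 3 3
2 14 1 2016 3 4"""


def load_rows(text):
    """Parse the whitespace-separated table into one dict per row."""
    header, *lines = text.strip().splitlines()
    names = header.split()
    return [dict(zip(names, map(int, line.split()))) for line in lines]


def to_features_and_label(row, label="purchases"):
    """Separate the columns used as input from the column to predict."""
    features = [row[name] for name in row if name != label]
    return features, row[label]


for row in load_rows(SAMPLE):
    print(to_features_and_label(row))
```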

Training our model
[Figure: training job. The columns time, day, month, year, weekDay and purchases are fed to the Random Forest Regression algorithm, which produces the trained predictive model]
Using our model to predict purchases
[Figure: prediction job. Given time, day, month, year and weekDay, the trained predictive model outputs purchases, the predicted number of purchases at the given date and time]

User interface: web application
Web server

The complete scenario
[Figure: deployment diagram of the complete scenario, with steps numbered 1 to 10]

Orion entities and subscriptions
{
  "id": "ReqTicketPrediction1",
  "type": "ReqTicketPrediction",
  "predictionId": {
    "value": 0,
    "type": "String"
  },
  "socketId": {
    "value": 0,
    "type": "String"
  },
  "year": {
    "value": 0,
    "type": "Integer"
  },
  "month": {
    "value": 0,
    "type": "Integer"
  },
  "day": {
    "value": 0,
    "type": "Integer"
  },
  "time": {
    "value": 0,
    "type": "Integer"
  },
  "weekDay": {
    "value": 0,
    "type": "Integer"
  }
}

{
  "id": "ResTicketPrediction1",
  "type": "ResTicketPrediction",
  "predictionId": {
    "value": 0,
    "type": "String"
  },
  "socketId": {
    "value": 0,
    "type": "String"
  },
  "year": {
    "value": 0,
    "type": "Integer"
  },
  "month": {
    "value": 0,
    "type": "Integer"
  },
  "day": {
    "value": 0,
    "type": "Integer"
  },
  "time": {
    "value": 0,
    "type": "Integer"
  },
  "predictionValue": {
    "value": 0,
    "type": "Float"
  }
}
[Figure: the entities ReqTicketPrediction1 and ResTicketPrediction1 exchanged between the web application (www), the Orion Context Broker, Draco and the Spark Master; ports shown: 9001, 3000, 5050]
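The slide title mentions subscriptions but only the entities are shown. As a sketch of what a matching NGSI v2 subscription could look like, the snippet below builds a payload that asks Orion to notify a listener whenever `predictionId` of the request entity changes; it would be sent with `POST /v2/subscriptions`. The notification URL `http://spark-worker:9001`, the description text and the choice of `predictionId` as the trigger are assumptions, not taken from the project.

```python
import json


def build_subscription(notify_url):
    """NGSI v2 subscription: notify `notify_url` about new prediction requests."""
    return {
        "description": "Notify the prediction job of new prediction requests",
        "subject": {
            "entities": [
                {"id": "ReqTicketPrediction1", "type": "ReqTicketPrediction"}
            ],
            # Assumption: a new request is signalled by a changed predictionId.
            "condition": {"attrs": ["predictionId"]},
        },
        "notification": {
            "http": {"url": notify_url},
            "attrs": [
                "predictionId", "socketId",
                "year", "month", "day", "time", "weekDay",
            ],
        },
    }


# Hypothetical listener address; port 9001 appears in the diagram above.
subscription = build_subscription("http://spark-worker:9001")
print(json.dumps(subscription, indent=2))
```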

Open your laptop
git clone https://github.com/ging/fiware-ml-supermarket
cd fiware-ml-supermarket
docker-compose up
Open your browser: http://localhost:3000

Thank you!
http://fiware.org
Follow @FIWARE on Twitter
