FIWARE Wednesday Webinars - Machine Learning with Cosmos and Spark

Machine Learning with Cosmos and Spark
Joaquín Salvachúa (joaquin.salvachua@upm.es)
Andrés Muñoz (joseandres.munoz@upm.es)
Sonsoles López (sonsoles.lopez.pernas@upm.es)
Gabriel Huecas (gabriel.huecas@upm.es)
Universidad Politécnica de Madrid
@jsalvachua, @anmunozx, @sonsoleslp, @ghuecas, @FIWARE

The data pyramid
Grazzini, Jacopo & Pantisano, Francesco. (2015). Guidelines for scientific evidence provision for policy support based on Big Data and open technologies. 10.2788/329540.

https://mattturck.com/data2020/

Simple Smart solutions
Reference Architecture
[Figure: reference architecture diagram; components shown include Draco, Kurento, Wirecloud, QuantumLeap, Knowage, Flink and CrateDB]

Hadoop Ecosystem
[Figure: layer diagram with blocks A, B, C and D]

Spark Scheduler
● Dryad-like DAGs
● Pipelines functions within a stage
● Cache-aware work reuse and locality
● Partitioning-aware to avoid shuffles
[Figure: example DAG over datasets A to G, with join, union, groupBy and map operations grouped into Stage 1, Stage 2 and Stage 3; the legend marks cached data partitions]
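As a rough illustration of how functions are pipelined within a stage, the sketch below cuts a linear chain of operations into stages at each shuffle boundary. It is a toy model, not Spark's scheduler: `WIDE_OPS` and `split_into_stages` are hypothetical names introduced here, and real Spark is partitioning-aware, so a join or groupBy on data that is already co-partitioned does not necessarily start a new stage.

```python
# Toy model of stage construction: narrow operations are pipelined into the
# current stage, while a wide (shuffle) operation starts a new stage.
# WIDE_OPS and split_into_stages are illustrative names, not Spark APIs.
WIDE_OPS = {"groupBy", "join"}


def split_into_stages(ops):
    stages = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])  # shuffle boundary: close the current stage
        stages[-1].append(op)
    return stages


plan = ["map", "filter", "groupBy", "map", "join", "map"]
stages = split_into_stages(plan)
for number, stage in enumerate(stages, start=1):
    print(f"Stage {number}: {' -> '.join(stage)}")
```

Everything between two shuffles runs as one pipelined pass over each partition, which is why fewer shuffles generally means fewer stages.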

ML Standard Solution
● Each problem requires analyzing which ML algorithm suits the data, so there can never be a fully standard solution, only one that covers most cases.
● Next, the training dataset needs to be set up (self-learning is suitable in some cases).
● Each problem may be slightly different (“same same but different”).
● We can provide solutions for some cases, given a proper dataset (some of the available anonymized datasets are not suitable for ML algorithms).
● The tool to use (Spark, Flink, TensorFlow) depends on the chosen ML algorithm.

Machine Learning Algorithms
● Each application field may require a different algorithm
● Some solutions have high algorithmic complexity
● Some can be parallelized in a cluster (Spark ML)
● Others can use GPUs (TensorFlow, for example)
● Even though each case is different, we try to set up a generic life cycle.
https://www.techleer.com/articles/203-machine-learning-algorithm-backbone-of-emerging-technologies/

Decision Trees
● A decision tree is just what it says…
● A tree that is used to make decisions
● Kind of like a flow chart
● Each node is a test condition
● Each branch is an outcome of the test at its parent node
● Leaf nodes contain the final decision
● Simple, simple, simple, …
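The structure described above can be sketched in a few lines of Python. This is only an illustration: the feature names, thresholds and decisions are invented for the example (`Distance` in particular is a hypothetical feature) and are not taken from the flight-delay model.

```python
# A decision tree as nested dicts: internal nodes test one feature against a
# threshold, branches are the outcomes of that test, leaves hold the decision.
# All names and numbers below are made up for illustration.
tree = {
    "feature": "DepDelay", "threshold": 15.0,
    "left": {"decision": "on time"},           # DepDelay <= 15
    "right": {                                 # DepDelay > 15
        "feature": "Distance", "threshold": 1000.0,
        "left": {"decision": "late"},          # short flight, no time to recover
        "right": {"decision": "slightly late"},
    },
}


def decide(node, sample):
    """Walk from the root to a leaf, following one branch per test."""
    while "decision" not in node:
        outcome = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if outcome else node["right"]
    return node["decision"]


print(decide(tree, {"DepDelay": 40.0, "Distance": 2500.0}))
```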

Random Forest
● Random forest (RF) is a generalization of a decision tree
● A decision tree is really, really simple
● Very intuitive and can be highly useful
● So, why do we need to generalize?
● Decision trees tend to overfit data
● Random forest mitigates this problem by combining many trees
● But it loses some of the intuitive simplicity
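To make the idea concrete, here is a minimal sketch of the two ingredients of a random forest: each tree is trained on a bootstrap sample with a randomly chosen feature, and the forest predicts by majority vote. It is a simplification under stated assumptions: the trees are one-level "stumps", the data is synthetic, and the helper names (`train_stump`, `train_forest`, `predict`) are hypothetical. Real implementations such as Spark ML grow full trees.

```python
import random
from collections import Counter


def train_stump(rows, feature):
    """Pick the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({x[feature] for x, _ in rows}):
        left = [y for x, y in rows if x[feature] <= threshold]
        right = [y for x, y in rows if x[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, left_label, right_label)
    return best[1:]


def train_forest(rows, n_trees, rng):
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]   # bootstrap: with replacement
        feature = rng.randrange(len(rows[0][0]))    # random feature per tree
        forest.append(train_stump(sample, feature))
    return forest


def predict(forest, x):
    votes = [left if x[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]      # majority vote


# Synthetic rows: ((feature_0, feature_1), label), label 1 meaning "late".
rows = [((0, 1), 0), ((3, 0), 0), ((5, 2), 0), ((8, 4), 0),
        ((30, 20), 1), ((45, 25), 1), ((60, 35), 1), ((90, 50), 1)]
forest = train_forest(rows, n_trees=25, rng=random.Random(0))
print(predict(forest, (0, 0)), predict(forest, (100, 60)))
```

A single stump trained on an unlucky sample can be wrong, but the vote across many differently sampled trees smooths those errors out, at the cost of no longer having one tree to read.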

Random Forest Classifier
https://towardsdatascience.com/random-forest-and-its-implementation-71824ced454f

Predicting flight delays
An ML use case with FIWARE and Spark

Generic Enablers needed
● FIWARE Orion Context Broker: Orion lets you manage the entire lifecycle of context information, including updates, queries, registrations and subscriptions. It is an NGSIv2 server implementation that manages context information and its availability.
● FIWARE Cosmos: The Cosmos Generic Enabler simplifies Big Data analysis of context data and integrates with some of the most popular Big Data platforms. It provides a connector for sending NGSI events to, and receiving them from, the Context Broker.
● FIWARE Draco: The Draco Generic Enabler takes care of data ingestion and persistence. It is an easy-to-use, powerful and reliable system for processing and distributing data. Internally, Draco is based on Apache NiFi.
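To give a feel for the NGSI events exchanged between these components, the sketch below parses a notification body in the NGSIv2 "normalized" format and flattens each entity into a plain record. It is plain Python, not the Cosmos connector itself: the `flatten` helper and the subscription id are hypothetical, and the attributes are borrowed from the request entity shown in Step 5.

```python
import json

# Example notification body in NGSIv2 normalized format. The subscription id
# is a made-up placeholder; the entity mirrors part of the Step 5 request.
notification = json.loads("""
{
  "subscriptionId": "hypothetical-subscription-id",
  "data": [
    {
      "id": "ReqFlightPrediction1",
      "type": "flight",
      "FlightNum": {"type": "int", "value": 15, "metadata": {}},
      "Origin": {"type": "String", "value": "ATL", "metadata": {}},
      "Dest": {"type": "String", "value": "SFO", "metadata": {}}
    }
  ]
}
""")


def flatten(entity):
    """Turn a normalized NGSIv2 entity into a flat {attribute: value} dict."""
    return {
        name: (attr["value"] if isinstance(attr, dict) else attr)
        for name, attr in entity.items()
    }


records = [flatten(entity) for entity in notification["data"]]
print(records[0])
```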

Get the code!
https://github.com/ging/fiware-ml-flights

Step 1: Getting the dataset
Dataset
● The dataset contains 90-95% of the flights departing from US airports in 2015 (457,013 flights), as published by the Bureau of Transportation Statistics.
● Some relevant fields:
  ● FlightDate: Flight date
  ● Carrier: Airline
  ● FlightNum: Flight number
  ● Origin: Airport of origin
  ● Dest: Destination airport
  ● DepDelay: Initial departure delay
  ● ArrDelay: Arrival delay
2015,1,1,1,4,2015-01-01,"AA",19805,"AA","N787AA","1",12478,...,31703,"JFK", ...
2015,1,1,2,5,2015-01-02,"AA",19805,"AA","N795AA","1",12478,...,31703,"JFK", ...
2015,1,1,3,6,2015-01-03,"AA",19805,"AA","N788AA","1",12478,...,31703,"JFK", ...
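A row like the ones above can be read with Python's standard `csv` module. The sketch below parses only the leading fields of the first sample row, because the rest of the row is elided in the slides. The column names are an assumption based on the usual layout of this kind of on-time performance file; they are not stated in the slides.

```python
import csv
import io

# Leading fields of the first sample row; the remainder is elided in the
# source, so only this prefix is parsed.
sample = '2015,1,1,1,4,2015-01-01,"AA",19805,"AA","N787AA","1",12478\n'

# Assumed column names for the prefix (not given in the slides).
columns = ["Year", "Quarter", "Month", "DayofMonth", "DayOfWeek", "FlightDate",
           "UniqueCarrier", "AirlineID", "Carrier", "TailNum", "FlightNum",
           "OriginAirportID"]

row = next(csv.reader(io.StringIO(sample)))
record = dict(zip(columns, row))  # every value is still a string here
print(record["FlightDate"], record["Carrier"], record["FlightNum"])
```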

Step 2: Training our model
[Figure: training job. The input features (FlightDate, Carrier, month, FlightNum, Origin, Dest, DepDelay, ...) and the label ArrDelay are fed to the Random Forest Classifier algorithm, which outputs the trained predictive model.]
All the available algorithms:
https://spark.apache.org/docs/latest/ml-classification-regression.html
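A classifier needs a discrete label, whereas ArrDelay is a number of minutes. One common way to bridge the two is to bucket the delay into a few classes before training. The sketch below shows the idea; the bucket edges and class names are assumptions chosen for illustration, since the slides do not state which ones the training job uses.

```python
from bisect import bisect_right

# Hypothetical bucket edges in minutes; the real ones are not given in the slides.
EDGES = [-15.0, 0.0, 30.0]
CLASS_NAMES = ["very early", "early", "on time or slightly late", "very late"]


def delay_class(arr_delay_minutes):
    """Map a numeric arrival delay to a class index between 0 and 3."""
    return bisect_right(EDGES, arr_delay_minutes)


for minutes in (-20.0, -5.0, 5.0, 45.0):
    index = delay_class(minutes)
    print(minutes, "->", index, CLASS_NAMES[index])
```

With a label of this kind, the prediction job returns a class index rather than a number of minutes.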

Step 3: Using our model to predict the flight arrival delay
[Figure: prediction job. The trained predictive model takes FlightDate, Carrier, month, FlightNum, Origin, Dest, DepDelay, ... as input and outputs the predicted delay (ArrDelay).]

Step 4: User interface (web application)
Web server

Step 5: Orion entities and subscriptions
{
  "id": "ReqFlightPrediction1",
  "type": "flight",
  "FlightNum": {
    "type": "int",
    "value": 15,
    "metadata": {}
  },
  "Origin": {
    "type": "String",
    "value": "ATL",
    "metadata": {}
  },
  "Dest": {
    "type": "String",
    "value": "SFO",
    "metadata": {}
  },
  [...]
  "predictionId": {
    "type": "String",
    "value": "3ba647df",
    "metadata": {}
  },
  "socketId": {
    "type": "String",
    "value": "23x34qc4",
    "metadata": {}
  }
}
[Figure: the web application (www), Orion Context Broker, Draco and Spark Master, with ports 9001, 5000 and 5050; Orion holds the entities ReqFlightPrediction1 and ResFlightPrediction1]
{
  "id": "ResFlightPrediction1",
  "type": "flight",
  "predictionId": {
    "type": "String",
    "value": "3ba647df",
    "metadata": {}
  },
  "socketId": {
    "type": "String",
    "value": "23x34qc4",
    "metadata": {}
  },
  "predictionValue": {
    "type": "String",
    "value": "0",
    "metadata": {}
  }
}
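A subscription tells Orion whom to notify when an entity changes. The sketch below builds, without sending, an NGSIv2 `POST /v2/subscriptions` request for the request entity. The payload structure follows the NGSIv2 API, but the Orion address, the listener URL and the choice of watched attributes are assumptions made for illustration; the slides do not show the actual subscriptions used in the scenario.

```python
import json
from urllib import request

ORION_URL = "http://localhost:1026"                 # assumed Orion address
NOTIFY_URL = "http://listener.example:9001/notify"  # hypothetical listener

subscription = {
    "description": "Notify when a flight prediction is requested",
    "subject": {
        "entities": [{"id": "ReqFlightPrediction1", "type": "flight"}],
        "condition": {"attrs": ["predictionId"]},   # assumed trigger attribute
    },
    "notification": {
        "http": {"url": NOTIFY_URL},
        "attrs": ["FlightNum", "Origin", "Dest", "predictionId", "socketId"],
    },
}


def build_subscription_request(payload):
    """Build (but do not send) the POST /v2/subscriptions request."""
    return request.Request(
        url=f"{ORION_URL}/v2/subscriptions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_subscription_request(subscription)
print(req.get_method(), req.full_url)
# To send it to a running Orion instance: request.urlopen(req)
```

The same pattern would apply to a subscription on ResFlightPrediction1, so that whichever component shows the result is notified once `predictionValue` is written.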

The complete scenario
[Figure: deployment diagram of the complete scenario, with steps numbered 1 to 10]

Video demo
https://drive.google.com/file/d/1qGcMeT1baejt-6u38PnReTTyAnxHV8_L/view?usp=sharing

Run the code yourself!
git clone https://github.com/ging/fiware-ml-flights/
python3 deploy-scenario.py
Open your browser: http://localhost:5000

More examples
Check our last webinar for another use case!

Thank you!
http://fiware.org
Follow @FIWARE on Twitter