Machine Learning with Cosmos and Spark - 28 October 2020
Corresponding webinar recording: https://youtu.be/isugbtZWU4I
This webinar presents an introduction to data engineering with FIWARE using Apache Spark ready for big data deployments. You will learn how to perform real-time predictions step-by-step through a real use case.
Chapter: Processing
Difficulty: 3
Audience: Core
Speakers: Sonsoles López, Andrés Muñoz and Joaquín Salvachúa (Universidad Politécnica de Madrid - UPM)
FIWARE Wednesday Webinars - Machine Learning with Cosmos and Spark
1. Machine Learning with Cosmos and Spark
Joaquín Salvachúa (joaquin.salvachua@upm.es)
Andrés Muñoz (joseandres.munoz@upm.es)
Sonsoles López (sonsoles.lopez.pernas@upm.es)
Gabriel Huecas (gabriel.huecas@upm.es)
Universidad Politécnica de Madrid
@jsalvachua, @anmunozx, @sonsoleslp, @ghuecas, @FIWARE
2. The data pyramid
Grazzini, Jacopo & Pantisano, Francesco. (2015). Guidelines for scientific evidence provision for policy support based on Big Data and open technologies. 10.2788/329540.
19. Spark Scheduler
● Dryad-like DAGs
● Pipelines functions within a stage
● Cache-aware work reuse & locality
● Partitioning-aware to avoid shuffles
[Figure: DAG of datasets A to G grouped into Stage 1, Stage 2 and Stage 3 and connected by join, union, groupBy and map operations; the legend marks cached data partitions.]
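The idea of pipelining functions within a stage can be sketched in plain Python. This is a simplified model, not Spark's scheduler: the function `split_into_stages` and the `WIDE` set are hypothetical, the chain of operations is linear rather than a full DAG, and `groupBy` and `join` are assumed to always need a shuffle (a partitioning-aware scheduler can skip it when the inputs are already partitioned the right way).

```python
# Hypothetical, simplified model of grouping operations into stages.
# Operations assumed to require a shuffle of their input.
WIDE = {"groupBy", "join"}


def split_into_stages(ops):
    """Group a linear chain of operation names into stages.

    Consecutive non-shuffling operations are pipelined in one stage;
    each shuffling operation opens a new stage.
    """
    stages = [[]]
    for op in ops:
        if op in WIDE and stages[-1]:
            stages.append([])  # shuffle boundary: start a new stage
        stages[-1].append(op)
    return stages


stages = split_into_stages(["map", "filter", "groupBy", "map", "join"])
for number, stage in enumerate(stages, start=1):
    print("Stage", number, "->", " + ".join(stage))
```

With this input the sketch yields three stages: `map` and `filter` run together, `groupBy` starts the second stage and pipelines the following `map`, and `join` starts the third.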
20. ML Standard Solution
● Each problem requires an analysis of which ML algorithm suits our data, so there can never be a fully standard solution, only one that covers most cases.
● Next, the training dataset needs to be set up (self-learning is even suitable for some cases).
● Each problem may be slightly different (“same same but different”).
● We can provide solutions for some cases and use a proper dataset (some of the available anonymized datasets are not suitable for ML algorithms).
● The tool to use (Spark, Flink, TensorFlow) depends on the chosen ML algorithm.
21. Machine Learning Algorithms
● Each application field may require a different algorithm
● Some solutions have high algorithmic complexity
● Some can be parallelized in a cluster (SparkML)
● Others can use GPUs (TensorFlow, for example)
● Even if each case is different, we try to set up a generic life cycle.
https://www.techleer.com/articles/203-machine-learning-algorithm-backbone-of-emerging-technologies/
23. Decision Trees
● A decision tree is just what it says…
● A tree that is used to make decisions
● Kind of like a flow chart
● Each node is a test condition
● Each branch is an outcome of the test at the node it leaves
● Leaf nodes contain the final decision
● Simple, simple, simple, …
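The structure described above (test nodes, outcome branches, decision leaves) can be sketched in a few lines. The tree below is written by hand for illustration: the thresholds, the labels and the `decide` helper are hypothetical and not learned from any data.

```python
# Hypothetical hand-written tree; thresholds and labels are illustrative only.
# Inner nodes are dicts holding a test and one branch per outcome;
# leaves are plain strings holding the final decision.
tree = {
    "test": lambda flight: flight["DepDelay"] > 15,
    "yes": {
        "test": lambda flight: flight["DepDelay"] > 60,
        "yes": "very late",
        "no": "late",
    },
    "no": "on time",
}


def decide(node, sample):
    """Walk from the root to a leaf, following the outcome of each test."""
    while isinstance(node, dict):
        node = node["yes"] if node["test"](sample) else node["no"]
    return node


print(decide(tree, {"DepDelay": 30}))
```

A trained decision tree has the same shape; the training algorithm chooses the tests and thresholds instead of a person.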
24. Random Forest
● Random forest (RF) is a generalization of a decision tree
● A decision tree is really, really simple
● Very intuitive and can be highly useful
● So, why do we need to generalize?
● Decision trees tend to overfit data
● Random forest mitigates this problem
● But it loses some of the intuitive simplicity
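A minimal sketch of how a forest combines trees: each tree votes and the majority class wins. The three "trees" here are hypothetical hand-written functions with made-up thresholds; a real random forest trains each tree on a random sample of the rows and features, which is what makes the trees differ.

```python
from collections import Counter


# Three hypothetical, deliberately different "trees" as plain functions.
def tree_a(flight):
    return "late" if flight["DepDelay"] > 15 else "on time"


def tree_b(flight):
    return "late" if flight["DepDelay"] > 30 else "on time"


def tree_c(flight):
    return "late" if flight["DepDelay"] > 10 else "on time"


def forest_predict(trees, sample):
    """Return the class predicted by the majority of the trees."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]


forest = [tree_a, tree_b, tree_c]
print(forest_predict(forest, {"DepDelay": 20}))
```

For a departure delay of 20 minutes, two of the three trees vote "late", so the forest answers "late" even though one tree disagrees. Averaging out individual trees in this way is also why the result is harder to read than a single flow chart.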
28. Generic Enablers needed
● FIWARE Orion Context Broker: Orion Context Broker allows you to manage the entire lifecycle of context information, including updates, queries, registrations and subscriptions. It is an NGSIv2 server implementation to manage context information and its availability.
● FIWARE Cosmos: The Cosmos Generic Enabler enables easier Big Data analysis over context, integrated with some of the most popular Big Data platforms. It provides a connector that allows sending and receiving NGSI events to and from the Context Broker.
● FIWARE Draco: The Draco Generic Enabler takes care of data ingestion and persistence. It is an easy-to-use, powerful and reliable system for processing and distributing data. Internally, Draco is based on Apache NiFi.
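As an illustration of the subscriptions mentioned above, the sketch below builds the JSON body of an NGSIv2 subscription, which would be sent with a POST to the Context Broker's `/v2/subscriptions` endpoint. The entity type `Flight`, the attribute names and the notification URL are assumptions made for this example; the slides do not state which ones the webinar uses.

```python
import json


def build_subscription(entity_type, notify_url, attrs):
    """Build an NGSIv2 subscription body for all entities of one type."""
    return {
        "description": "Notify the processing job of " + entity_type + " changes",
        "subject": {
            # Match every entity id of the given type.
            "entities": [{"idPattern": ".*", "type": entity_type}],
            # Trigger a notification when any of these attributes changes.
            "condition": {"attrs": attrs},
        },
        "notification": {
            "http": {"url": notify_url},
            "attrs": attrs,
        },
    }


# Hypothetical entity type, endpoint and attributes.
payload = build_subscription(
    "Flight", "http://spark-worker:9001", ["DepDelay", "Carrier"]
)
body = json.dumps(payload, indent=2)
print(body)
```

Only the payload is built here, so the sketch runs without a Context Broker; sending it is a single HTTP POST with a JSON content type.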
30. Step 1: Getting the dataset
Dataset
● The dataset contains data on 90-95% of the flights (457,013 flights) originating in the USA in 2015, as published by the Bureau of Transportation Statistics.
● Some relevant fields:
● FlightDate: Flight date
● Carrier: Airline
● FlightNum: Flight number
● Origin: Airport of origin
● Dest: Destination airport
● DepDelay: Initial departure delay
● ArrDelay: Arrival delay
2015,1,1,1,4,2015-01-01,"AA",19805,"AA","N787AA","1",12478,...,31703,"JFK", ...
2015,1,1,2,5,2015-01-02,"AA",19805,"AA","N795AA","1",12478,...,31703,"JFK", ...
2015,1,1,3,6,2015-01-03,"AA",19805,"AA","N788AA","1",12478,...,31703,"JFK", ...
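A minimal sketch of reading such a row with Python's `csv` module. Only the leading fields shown in the first sample row are parsed, because the rest of the row is elided above. The column positions in `ASSUMED_COLUMNS` are an assumption inferred from the visible values, not taken from the dataset's header.

```python
import csv
import io

# Leading fields of the first sample row; the elided remainder is left out.
sample = '2015,1,1,1,4,2015-01-01,"AA",19805,"AA","N787AA","1",12478'

# Assumed positions of three of the fields listed above (not from a header).
ASSUMED_COLUMNS = {"FlightDate": 5, "Carrier": 8, "FlightNum": 10}


def parse_prefix(line):
    """Parse one CSV line and pick out the assumed columns by position."""
    row = next(csv.reader(io.StringIO(line)))
    return {name: row[index] for name, index in ASSUMED_COLUMNS.items()}


flight = parse_prefix(sample)
print(flight)
```

The `csv` module strips the quotes around text fields, so the carrier comes back as `AA` rather than `"AA"`. All values are still strings at this point and would need converting before training.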
31. Step 2: Training our model
[Diagram: a training job feeds FlightDate, Carrier, month, FlightNum, Origin, Dest, DepDelay, ... and ArrDelay into the Random Forest Classifier algorithm, which produces the trained predictive model.]
All the available algorithms:
https://spark.apache.org/docs/latest/ml-classification-regression.html
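A classifier predicts discrete classes and works on numeric features, so a continuous value such as ArrDelay would typically be bucketed into classes and text fields such as Carrier turned into numeric indices before training. The plain-Python sketch below shows both preparation steps. The bucket boundaries in `ASSUMED_SPLITS` and both helper functions are hypothetical; the slides do not say how the webinar's training job prepares its data. Spark ML ships `Bucketizer` and `StringIndexer` transformers for the same purpose.

```python
from bisect import bisect_right

# Hypothetical class boundaries in minutes; the slides do not state any.
ASSUMED_SPLITS = [-15.0, 0.0, 30.0]


def delay_class(arr_delay):
    """Return the index of the bucket that contains arr_delay.

    With three boundaries there are four classes, numbered 0 to 3.
    A value equal to a boundary falls into the bucket above it.
    """
    return bisect_right(ASSUMED_SPLITS, arr_delay)


def index_categories(values):
    """Map each distinct string to an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping))
    return mapping


carrier_index = index_categories(["AA", "DL", "AA", "UA"])
print(carrier_index)
print([delay_class(d) for d in (-20, 10, 45)])
```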
32. Step 3: Using our model to predict the flight arrival delay
[Diagram: a prediction job feeds FlightDate, Carrier, month, FlightNum, Origin, Dest, DepDelay, ... into the trained predictive model, which outputs the predicted delay (ArrDelay).]
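Assuming the prediction job receives its input as NGSIv2 notifications from the Context Broker, the sketch below flattens one notification into the feature dictionary a model would consume. The notification follows the NGSIv2 normalized format, but the subscription id, the entity id, the entity type `Flight` and all values are made up for illustration.

```python
import json

# Hypothetical notification in NGSIv2 normalized format.
notification = json.loads("""
{
  "subscriptionId": "hypothetical-subscription-id",
  "data": [
    {
      "id": "hypothetical-flight-1",
      "type": "Flight",
      "Carrier": {"type": "Text", "value": "AA", "metadata": {}},
      "Origin": {"type": "Text", "value": "JFK", "metadata": {}},
      "DepDelay": {"type": "Number", "value": 12, "metadata": {}}
    }
  ]
}
""")


def to_features(entity):
    """Keep only attribute values, dropping the entity id and type."""
    return {
        name: attribute["value"]
        for name, attribute in entity.items()
        if name not in ("id", "type")
    }


features = [to_features(entity) for entity in notification["data"]]
print(features)
```

Each resulting dictionary holds one flight's input fields and would still need the same encoding that was applied to the training data before being passed to the trained model.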
33. Step 4: User interface (web application)
Web server