Walking on AI/ML for Networking
1.
2. Walking on AI/ML for Networking
Oscar Mauricio Caicedo, Ph.D.
Grupo de Ingeniería Telemática
Universidad del Cauca
Colombia
3. Outline
● Unsupervised Learning for Heavy-Hitters Threshold Estimation
● Supervised Learning for Multipath Routing in Software-Defined Data Center Networks
● Reinforcement Learning for Dynamic and Intelligent Routing
● Research Directions
6. ● Duration – time between the first and last packets of the flow
● Size – total number of bytes transmitted in the flow
● Rate – the size divided by the duration
● Burstiness – characterization of bursts in the flow over a time window t
(Duque-Torres et al., 2019)
Heavy-Hitters Classification Schemes
7. ● Outlier and anomaly detection
● Load balancing
● Alleviating network congestion
(Duque-Torres et al., 2019)
Need for Heavy-Hitter Flows Detection
● HH detection is far from obvious
● Threshold estimation is a major challenge
8. (Duque-Torres et al., 2019)
Knowledge Discovery Process
Threshold Estimation based on the Knowledge Discovery Process
9. (Duque-Torres et al., 2019)
● Can the Knowledge Discovery Process (KDP) shed new light on
threshold estimation in HH detection?
● Can we determine an optimal threshold, or set of thresholds, that
would optimally separate flows into HHs and non-HHs under a variety
of network and traffic conditions?
Threshold Estimation based on the Knowledge Discovery Process
10. ● Duration → tortoises and dragonflies
● Size → elephants and mice
● Rate → cheetahs and snails
● Burstiness → porcupines and stingrays
Understanding the Problem
12. Work | Citations | Year | Appeared In | Metric | Threshold Value
Curtis et al. | 187 | 2011 | IEEE INFOCOM | Size | 128 KB, 1 MB and 100 MB
Chiesa et al. | 29 | 2016 | IEEE/ACM Trans. Netw. | Bandwidth | 10%
Moshref et al. | 47 | 2013 | ACM HotSDN | Bandwidth | 10%
Liu et al. | 19 | 2013 | IEEE Commun. Lett. | Rate | 1-10 Mbps
Trestain et al. | 14 | 2013 | IFIP/IEEE IM | Rate | Not specified
Lin et al. | 9 | 2014 | IEEE GLOBECOM | Size | 100 MB
Bi et al. | 8 | 2013 | IEEE GLOBECOM | Size | Adaptive
Understanding the Problem
(Duque-Torres et al., 2019)
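The spread of thresholds in this survey can be made concrete with a small sketch: the same set of flows yields a different heavy-hitter set under each of the size thresholds listed above (128 KB, 1 MB, 100 MB). The flow names and sizes below are invented for illustration.

```python
# Hypothetical flow sizes in bytes (names f1-f4 are made up).
flow_sizes = {"f1": 64_000, "f2": 300_000, "f3": 5_000_000, "f4": 250_000_000}

def heavy_hitters(sizes, threshold):
    """Flag every flow whose total byte count exceeds the threshold."""
    return sorted(f for f, s in sizes.items() if s > threshold)

# The three size thresholds surveyed on this slide produce three
# different heavy-hitter sets for the same traffic.
for thr in (128 * 1024, 1 * 1024**2, 100 * 1024**2):
    print(thr, heavy_hitters(flow_sizes, thr))
```

This is exactly the sensitivity the slide points at: the "heavy hitters" are an artifact of the chosen cut-off, not an intrinsic property of the traffic.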
13. Datasets
Understanding the Data
(Duque-Torres et al., 2019)
Dataset | TCP Count | TCP Average | UDP Count | UDP Average | Total Count | Total Average
UNIV1 | 6 886 055 | 710.74 | 1 453 303 | 390.74 | 9 999 617 | 582.60
UNIV2 | 6 092 | 198.84 | 9 869 978 | 758.24 | 999 956 | 749.34
CAIDA2016 | 9 017 007 | 517.31 | 920 417 | 433.62 | 9 994 197 | 509.13
CAIDA2018 | 8 225 731 | 1 021.20 | 1 618 524 | 674.91 | 9 982 640 | 959
14. Data Preparation
Feature name | Feature description | Direction
min_ps | Packet with minimum size in the flow | F & B
max_ps | Packet with maximum size in the flow | F & B
avg_ps | Average packet size | F & B
std_ps | Standard deviation of packet size | F & B
min_piat | Minimum packet inter-arrival time | F & B
max_piat | Maximum packet inter-arrival time | F & B
avg_piat | Average packet inter-arrival time | F & B
std_piat | Standard deviation of packet inter-arrival time | F & B
octTotCount | Number of transmitted bytes in the flow | F & B
pktTotCount | Number of transmitted packets in the flow | F & B
flowDur | Duration of the flow (in seconds) |
(Duque-Torres et al., 2019)
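As a rough illustration of this preparation step, the per-flow features in the table can be computed from a list of (timestamp, size) packet records. The helper below is a sketch, not the authors' tool; it handles a single direction (the F & B variants would be computed per direction) and assumes the flow has at least two packets.

```python
import statistics

def flow_features(packets):
    """Compute the slide's per-flow features from a list of
    (timestamp_seconds, packet_size_bytes) tuples, one direction only.
    Assumes at least two packets so inter-arrival times exist."""
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    piats = [b - a for a, b in zip(times, times[1:])]  # inter-arrival times
    return {
        "min_ps": min(sizes),
        "max_ps": max(sizes),
        "avg_ps": statistics.mean(sizes),
        "std_ps": statistics.pstdev(sizes),
        "min_piat": min(piats),
        "max_piat": max(piats),
        "avg_piat": statistics.mean(piats),
        "std_piat": statistics.pstdev(piats),
        "octTotCount": sum(sizes),
        "pktTotCount": len(sizes),
        "flowDur": times[-1] - times[0],
    }

# Hypothetical 4-packet flow.
feats = flow_features([(0.0, 60), (0.1, 1500), (0.4, 1500), (1.0, 40)])
```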
15. Data Mining
● K-Means was used due to its simplicity, speed, and accuracy
● The optimal number of clusters was chosen with the Silhouette method
Value | Interpretation
Si ≈ 1 | The samples are very well clustered
0.5 < Si ≤ 0.7 | The samples have a reasonable clustering structure
0.26 < Si ≤ 0.5 | The clustering structure is weak and could be artificial
Si ≈ 0 | The sample lies between two clusters
Si < 0 | The sample is probably placed in the wrong cluster
(Duque-Torres et al., 2019)
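A minimal sketch of this step, using scikit-learn on synthetic 2-D data (the real input is the flow-feature table): run K-Means for several candidate k and keep the k with the best average silhouette. The data below stand in for well-separated "mice" and "elephant" groups.

```python
# Sketch only: pick k by silhouette, as the slide describes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated synthetic groups (120 samples, 2 features).
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(5, 0.3, (20, 2))])

best_k, best_s = None, -1.0
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    s = silhouette_score(X, labels)  # average silhouette over all samples
    if s > best_s:
        best_k, best_s = k, s
```

On such clearly bimodal data the silhouette peaks at k = 2, matching the two-class (HH / non-HH) structure the slides report.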
17. Data Mining
● K-Means along the first two principal components
● k = 2
(Duque-Torres et al., 2019)
Dataset | Class I | Class II | Total
UNIV1 | 293 962 | 8 | 293 970
UNIV2 | 18 056 | 1 | 18 057
CAIDA2016 | 451 696 | 9 | 451 705
CAIDA2018 | 563 869 | 3 | 563 872
18. Data Mining
● K-Means along the first two principal components
● k = 4
(Duque-Torres et al., 2019)
20. Silhouette Analysis - Ambiguous Flows
(Class I, Ambiguous Flows) (Duque-Torres et al., 2019)
Evaluating the Discovered Knowledge
21. Evaluating the Discovered Knowledge
Ambiguous Flows (Class I)
(Duque-Torres et al., 2019)
[Table: per-class, per-dataset statistics (max, min, average) of flow size [MB], number of packets, and flow duration [s]. Flows per class after removing ambiguous flows: Class I — UNIV1 293 002, UNIV2 17 979, CAIDA2016 449 904, CAIDA2018 562 273; Class II — UNIV1 707, UNIV2 36, CAIDA2016 1 147, CAIDA2018 1 199.]
For TCP traffic:
○ Flow size = 6 kB
○ Packet count = 14
22. ● There is no single threshold that can unambiguously classify HH flows
● Network and traffic characteristics are significant
○ Flow size = 6 kB
○ Packet count = 14
(Duque-Torres et al., 2019)
Evaluating the Discovered Knowledge
23. ● Open challenges
○ The time required to collect enough flow details for reliable
classification requires further research
■ Investigate the usefulness of per-flow packet size distributions
for accurate classification
■ Develop a threshold-less approach for HH classification
Evaluating the Discovered Knowledge
25. Supervised Learning
Learning approach | Training dataset | Problems aimed at
Supervised | Labeled | Classification and regression
Semi-supervised | Incompletely labeled | Classification and regression
(Boutaba et al., 2018)
26. ● Data Center Networks (DCNs)
○ Significant bandwidth capacity
○ Dynamic traffic
○ Coexistence of many kinds of flows
■ Small, short-lived flows → mice
■ Few large, long-lived flows → elephants
(Dixit et al., 2013) (Jain et al., 2009)
Multipath Routing in Software-Defined Data Center Networks
27. ● Equal-Cost MultiPath (ECMP)
○ The most prevalent multipath routing mechanism in DCNs
○ Each switch splits flows over its output ports by applying a
function to packet headers
■ E.g., a hash function on the 5-tuple header
(Dixit et al., 2013) (Jain et al., 2009)
Multipath Routing in Software-Defined Data Center Networks
28. ● ECMP causes hot-spots in DCNs → low throughput and high delay
[Diagram: flows A, B, C, and D arrive at an input port; the hash function
sends A and D to Output 1 and B and C to Output 2. Because A + D << B + C,
Output 2 becomes a hot-spot.]
(Dixit et al., 2013) (Jain et al., 2009)
Multipath Routing in Software-Defined Data Center Networks
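The hot-spot mechanism can be sketched in a few lines: a hash of the 5-tuple fixes each flow's port regardless of its rate, so per-port load depends only on which flows happen to collide. The flows and rates below are hypothetical.

```python
import hashlib

def ecmp_port(five_tuple, n_ports=2):
    """Pick an output port by hashing the flow 5-tuple, as an ECMP switch
    does. Every packet of a flow gets the same port, and flow sizes are
    ignored: two elephants can land on one port while the other sits idle."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_ports

# Hypothetical flows A-D with very different demands (Mb/s).
flows = {
    ("10.0.0.1", "10.0.1.1", 40001, 80, 6): 1,    # A (mouse)
    ("10.0.0.2", "10.0.1.2", 40002, 80, 6): 800,  # B (elephant)
    ("10.0.0.3", "10.0.1.3", 40003, 80, 6): 700,  # C (elephant)
    ("10.0.0.4", "10.0.1.4", 40004, 80, 6): 2,    # D (mouse)
}
load = [0, 0]
for ft, rate in flows.items():
    load[ecmp_port(ft)] += rate  # load split is blind to rates
```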
29. ● Software-Defined Networking (SDN) for addressing ECMP limitations
○ A global view of the network
○ A logically centralized controller to dynamically make and install
routing decisions
● A DCN using SDN → SDDCN
(Boutaba et al., 2018) (Tang et al., 2017) (Chao et al., 2016) (Poupart et al., 2016) (Curtis et al., 2011a-b) (Al-Fares et al., 2010)
Multipath Routing in Software-Defined Data Center Networks
30. ● Reactive SDN-based multipath routing
○ Dynamic rescheduling of elephant flows
○ Mouse flows are handled by default routing (e.g., ECMP)
○ Flows are discriminated using static thresholds
○ Hot-spots may occur before elephants are detected
(Boutaba et al., 2018) (Tang et al., 2017) (Chao et al., 2016) (Poupart et al., 2016) (Curtis et al., 2011a-b) (Al-Fares et al., 2010)
Multipath Routing in Software-Defined Data Center Networks
31. ● Proactive SDN/ML-based multipath routing
○ ML-based methods at the controller for proactively identifying
elephants
○ Central collection
■ Per-flow data
■ Sampling-based data
(Boutaba et al., 2018) (Tang et al., 2017) (Chao et al., 2016) (Poupart et al., 2016) (Curtis et al., 2011a-b) (Al-Fares et al., 2010)
Multipath Routing in Software-Defined Data Center Networks
32. ● Proactive SDN/ML-based multipath routing - controller side
○ Central collection
■ Per-flow data
✗ Heavy traffic overhead and poor scalability
■ Sampling-based data
✗ Delayed and inaccurate flow information
(Boutaba et al., 2018) (Tang et al., 2017) (Chao et al., 2016) (Poupart et al., 2016) (Curtis et al., 2011a-b) (Al-Fares et al., 2010)
Multipath Routing in Software-Defined Data Center Networks
33. NELLY (Network Elephant Learner and anaLYzer)
● Incremental learning at the server side
● Proactive, accurate, and timely elephant detection
● Adaptation to varying traffic conditions
● Low control overhead
● Constantly updated model with limited resources
(Estrada-Solano et al., 2019)
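The "constantly updated model" idea can be illustrated with generic online learning: the classifier is updated one labeled flow at a time instead of being retrained in batch, so it can track traffic drift with bounded resources. This is a sketch only; scikit-learn's `SGDClassifier.partial_fit` is an illustrative stand-in, not necessarily the learner NELLY uses, and the "flow features" are synthetic.

```python
# Incremental (online) learning sketch in the spirit of NELLY's
# server-side classifier. Labels: 0 = mouse, 1 = elephant.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])

# Stream of labeled flows: synthetic 4-dim features whose mean shifts
# with the class, updated one sample at a time via partial_fit.
for _ in range(300):
    label = int(rng.integers(0, 2))
    x = rng.normal(loc=4.0 * label, scale=1.0, size=(1, 4))
    clf.partial_fit(x, [label], classes=classes)

# Evaluate on fresh synthetic samples from each class.
X_test = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(4, 1, (50, 4))])
y_test = np.array([0] * 50 + [1] * 50)
acc = clf.score(X_test, y_test)
```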
34. (Estrada-Solano et al., 2019)
NELLY (Network Elephant Learner and anaLYzer)
Architecture - High level view
37. NELLY (Network Elephant Learner and anaLYzer)
Datasets
Packet traces | UNI1 | UNI2
Duration | 65 min | 158 min
Packets | 19.85 M | 100 M
IPv4 % of total traffic | 98.98% (mostly TCP) | 99.9% (mostly UDP)
IPv4 flows | 1.02 M (TCP and UDP) | 1.04 M (mostly UDP)
(Estrada-Solano et al., 2019)
41. NELLY (Network Elephant Learner and anaLYzer)
Comparison
(Estrada-Solano et al., 2019)
Work | Learning approach | Elephant detection | False elephants | Table occupancy | Control overhead | Detection time | Network modifications*
Poupart et al., 2016 | Incremental | Very high | Very low | Very high | High | Very short | None
Tang et al., 2017 | Batch | High | Very low | Medium | Medium | Medium | Hardware of switches
Chao et al., 2016 | Batch and incremental | Very high | High | Low | Medium | Medium | None
Curtis et al., 2011 | None | Perfect | None | None | Very low | Long | Software in servers
NELLY | Incremental | Very high | Very low | None | Very low | Very short | Software in servers
* Assuming OpenFlow-based networks
42. ● Open challenges
○ Implementing NELLY as in-kernel software
○ Evaluating the cost impact on servers
○ Processing and memory requirements
○ Threshold definition
NELLY (Network Elephant Learner and anaLYzer)
(Estrada-Solano et al., 2019)
43. ● Open challenges
○ ML-based elephant rescheduler at the controller
■ Predict the rate and duration of elephants
■ Regression or multi-class classification?
■ How to define the classes?
○ Reschedule elephants based on predictions
○ Algorithm for allocating elephants to links
NELLY (Network Elephant Learner and anaLYzer)
46. ● Network traffic routing is fundamental in networking
○ Selecting a path for packet transmission
● Selection criteria
○ Cost minimization
○ Maximization of link utilization
○ Quality of Service (QoS) provisioning
Network Routing Problem
47. ● The efficient selection of the best path through a network is still
an important issue to be addressed and improved
STATIC and DYNAMIC routing
● Dynamic routing
○ Distance vector protocols
○ Link state protocols
Network Routing Problem
48. ● Distance vector protocols
○ Routing information is not verified by the router itself (routing by rumor)
○ Delays in propagating routing information
○ Network overhead
● Link state protocols
○ Require more memory and processing power
○ Understanding their operation is critical
Network Routing Problem
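For contrast with distance vector, link-state routing lets every router compute paths itself from a full copy of the topology, typically with Dijkstra's shortest-path algorithm. A standard sketch over a hypothetical four-router topology:

```python
import heapq

def dijkstra(graph, src):
    """Link-state style path computation: knowing the whole topology,
    a router runs Dijkstra locally. `graph` maps node -> {neighbor: cost};
    returns the cost of the shortest path from src to every node."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

# Hypothetical 4-router topology with symmetric link costs.
topo = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
```

The memory and CPU cost of storing the topology and rerunning this computation on every link-state update is exactly the overhead the bullet above refers to.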
49. ● Dynamic routing
○ More complexity
○ Overhead on the network
○ Very rigid
○ Doesn’t learn from past decisions
Network Routing Problem
53. Routing based on Knowledge-Defined Networking
(Casas, 2019)
Q-learning
[Figures: architecture and operation]
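A minimal sketch of the Q-learning idea behind this slide (not the demo's implementation): states are routers, actions are next hops, the reward is the negative link delay, and after training the greedy policy follows the lowest-delay path. Topology and delays are invented for illustration.

```python
# Q-learning routing sketch: learn next-hop choices toward DEST.
import random

random.seed(0)
links = {  # node -> {next_hop: link delay}
    "A": {"B": 1.0, "C": 4.0},
    "B": {"C": 2.0, "D": 5.0},
    "C": {"D": 1.0},
}
DEST = "D"
alpha, gamma, eps = 0.5, 0.9, 0.2
Q = {s: {a: 0.0 for a in nbrs} for s, nbrs in links.items()}

for _ in range(2000):  # training episodes from source A
    s = "A"
    while s != DEST:
        acts = list(Q[s])
        # epsilon-greedy action selection
        a = random.choice(acts) if random.random() < eps else max(acts, key=Q[s].get)
        r = -links[s][a]  # reward = negative delay of the traversed link
        nxt = 0.0 if a == DEST else max(Q[a].values())
        Q[s][a] += alpha * (r + gamma * nxt - Q[s][a])
        s = a

# Greedy (learned) path from A to D.
path = ["A"]
while path[-1] != DEST:
    path.append(max(Q[path[-1]], key=Q[path[-1]].get))
```

Here the agent learns to prefer A→B→C→D (total delay 4) over A→C→D (delay 5), without ever being given the topology explicitly; it only observes per-link rewards.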
54. Routing based on Knowledge-Defined Networking
● Open challenges
○ Convergence time
○ Testing in real, large networks
○ Multi-controller scenarios
○ RL-based probing for statistics collection
○ Collaborative learning for considering sub-networks?
Demo
56. 5G Scaling Problem
● Mobile traffic is expected to grow exponentially
● The complexity, heterogeneity and scale of networks have grown
far beyond the limits of manual administration
● NFV is a 5G foundation
● Scaling VNFs pertaining to Network Services and Network Slices
is a big challenge
57. ● Scaling in NFV reduces cost compared to over-provisioning
capacity for peak workloads
● Horizontal scaling dynamically increases/reduces the number of
VNF instances
● Vertical scaling increases/reduces the resources allocated to one
or more VNFs
● Most existing solutions scale NFs assuming they operate in isolation
5G Scaling Problem
58. 5G Scaling Problem
Solutions focused on scaling NFs individually (Demo)
Work | Title | Technique
Carella et al. (2016) | An extensible autoscaling engine for software-based network functions | Threshold rules
Dutta et al. (2016) | QoE-aware elasticity support in cloud-native 5G systems | Threshold rules
Bilal et al. (2016) | Dynamic cloud resource scheduling in virtualized 5G mobile systems | Time series
Alawe et al. (2018) | Improving traffic forecasting for 5G core network scalability: A machine learning approach | Neural networks
Tang et al. (2015) | Efficient auto-scaling approach in the telco cloud using self-learning algorithm | Q-Learning
Alawe et al. (2018) | On the scalability of 5G core network | Control theory
59. 5G Scaling Problem
Solutions that consider scaling NFs as part of compound network services
Work | Title | Observation
Wang et al. (2016) | Online VNF scaling in data centers | Assumes that all NFs must be scaled
Zhang et al. (2016) | Co-scaler: cooperative scaling of software-defined VNF service function chain | Focuses on flow migration and scales all NFs
Rankothge et al. (2017) | Optimizing resource allocation for virtualized network functions in a cloud center using genetic algorithms | Scales only one NF, chosen randomly
Prados-Garzon et al. (2018) | A queuing based dynamic auto scaling algorithm for the LTE EPC control plane | Specific to the LTE vEPC
63. Probing Problem
[Diagram: a service provider monitors the data links between commercial
offices and a data processing center via monitoring links; performance
degradation (e.g., delays, congestion, and loss) appears on the data links.]
● Monitoring performance parameters of the network, such as link usage,
packet loss, and delay, to assure the QoS
● "Accurate and timely statistics"
● The service provider needs an up-to-date view of the network