Data Science for Internet of Things
Case Study: Smart Home Devices
Olivera Kotevska, PhD
kolivera@ieee.org
Goal of this tutorial
 Present what the Internet of Things (IoT) is
 Ecosystem, protocols, architectures, and challenges
 Present some of the IoT vulnerabilities
 Vulnerabilities of IoT: security and privacy aspects
 Overview of some of the methods to overcome these vulnerabilities
 Use a practical example of a Smart Home network
 Present a set of methods for analyzing IoT network behavior
Outline
 Theoretical part
 IoT systems
 Challenges
 Ways to overcome some of those challenges
 Role of data science in IoT network traffic - Approaches: Pattern detection,
Classification
 Practical part - Use case: Smart Home network
 Setting up the environment using R and RStudio
 Set of data science algorithms for IoT network analysis using R and RStudio
 Takeaways
 Few things to remember
 References
Internet of Things (IoT) – history and definition
 First mentioned in 1999 by Kevin Ashton during his work at
Procter & Gamble.
 IoT is a network of things that enables these things to
connect and exchange data, resulting in efficiency
improvements.
 "Things" refers to a wide variety of objects: devices, systems,
or humans.
 What all IoT things have in common is the ability to
connect to the Internet.
 IoT has been called the Third Wave in the information industry.
https://appsec-labs.com/iot-security/
Areas of IoT
 General classification areas based on target
audience (Individuals, Society, Industry):
 Smart Buildings, Office and Homes
 Smart Energy meters
 Smart Agriculture and Water monitoring
 Water levels
 Smart Manufacturing and Industry
 Industrial asset monitoring
 Smart Mobility and Transport
 Smart Health and Medicine
 Fitness trackers
 Smart Home
 Temperature adjustments
https://brewcitypc.com/home-services/
http://www.exchangecommunications.co.uk
Tenzin, S. et al. (2017) KST, 172-177.
https://www.i-scoop.eu/
The IoT Comic Book is a result of the
EU Internet of Things Initiative.
https://iotcomicbook.org/
Growing number of IoT devices
 General goals of IoT are to:
 Maximize health and safety
 Maximize convenience while minimizing the
amount of work
 Minimize the costs
“The number of IoT devices increased 31% year-over-year to 8.4 billion in
2017 and it is estimated that there will be 30 billion devices by 2020.” 1
1 Nordrum, Amy (18 August 2016). "Popular Internet of Things Forecast of 50 Billion Devices by 2020 Is Outdated". IEEE.
Understanding the IoT eco system
 The overall task of IoT is:
 IoT devices collect information from the
environment
 Knowledge is extracted from the raw data
 The data is made ready for transfer to other
objects, devices, or servers through the Internet
 IoT includes four main components: sensors,
networks, data analysis, and system
monitoring.
A high-level system model of IoT
Ammar, M., Russello, G., & Crispo, B. (2018). Internet of Things: A survey on the security of IoT frameworks. Journal of Information Security and Applications, 38, 8-27.
1. IoT Sensors
 Connectivity capabilities
 Some reach the outside world (e.g. the cloud) directly, e.g. a
phone
 Others must connect to a hub or gateway first, e.g. a
smart camera
 Processing capabilities
 Processing on the sensor
 Processing on the hub or cloud
 Combination between both
Gope, P., & Hwang, T. (2016). IEEE Sensors Journal, 16,
1368-1376.
Motion sensor Door & window sensor Raspberry Pi
2. IoT Network
 IoT communication protocols
 Device to Device
 Device to Server
 Server to Server
 Networks characteristics
 All network topologies are supported –
star, mesh, device-to-device
 Wide range of verticals with specific
network requirements
 Wide variety of devices of different
capabilities
Al-Fuqaha, A., Guizani, M., Mohammadi, M., Aledhari, M., & Ayyash, M. (2015). Internet of things: A survey on enabling technologies, protocols, and applications. IEEE Communications Surveys & Tutorials, 17(4), 2347-2376.
IoT Network
 Integration of multiple network
standards and protocols
 Short range networks
 RFID, NFC, Bluetooth, Ant, EnOcean, Z-Wave,
Insteon, ZigBee, MiWi, DigiMesh,
WirelessHART, Thread, 6LoWPAN, Wi-Fi
 Long range networks
 LoRaWAN, Symphony Link, Weightless,
SIGFOX, DASH7
[Figure labels: Bluetooth, Wi-Fi, ZigBee, LoRaWAN]
https://www.nist.gov/blogs/taking-measure/cybersecuring-internet-things
IoT vs Non-IoT network traffic
IoT network traffic
 Environmental characteristics
 heterogeneous, nonlinear, distributed
 Data characteristics
 volume, variety, variability, data types
 Traffic patterns change more frequently
“The availability of IoT traffic not only creates new application opportunities such as remote
camera monitoring, but it changes the distribution of Internet traffic e.g. recent study shows
that video streaming via Netflix accounts for 32.7% of peak downstream traffic in US.” 1
Non-IoT network traffic
 Environmental characteristics
 homogeneous
 Data characteristics
 volume
 Traffic patterns are more consistent
1 https://www.cnn.com/2011/10/27/tech/web/netflix-internet-bandwith-mashable/index.html
3. IoT Analysis
 Refers to analyzing and examining the data obtained by the IoT
 Key components for collecting IoT data: sensors, network end devices,
and other data storing and transmitting equipment
 It is a service that runs and operationalizes sophisticated analytics on
massive volumes of IoT data.
3. IoT Analysis - Traffic characteristics
 Transmission characteristics
 At any time of the day
 From locations not accessible to humans
 May be coordinated, synchronized
 Periodic or event-driven
 Real time and non-real time
 Sleep time
 Short and small packets
 Properties
 Network size & Message size
 Traffic rate & interval
 Demands on QoS
 Energy source
https://www.ericsson.com/en/mobility-report/massive-iot-in-the-city
4. IoT system monitoring – life cycle
[Figure: IoT life cycle with the stages Sense, Communicate, Analyze (Store, Process), Visualize, and Act]
... but there is a dark side – some of IoT challenges
 Hardware
 Cost of devices, battery life, physical specifications
 Communication and networking
 Coverage - High complexity of distributed computing
 Scalability and diversity - Various communication protocols
 Reliability
 Software
 Interoperability, data processing, context awareness
 Security
 C.I.A. – Confidentiality, Integrity, and Availability
 Attack resistance
Zhang, Z. K. et al.(2014, November). IoT security: ongoing challenges and research opportunities. In Service-Oriented Computing and Applications (SOCA), 2014 IEEE 7th International Conference on (pp. 230-234). IEEE.
IoT Security risks
 The number of IoT devices and gadgets grows each day, and the security risks and
potential difficulties grow along with it.
 Potential vulnerabilities are:
 User based
o Node capture and eavesdropping
o Controlling the data
o Access attacks and privacy attacks
 System based
o Distribution and denial-of-service attacks
o Complexity of vulnerability
o Bandwidth constraints
Mendez, D. M., Papapanagiotou, I., & Yang, B. (2017). Internet of things: Survey on security and privacy. arXiv:1707.01879.
Famous cybersecurity attacks
 Massive target hack and the security breach at Equifax, which compromised sensitive
personal information. (2017)
 Mirai Botnet (aka Dyn Attack) – infected computers continually search the internet for
vulnerable IoT devices and then use known default usernames and passwords to log in,
infecting them with malware. These devices were things like digital cameras and DVR
players. (2016)
 Other attacks related to health and human well-being are:
 In Finland cybercriminals shut down the heating of two buildings in the city. (2016)
 FDA confirmed that St. Jude Medical’s implantable cardiac devices have
vulnerabilities that could allow a hacker to access a device. (2017)
IoT security architecture
Jing, Q., Vasilakos, A. V., Wan, J., Lu, J., & Qiu, D. (2014). Security of the Internet of Things: perspectives and challenges. Wireless Networks, 20(8), 2481-2501.
Application layer characteristics
 Application layer: sets and ensures common ground for communication
between applications
 Users can access the IoT through the application layer interface using
a TV, PC, mobile equipment, and so forth
 It provides personalized services according to the needs of the users
 Can be thought of as the layer that users interact with and the one which
defines how process-to-process communications take place
Protocols in different layers
TCP/IP | OSI Model | Protocols
Application layer | Application layer | DNS, DHCP, FTP, HTTP(S), IMAP, NTP, POP3, SMTP, SNMP, TFTP, Telnet, RTP, RTSP, CoAP, MQTT, XMPP, AMQP
Application layer | Presentation layer | JPEG, MIDI, MPEG, PICT, TIFF
Application layer | Session layer | NetBIOS, NFS, PAP, SCP, SQL, ZIP
Transport layer | Transport layer | TCP, UDP
Internet layer | Network layer | ICMP, IGMP, IPsec, IPv4/IPv6, IPX, RIP
Link layer | Data Link layer | ARP, ATM, CDP, FDDI, Frame Relay, HDLC, MPLS, PPP, STP, Token Ring
Link layer | Physical layer | Bluetooth, Ethernet, DSL, ISDN, 802.11 Wi-Fi
Dizdarevic, J. et. al. (2018). Survey of Communication Protocols for Internet-of-Things and Related Challenges of Fog and Cloud Computing Integration. arXiv preprint arXiv:1804.01747.
Protocols are required in order to identify the spoken language of
the IoT devices
IoT data characteristics and challenges
 Data collection challenges
 Smart devices generate data in a continuous manner
 Data generation varies for different devices – processing data
with different generation rates is a challenge
 Heterogeneous sources
 Dynamic nature of data – moving sensors like cars
 Quality of collected data
 Error in measurements or precision of data collection
 Devices’ noise in the environment
 Discrete observation and measurement
 Trustworthy sources
Mahdavinejad, M. S. et al. (2017). Machine
learning for Internet of Things data analysis: A
survey. Digital Communications and Networks.
Analytical capabilities
 Intelligent processing and analysis of the data is the key to developing smart IoT
applications.
 Classification of analytical capabilities consisting of five categories:
 Descriptive – What happened?
 Diagnostic – Why something has happened?
 Discovery – What happened that we don’t know about?
 Predictive – What is likely to happen?
 Prescriptive – What if?
 The need to deal with network traffic patterns, large datasets and
multidimensional spaces of flow and packet attributes is one of the reasons for
using ML in this field.
Siow, E., Tiropanis, T., & Hall, W. (2018). Analytics for the Internet of Things: A Survey. ACM Computing Surveys (CSUR), 51(4), 74.
What is Data Science (DS)?
 DS uses scientific methods, processes, algorithms and systems to extract
knowledge and insights from data
 DS employs techniques and theories from many fields including machine learning (ML)
 ML is the process of finding and describing structural patterns in a supplied dataset
 ML takes input in the form of a dataset of instances
 Each instance is characterized by the values of its features
 Different styles of learning
 Descriptive
 Discovery
 Diagnostic
 Predictive
ML approach to IoT traffic analysis (1/2)
 Essential concepts for determining right algorithms to use
1. IoT application
 Privacy of collected data
 Security parameters such as network security, and data encryption
2. IoT data characteristics
 Volume, velocity, varieties of data, quality of data
3. IoT data analytics algorithms - data-driven vision of ML algorithms
 Analyze the data from variety of sources in real time
 Algorithms that tolerate noisy data
 Algorithms that can work with small amount of labeled data
Xiao, L., Wan, X., Lu, X., Zhang, Y., & Wu, D. (2018). IoT Security Techniques Based on Machine Learning. arXiv preprint arXiv:1801.06275.
Mahdavinejad, M. S., Rezvan, M., Barekatain, M., Adibi, P., Barnaghi, P., & Sheth, A. P. (2017). Machine learning for Internet of Things data analysis: A survey. Digital Communications and Networks.
ML approach to IoT traffic analysis (2/2)
 Traffic measurements in operational networks help to
 understand traffic characteristics in deployed networks
 develop traffic models
 evaluate performance of protocols and applications
 Traffic analysis
 provides information about the user behavior patterns
 enables network operators to understand the behavior of network users
 Traffic prediction
 important to assess future network capacity requirements and to plan future network
developments
Most used ML approaches for IoT traffic analysis
 ML's biggest strength in security is learning what is "baseline" or
"normal" for a system, and then flagging anything unusual for human review.
 To choose the right approach for data analysis, it is necessary to
determine which of the following tasks applies:
 Understanding the data / Pattern detection
 Dynamic over time
 Communication among devices
 Traffic similarity between devices
 Identify the features that characterize the importance of the specific device traffic
 Destination IP, Number of packets
 Classify each device and device type
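The "baseline" idea above can be illustrated with a minimal sketch. It assumes that per-minute packet counts for one device are already available; the counts, the threshold of three standard deviations, and the helper name flag_unusual are illustrative assumptions and do not come from the tutorial dataset.

```r
# Hypothetical per-minute packet counts observed during normal operation
baseline <- c(12, 15, 11, 14, 13, 12, 16, 14, 13, 15)

# Learn what "normal" looks like: mean and standard deviation of the baseline
base_mean <- mean(baseline)
base_sd <- sd(baseline)

# Flag observations that deviate from the baseline mean
# by more than k standard deviations (k = 3 is an assumed threshold)
flag_unusual <- function(x, center, spread, k = 3) {
  abs(x - center) > k * spread
}

# New observations to check against the baseline
new_counts <- c(13, 14, 90, 12)
flags <- flag_unusual(new_counts, base_mean, base_sd)
print(data.frame(count = new_counts, unusual = flags))
```

Only the flagged rows would be passed on for human review; a real deployment would likely use a more robust baseline than a single mean and standard deviation.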
Pattern detection
 Pattern detection is the automated recognition of
patterns and regularities in data
 Best fitting distributions are determined by:
 Visual inspection of the distribution of the trace and the
candidate distributions
 Statistical test of potential candidates
 Example:
 Correlation analysis between variables, checking the
distribution, extreme value and so forth.
“Characterizing and Classifying IoT Traffic in Smart
Cities and Campuses”, A. Sivanathan et al, 2016
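The two steps for choosing a best-fitting distribution (visual inspection, then a statistical test) could look roughly like the sketch below. The trace is simulated, and the exponential distribution is just one assumed candidate; neither is taken from the tutorial dataset.

```r
set.seed(42)
# Simulated trace: 200 inter-arrival times in seconds (stands in for real traffic)
trace <- rexp(200, rate = 2)

# Candidate distribution: exponential, with the rate estimated from the trace
rate_hat <- 1 / mean(trace)

# Visual inspection: histogram of the trace with the candidate density overlaid
hist(trace, breaks = 20, freq = FALSE,
     main = "Trace vs. exponential fit", xlab = "Inter-arrival time (s)")
curve(dexp(x, rate = rate_hat), add = TRUE, col = "red")

# Statistical test: Kolmogorov-Smirnov test against the candidate distribution.
# The p-value is only approximate because the rate was estimated from the same data.
ks <- ks.test(trace, "pexp", rate = rate_hat)
print(ks$p.value)
```

A small p-value would speak against the candidate distribution; several candidates can be compared by repeating the same two steps.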
Feature selection
 The quality of feature set is crucial to the performance of a ML algorithm.
$\mathcal{X} = (\mathcal{X}_1, \mathcal{X}_2, \mathcal{X}_3, \dots)$
 Algorithms
 Filter method – make independent assessment based on general characteristics of the data.
They rely on certain metric to rate and select best subset before the learning. The results should
not be biased toward a particular ML algorithm.
 Wrapper method – evaluate the performance of different subsets using ML algorithms that will
be employed.
 Example: Correlation-based feature selection filter techniques, Regression such as Lasso
(L1), PCA, CCA, Greedy, Best-First/Genetic search.
[Figure label: the most suitable features that characterize the data set]
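As a rough illustration of a filter method, the sketch below scores each feature independently of any learning algorithm and ranks the features by that score. The feature names (packets, port, ttl), the synthetic data, and the choice of absolute Pearson correlation as the metric are assumptions made for this example.

```r
set.seed(1)
n <- 100
# Hypothetical features: by construction only 'packets' is related to the target
features <- data.frame(
  packets = rnorm(n),
  port = rnorm(n),
  ttl = rnorm(n)
)
target <- 2 * features$packets + rnorm(n, sd = 0.5)

# Filter method: score every feature on its own,
# here by the absolute Pearson correlation with the target
scores <- sapply(features, function(col) abs(cor(col, target)))

# Rank the features from most to least relevant
ranking <- sort(scores, decreasing = TRUE)
print(ranking)
```

A wrapper method would instead train the chosen ML algorithm on different feature subsets and compare their performance, which is more expensive but tailored to that algorithm.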
Classification
 Classification algorithm seeks to classify an object into a finite set of
categories.
 Supervised classifiers aim to build a concise model of the class label
distribution based on features of the classifiable objects.
 Example: Random Forest, Support Vector Machines, Ada Boost, Decision
Trees, Naïve Bayes and so forth.
Classification – Dataset split methodology
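[Figure: dataset split methodology. After preprocessing, the dataset is split into a training set (80%) and a testing set (20%). The training set is split again into training (80%) and validation (20%) parts. Stages shown: feature selection, model building, parameter tuning, model testing.]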
Training dataset is a dataset of examples used for learning
Test set is therefore a set of examples used only to assess the performance
Validation dataset is a set of examples used to tune the parameters of a classifier
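One possible reading of the split described above, sketched in base R: 20% of the rows are held out for testing, and the remaining rows are split again into 80% training and 20% validation. The data frame and its columns are hypothetical.

```r
set.seed(123)
# Hypothetical dataset with 100 instances
dataset <- data.frame(id = 1:100, value = rnorm(100))

# Hold out 20% of the rows as the test set
test_idx <- sample(nrow(dataset), size = floor(0.2 * nrow(dataset)))
test_set <- dataset[test_idx, ]
remaining <- dataset[-test_idx, ]

# Split the remaining rows again: 80% for training, 20% for validation
val_idx <- sample(nrow(remaining), size = floor(0.2 * nrow(remaining)))
validation_set <- remaining[val_idx, ]
train_set <- remaining[-val_idx, ]

print(c(train = nrow(train_set),
        validation = nrow(validation_set),
        test = nrow(test_set)))
```

The validation set is used while tuning parameters, and the test set is touched only once, at the end, to assess the final model.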
Hands-on experience - Steps
1. Use case description – Smart Home
2. Description of the dataset and collection process
3. What is R and R Studio? Basic commands and understanding
4. Traffic analysis pipeline:
a. Read the dataset and present the general characteristics of home
network traffic
b. Visualizing the properties of the network files
c. Identifying the patterns
d. Feature selection techniques
e. Device classification
dreamstime.com
1. Use case: Smart Home network
 A Smart Home network contains a set of
devices – the network traffic captures the
behavior from a unique perspective.
 Combined, these devices
provide a broad picture of home network
traffic and, more importantly, reveal
interesting traffic activities in home
networks.
http://telano.info/wp-content/uploads/2018/01/house-security-system-intended-for-my-home-connect-highlights-systems-decor-prices-project-mini-india.jpg
2. Dataset description
 Open dataset from Aalto University School of
Science, Finland 1
 Traffic from 27 smart home IoT devices such as a
smart camera, light, coffee maker, iKettle …
1 https://research.aalto.fi/en/datasets/iot-devices-captures(285a9b06-de31-4d8b-88e9-5bdba46cc161).html
Data collection process
 The typical device setup process was repeated 20 times in order to generate
sufficient fingerprints for each device.
 Typically, a setup procedure for a device involved:
 activating the device
 connecting to the device directly over Wi-Fi or Ethernet with the help of a vendor-
provided application
 transmitting Wi-Fi credentials to the user’s network over this connection to the
device, after this
 the device would typically reset and connect to the user’s network using the provided
credentials
3. R & R Studio
 R is a programming language for
statistical computing and graphics
 RStudio is a free and open source
integrated development
environment (IDE) for R
 R packages
Download - https://www.rstudio.com/products/rstudio/download/
Create your first project in R Studio
1. File > New Project > New Directory > New Project > enter a directory name and choose the
path where the project will be saved
2. File > New File > R Script
3. File > New File > R Notebook
a. Remove what is written after ```{r} and before ``` and write print("Hello there")
b. Execute the program: Preview > Preview Notebook
R basic commands (1/3)
 Variables: x <- 10 or x = 10, also name <- "Camera Front"
 Display the type and length: mode(x) and length(x)
 List the objects in memory: ls()
 Writing your own functions
# return one random integer between n and m
passgen <- function(n, m) {
pass <- sample(n:m, 1)
return(pass)
}
 Creating a condition:
if (old == pass) {
pass = passgen(1, 10)
} else {
print(paste("New password is", pass))
}
 Creating loops:
while (x > 0) {
...
}
for (i in 1:length(x)) {
...
}
Output:
> x = 10
> name = "Camera Front"
>
> mode(x)
[1] "numeric"
> length(x)
[1] 1
>
> mode(name)
[1] "character"
> length(name)
[1] 1
>
> # random integer number generator
> passgen <- function(n, m) {
+ pass <- sample(n:m, 1)
+ return (pass)
+ }
> pass = passgen(1, 10)
>
> old = 5
> if (old == pass) pass = passgen(1, 10) else print(paste("New password is", pass))
[1] "New password is 2"
Paradis, E. 2002. R for Beginners. Montpellier (F): University of Montpellier URL http://cran.r-project.org/doc/contrib/rdebuts_en.pdf
R basic commands (2/3)
 Installing packages in R
install.packages("curl") or
Tools > Install Packages … > [Type the name in the 'Packages' box]
 Call the package functionality
library(curl) or
require(curl)
R working with dataset (3/3)
 Data objects: Vector, Factor, Matrix, Data frame, List, Time-series, Expression
 Read dataset
data = read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, ...)
 Get the column names of the data frame object: names(data)
 The data frame can be accessed in various ways:
 data$[name of the column] # by field name, or data$[name of the column][1:5] # select the first five elements of it
 data[1,] # get the first row, or data[1,2] # get one element
 Visualize the data
x <- rnorm(10)
y <- rnorm(10)
plot(x,y)
plot(x, y, xlab="Ten random values", ylab="Ten other values", xlim=c(-2, 2), ylim=c(-2, 2), pch=22,
col="red", bg="yellow", bty="l", tcl=0.4, main="How to customize a plot with R", las=1, cex=1.5)
Rossiter, D. G. (2012). Introduction to the R Project for Statistical Computing for use at ITC. International Institute for Geo-information Science & Earth Observation (ITC), Enschede (NL), 3, 3-6.
4. Traffic analysis pipeline
Preprocessing
• Check for missing values (several techniques)
• Convert IP addresses to domain names
• Remove redundant records
Understanding the data
• Data exploration techniques to check the distribution of the data
• Check and deal with outliers: statistical methods, wavelets, Principal Component Analysis, or Partial Least Squares
Feature selection
• Statistical analysis
• Visual analysis: data exploration
• Check correlation between features
Classification
• Identify destination address based on protocol and source address
• Identify mean IAT based on median, min, max, variance of IAT
• Identify device type based on the highest probability
Read dataset (1/2)
 Network traffic is captured in pcap (packet capture) file format
 You can open with Wireshark 1
 Extract the fields from the pcap file using tshark command 2
Open the terminal and type
tshark -nr [path to pcap file]/[name of file].pcap -T fields -e ip.dst -e ip.src -e ip.proto > length_counts.txt
sort -n length_counts.txt | uniq -c > table_lengths.csv
 R code
require(curl)
data <- read.csv(curl("https://raw.githubusercontent.com/CloudDemo/traffic-analysis/master/dataset_X.csv"))
head(data) # look at the first 6 rows of data
summary(data) # look at the data properties
1 Download: https://www.wireshark.org/download.html
2 Tshark synopsis: https://www.wireshark.org/docs/man-pages/tshark.html
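The field file written by the tshark command above can also be read and counted directly in R. The sketch below uses three made-up lines in the tab-separated layout that tshark writes by default with -T fields (destination IP, source IP, IP protocol number); the addresses are hypothetical.

```r
# Three hypothetical lines in the layout of length_counts.txt:
# destination IP, source IP, and IP protocol number, separated by tabs
raw_lines <- c(
  "192.168.0.1\t192.168.0.20\t6",
  "192.168.0.1\t192.168.0.20\t6",
  "8.8.8.8\t192.168.0.21\t17"
)

# With a real capture, pass the file path instead of 'text = raw_lines',
# for example read.table("length_counts.txt", ...)
packets <- read.table(text = raw_lines, sep = "\t",
                      col.names = c("ip.dst", "ip.src", "ip.proto"),
                      stringsAsFactors = FALSE)

# Count how many packets share the same (destination, source, protocol) triple,
# which mirrors what 'sort | uniq -c' does on the command line
counts <- aggregate(n ~ ip.dst + ip.src + ip.proto,
                    data = cbind(packets, n = 1), FUN = sum)
print(counts)
```

Protocol number 6 is TCP and 17 is UDP, so the counts can be read as packets per source, destination, and transport protocol.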
Read dataset (2/2)
Traffic analysis pipeline - Preprocessing
Preprocessing (1/2)
 Check for missing values
 Remove records with missing values
 Replace the missing fields with column mean of that feature
# Handling missing data
is.na(data) # returns TRUE for elements of 'data' that are missing
newdata <- na.omit(data) # create a new dataset without rows that contain missing data
# Replace missing values with the column mean
# (another statistic, such as the median, can be used instead of the mean)
for (i in 1:ncol(data)) {
if (is.numeric(data[, i])) { # the mean is only defined for numeric columns
data[is.na(data[, i]), i] <- mean(data[, i], na.rm = TRUE)
}
}
Preprocessing (2/2)
 Redundant records
 How many and what was the case?
 Four types of network anomalies
can be detected:
 invalid TCP flag combinations
 large number of TCP resets
 UDP and TCP port scans
 traffic volume anomalies
# Removing duplicate rows
duplicated(data) # logical vector marking rows that repeat an earlier row
data[!duplicated(data), ] # keep only the first occurrence of each row
# Another method for removing duplicate rows using the 'dplyr' package
install.packages("dplyr") # Install
library("dplyr") # Load
distinct(data) # Remove duplicates based on all columns
distinct(data, '') # Remove duplicated rows based on ''
Traffic analysis pipeline – Understanding the data
Understanding the data
 Visualization library – ggplot2
 Types of visualizations (most used):
 Scatter Plot - relationship between two
continuous variables
 Histogram - where and how the data
points are distributed
 Line Chart – progress of the data points
over a period
 Bar Chart – comparison between
categorical variables
 Tree Map – displaying hierarchical data
 Area Chart - how a particular metric
performed compared to a certain
baseline
Flow chart produced by Dr. Andrea Abela, Chairman of the Department of Business & Economics at
the Catholic University of America in Washington, DC
Understanding the data (1/3)
 Network traffic direction can be
categorized in the following groups:
 Flow or uni-directional flow:
 A series of packets sharing the same
five-tuple: source IP, destination IP,
source port, destination port, protocol
 Bi-directional flow:
 A pair of unidirectional flows going in
opposite directions between the
same source and destination IP.
 Full-flow:
 A bi-directional flow captured over its
entire lifetime, from establishment to
the end of the communication
connection.
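library(plyr)
library(ggplot2)
# Count the packets that share the same five-tuple; ddply stores the count in column V1
heatmap_data = ddply(data, .(App.protocol, Destination.adr, Destination.port,
Source.adr, Source.port), nrow)
head(heatmap_data)
heatmap_data$App.protocol = factor(heatmap_data$App.protocol)
# Weight each bar by V1 so that its height is the total number of packets per protocol
ggplot(heatmap_data, aes(x = App.protocol, weight = V1)) + geom_bar() +
xlab('Protocol type') + ylab('Total number of packets')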
Bar chart visualization
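The flow definitions above can be made concrete with a small base R sketch that assigns each packet to a unidirectional flow (by its five-tuple) and to a bidirectional flow (by treating both directions as the same conversation). The packets and column names are hypothetical, and the sketch matches endpoints on IP and port, which is one possible interpretation.

```r
# Hypothetical packets of one TCP conversation; the column names are illustrative
packets <- data.frame(
  src_ip   = c("10.0.0.2", "10.0.0.9", "10.0.0.2", "10.0.0.9"),
  dst_ip   = c("10.0.0.9", "10.0.0.2", "10.0.0.9", "10.0.0.2"),
  src_port = c(5000, 80, 5000, 80),
  dst_port = c(80, 5000, 80, 5000),
  protocol = c("TCP", "TCP", "TCP", "TCP"),
  stringsAsFactors = FALSE
)

# Unidirectional flow key: the five-tuple exactly as seen in the packet
packets$flow <- with(packets,
                     paste(src_ip, src_port, dst_ip, dst_port, protocol))

# Bidirectional flow key: order the two endpoints so that
# both directions map to the same key
endpoint_a <- with(packets, paste(src_ip, src_port))
endpoint_b <- with(packets, paste(dst_ip, dst_port))
packets$biflow <- paste(pmin(endpoint_a, endpoint_b),
                        pmax(endpoint_a, endpoint_b),
                        packets$protocol)

print(table(packets$flow))    # packets per unidirectional flow
print(table(packets$biflow))  # packets per bidirectional flow
```

In this example the four packets form two unidirectional flows that belong to a single bidirectional flow.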
Understanding the data (2/3)
# List all the devices and protocol type usage
heatmap_port = ddply(heatmap_data, .(Source.adr, App.protocol), nrow)
names(heatmap_port) <- c('SourceAdr', 'Protocol', 'Freq')
ggplot(data = heatmap_port, aes(x = Protocol, y = SourceAdr)) +
geom_point() + xlab('Protocol') + ylab('Source address')
# List selected devices per MAC address
selected_macs = c("74:da:38:80:79:fc", "74:da:38:80:7a:08", "b0:c5:54:1c:71:85",
"b0:c5:54:25:5b:0e", "5c:cf:7f:07:ae:fb", "5c:cf:7f:06:d9:02")
selected = heatmap_data[heatmap_data$Source.adr %in% selected_macs, ]
heatmap_port = ddply(selected, .(Source.adr, App.protocol), nrow)
names(heatmap_port) <- c('SourceAdr', 'Protocol', 'Freq')
ggplot(data = heatmap_port, aes(x = Protocol, y = SourceAdr)) +
geom_point() + xlab('Protocol') + ylab('Source address')
Scatter plot visualization (devices labeled in the figure: D-LinkCam, EdimaxCam, SmarterCoffee, iKettle2)
Understanding the data (3/3)
 IAT – Inter Arrival Time
 The time between the “start” of
two events.
 In a network context, the time
between the arrivals of two consecutive
packets.
# Inter-arrival times: differences between consecutive packet arrival times
# (column 4 of 'data' is taken to hold the arrival time, as in the original loop)
IA_times = diff(data[, 4])
IAT_times <- data.frame(packet = seq_along(IA_times), IA_times = IA_times)
ggplot(IAT_times, aes(x = packet, y = IA_times)) +
geom_line(color = 'steelblue') +
xlab('Number of packets') + ylab('Inter Arrival Time')
Line chart visualization
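The pipeline also lists summary statistics of the IAT (mean, median, min, max, variance) as inputs for classification. A minimal sketch of computing them, assuming a short vector of hypothetical arrival timestamps for one device:

```r
# Hypothetical packet arrival timestamps in seconds for one device
arrival <- c(0.0, 0.5, 1.5, 1.75, 3.75)

# Inter-arrival time: difference between consecutive arrival timestamps
iat <- diff(arrival)

# Summary statistics of the IAT, which can later serve as features
iat_stats <- c(mean = mean(iat), median = median(iat),
               min = min(iat), max = max(iat), variance = var(iat))
print(iat_stats)
```

Computed per device, such statistics give one feature vector per device that a classifier can work with.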
Traffic analysis pipeline – Feature selection
Feature selection (1/5)
 Pearson Correlation - is a number between -1 and 1 that indicates the
extent to which two variables are linearly related.
 Chi-Square - is used to determine relationship between two variables.
[Figure: traffic observations, packet/flow data, post processing, feature vector, traffic analysis, results]
$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$
$\chi^2 = \sum \frac{(\mathit{Observed} - \mathit{Expected})^2}{\mathit{Expected}}$
The test reports the probability of obtaining such results if the
null hypothesis (the variables are not associated)
is true.
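Both statistics can be checked by hand against R's built-in functions. The sketch below uses small made-up vectors and a made-up 2x2 contingency table; the labels (TCP, UDP, camera, kettle) are illustrative only.

```r
# Hypothetical numeric features
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson correlation from its definition:
# covariance divided by the product of the standard deviations
rho_manual <- cov(x, y) / (sd(x) * sd(y))
rho_builtin <- cor(x, y)

# Hypothetical contingency table of two categorical variables
observed <- matrix(c(30, 10, 20, 40), nrow = 2,
                   dimnames = list(protocol = c("TCP", "UDP"),
                                   device = c("camera", "kettle")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: sum over all cells of (Observed - Expected)^2 / Expected
chi_manual <- sum((observed - expected)^2 / expected)

# correct = FALSE disables the continuity correction so the result matches the formula
chi_builtin <- chisq.test(observed, correct = FALSE)

print(c(manual = rho_manual, builtin = rho_builtin))
print(c(manual = chi_manual, builtin = unname(chi_builtin$statistic)))
print(chi_builtin$p.value)
```

A p-value below the chosen significance level (0.05 in this tutorial) leads to rejecting the null hypothesis that the two variables are not associated.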
Feature selection (2/5)
 Filter method analysis using Pearson’s correlation with package ‘caret’
library(caret)
correlationMatrix <- cor(basic.features[,5:7]) # calculate correlation matrix
print(correlationMatrix) # summarize the correlation matrix
# find attributes that are highly correlated (a cutoff of 0.75 is a common choice; 0.5 is used here)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)
print(highlyCorrelated) # print indexes of highly correlated attributes
Feature selection (3/5)
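 Filter method analysis using the Chi-Square test of independence
library(MASS) # optional: chisq.test() is in the 'stats' package, which is loaded by default
tbl = table(selected$V1, selected$V2) # contingency table of the two variables
print(tbl)
chisq.test(tbl) # tests whether the destination IP is independent of the protocol at the 0.05 significance level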
Feature selection (4/5)
 Wrapper analysis using a regression technique
with the glm() function
 We used features from the TCP/IP header, packet
time, and IAT value
Fonti, V., & Belitser, E. (2017). Feature selection using lasso. VU Amsterdam Research Paper in Business Analytics.
# Fit a logistic regression model
fit_glm = glm(V1 ~ V5, family = "binomial", data = selected)
summary(fit_glm) # generate summary
Feature selection (5/5)
 Loop through all the variables to find which features are most
relevant
# Number of columns in the data frame
n_cols <- ncol(selected)
# Initializing the vector which will contain the p-values of all variables
pvalues <- numeric(n_cols)
# Getting the p-value of each variable from a single-predictor logistic regression
for (i in 1:n_cols) {
fit <- glm(selected$V1 ~ selected[,i], data = selected, family = "binomial")
summ <- summary(fit)
pvalues[i] <- summ$coefficients[2,4]
}
# Order the columns from the smallest to the largest p-value
ord <- order(pvalues)
x10 <- selected[,ord]
names(x10)
Traffic analysis pipeline - Classification
Classification (1/4)
 A common way to characterize a classifier’s accuracy is through metrics known as False Positives
(FP), False Negatives (FN), True Positives (TP), and True Negatives (TN) by using confusion matrix.
 ML literature often utilizes three additional metrics known as Recall, Precision, and F-measure.
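Confusion matrix (rows: actual class, columns: classified as)
  | X | Y
X | TP | FN
Y | FP | TN
$\mathit{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$
$\mathit{Precision} = \frac{TP}{TP + FP}$
$\mathit{Recall} = \frac{TP}{TP + FN}$
$F\text{-}\mathit{measure} = 2 \cdot \frac{\mathit{Precision} \cdot \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}}$
$\mathit{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (P_i - O_i)^2}{n}}$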
 Another commonly used metric is the Root Mean Square Error (RMSE), which measures the differences between
the values predicted by the model (Pi) and the observed values (Oi).
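The metrics on this slide can be computed directly from a confusion matrix in base R. The labels and numeric predictions below are made up for illustration, with class X treated as the positive class.

```r
# Hypothetical true and predicted labels for a two-class problem
actual    <- c("X", "X", "X", "Y", "Y", "Y", "Y", "X")
predicted <- c("X", "Y", "X", "Y", "X", "Y", "Y", "X")

# Confusion matrix: rows are actual classes, columns are predicted classes
cm <- table(actual = actual, predicted = predicted)

# Read the four cells, with X as the positive class
tp <- cm["X", "X"]; fn <- cm["X", "Y"]
fp <- cm["Y", "X"]; tn <- cm["Y", "Y"]

accuracy  <- (tp + tn) / (tp + fn + fp + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f_measure <- 2 * precision * recall / (precision + recall)

# RMSE for hypothetical numeric predictions P and observations O
P <- c(2.5, 0.0, 2.0, 8.0)
O <- c(3.0, -0.5, 2.0, 7.0)
rmse_value <- sqrt(mean((P - O)^2))

print(cm)
print(c(accuracy = accuracy, precision = precision,
        recall = recall, f_measure = f_measure, rmse = rmse_value))
```

Accuracy alone can be misleading when one class dominates, which is why precision, recall, and the F-measure are usually reported alongside it.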
Classification (2/4)
 Identify the protocol used based on the destination address using the decision tree
algorithm
# load the required libraries
library(rpart)
suppressPackageStartupMessages(library(Metrics)) # Used for calculating root mean squared error
suppressPackageStartupMessages(library(caret)) # For data partition
suppressPackageStartupMessages(library(rpart.plot)) # For plotting the model
selected$V1 = as.integer(selected$V1)
# Hold out the last rows for testing; the two sets must not overlap
train_set = selected[1:5917,]
test_set = selected[5918:6574,]
model_rf <- rpart(V1 ~ V2, data = train_set)
prediction_set <- predict(model_rf, test_set)
Classification (3/4)
 Calculate the error
library(Metrics)
# RMSE computed manually ...
rmse_value <- sqrt(mean((test_set$V1 - prediction_set) ^ 2))
# ... or with rmse() from the 'Metrics' package
rmse_value <- rmse(test_set$V1, prediction_set)
paste("The root mean square error is", rmse_value, sep = " ")
rpart.plot(model_rf)
Classification (4/4)
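 Identify the protocol used based on the destination address using the Random Forest
algorithm
# Random Forest classifier
library(randomForest)
library(Metrics) # for accuracy()
# Treat the target as a class label so that randomForest performs classification
train_set$V1 <- as.factor(train_set$V1)
# applying Random Forest
model_rf <- randomForest(V1 ~ V2, data = train_set)
preds <- predict(model_rf, test_set)
table(preds)
accuracy(as.character(test_set$V1), as.character(preds)) # checking accuracy (actual, predicted)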
Takeaways from the Theoretical Part
– Things to remember
 IoT systems are increasingly present in our daily lives, and we need to
be informed about how to use them and what the security and privacy risks are
 There are methods that can help protect our privacy and security
 ML techniques can help us develop tools that protect our
privacy and security
Takeaways from the Practical Part
– Things to remember
 This work presented several steps that should be considered when pre-processing raw network traffic data for data mining.
 Some of these steps are essential; others are optional.
 It provided a case study using a freely available dataset.
 We showed the basics of using R and RStudio, a platform used mostly by statisticians
 We showed how to use Random Forest, one of the most widely used ML algorithms
 The results show that these steps are key to obtaining reliable results.
Thank you very much!
 Code and dataset are available at https://github.com/CloudDemo/traffic-analysis
 Email: kolivera@ieee.org
Questions

Data Science for Smart Home IoT Network Analysis

  • 1. Data Science for Internet of Things Case Study: Smart Home Devices Olivera Kotevska, PhD kolivera@ieee.org
  • 2. Goal of this tutorial  Presents what is Internet of Things (IoT)  Eco system, protocols, architectures, and challenges  Presents some of the IoT vulnerabilities  Vulnerabilities of IoT - security and privacy aspects  Overview at some of the methods to overcome these vulnerabilities  Use a practical example of Smart Home network  Present a set of methods for analyzing IoT network behavior
  • 3. Outline  Theoretical part  IoT systems  Challenges  Ways to overcome some of those challenges  Role of data science in IoT network traffic - Approaches: Pattern detection, Classification  Practical part - Use case: Smart Home network  Setting up the environment using R and RStudio environment  Set of data science algorithms for IoT network analysis using R & R Studio  Takeaways  Few things to remember  References
  • 4. Internet of Things (IoT) – history and definition  First mentioned in 1999 by Kevin Ashton during his work at Proctor & Gamble.  IoT is a network of things which enables these things to connect and exchange data, resulting in efficiency improvements.  Things refers to wide variety of objects – devices, systems, or human.  The resemblance between all IoT things is the ability to connect to the Internet.  IoT has been called the Third Wave in information industry. https://appsec-labs.com/iot-security/
  • 5. Areas of IoT  General classification areas based on target audience (Individuals, Society, Industry):  Smart Buildings, Office and Homes  Smart Energy meters  Smart Agriculture and Water monitoring  Water levels  Smart Manufacturing and Industry  Industrial asset monitoring  Smart Mobility and Transport  Smart Health and Medicine  Fitness trackers  Smart Home  Temperature adjustments https://brewcitypc.com/home-services/ http://www.exchangecommunications.co.uk Tenzin, S. et al. (2017) KST, 172-177. https://www.i-scoop.eu/
  • 6. The IoT Comic Book is a result of the EU Internet of Things Initiative. https://iotcomicbook.org/
  • 7. Growing number of IoT devices  General goals of IoT are to:  Maximize health and safety  Maximizing the convenience of its execution while minimize the amount of work  Minimize the costs “The number of IoT devices increased 31% year-over-year to 8.4 billion in 2017 and it is estimated that there will be 30 billion devices by 20201.” 1 Nordrum, Amy (18 August 2016). "Popular Internet of Things Forecast of 50 Billion Devices by 2020 Is Outdated". IEEE.
  • 8. Understanding the IoT eco system  The whole task of IoT is:  IoT devices collect the information from the environment  knowledge should be extracted from the raw data  the data will be ready for transfer to other objects, devices, or servers through the Internet  IoT includes four main components: sensors, networks, analyzing data, monitoring the system. A high-level system model of IoT Ammar, M., Russello, G., & Crispo, B. (2018). Internet of Things: A survey on the security of IoT frameworks. Journal of Information Security and Applications, 38, 8-27.
  • 9. 1. IoT Sensors  Connectivity capabilities  Reach the outside world (e.g. cloud) directly e.g. phone  Others must connect to a hub or gateway first e.g. smart camera  Processing capabilities  Processing on the sensor  Processing on the hub or cloud  Combination between both Gope, P., & Hwang, T. (2016). IEEE Sensors Journal, 16, 1368-1376. Motion sensor Door & window sensor Raspberry Pi
  • 10. 2. IoT Network  IoT communication protocols  Device to Device  Device to Server  Server to Server  Networks characteristics  All network topologies are supported – star, mesh, device-to-device  Wide range of verticals with specific network requirements  Wide variety of devices of different capabilities Al-Fuqaha, A., Guizani, M., Mohammadi, M., Aledhari, M., & Ayyash, M. (2015). Internet of things: A survey on enabling technologies, protocols, and applications. IEEE Communications Surveys & Tutorials, 17(4), 2347-2376.
  • 11. IoT Network  Integration of multiple network standards and protocols  Short range networks  RFID, NFC, Bluetooth, Ant, EnOcean, Z-Wave, Insteon, ZigBee, MiWi, DigiMesh, WirelessHART, Thread, 6LoWPAN, Wi-Fi  Long range networks  LoRaWAN, Symphony Link, Weightless, SIGFOX, DASH7 [Figure: devices connected over Bluetooth, Wi-Fi, ZigBee, and LoRaWAN] https://www.nist.gov/blogs/taking-measure/cybersecuring-internet-things
  • 12. IoT vs Non-IoT network traffic IoT network traffic  Environmental characteristics  heterogeneous, nonlinear, distributed  Data characteristics  volume, variety, variability, data types  Traffic patterns change more frequently “The availability of IoT traffic not only creates new application opportunities such as remote camera monitoring, but it changes the distribution of Internet traffic e.g. recent study shows that video streaming via Netflix accounts for 32.7% of peak downstream traffic in the US.” 1 Non-IoT network traffic  Environmental characteristics  homogeneous  Data characteristics  volume  Traffic patterns are more consistent 1 https://www.cnn.com/2011/10/27/tech/web/netflix-internet-bandwith-mashable/index.html
  • 13. 3. IoT Analysis  Refers to analyzing and examining the data obtained by the IoT  Key components for the collection of IoT data - sensors, network end devices, and other data storing and transmitting equipment  It is a service that runs and operationalizes sophisticated analytics on massive volumes of IoT data.
  • 14. 3. IoT Analysis - Traffic characteristics  Transmission characteristics  At any time of the day  From locations not accessible to humans  May be coordinated, synchronized  Periodic or event-driven  Real time and non-real time  Sleep time  Short and small packets  Properties  Network size & Message size  Traffic rate & interval  Demands on QoS  Energy source https://www.ericsson.com/en/mobility-report/massive-iot-in-the-city
  • 15. 4. IoT system monitoring – life cycle [Diagram of a cycle with the stages: Act, Sense, Communicate, Analyze (Store, Process), Visualize]
  • 16. ... but there is a dark side – some of IoT challenges  Hardware  Cost of devices, battery life, physical specifications  Communication and networking  Coverage - High complexity of distributed computing  Scalability and diversity - Various communication protocols  Reliability  Software  Interoperability, data processing, context awareness  Security  C.I.A. – Confidentiality, Integrity, and Availability  Attack resistance Zhang, Z. K. et al.(2014, November). IoT security: ongoing challenges and research opportunities. In Service-Oriented Computing and Applications (SOCA), 2014 IEEE 7th International Conference on (pp. 230-234). IEEE.
  • 17. IoT Security risks  The number of IoT devices and gadgets grows every day, and the security risks and potential difficulties grow along with it.  Potential vulnerabilities are:  User based o Node capture and eavesdropping o Controlling the data o Access attacks and privacy attacks  System based o Distribution and denial-of-service attacks o Complexity of vulnerability o Bandwidth constraints Mendez, D. M., Papapanagiotou, I., & Yang, B. (2017). Internet of things: Survey on security and privacy. arXiv:1707.01879.
  • 18. Famous cybersecurity attacks  The massive Target hack and the security breach at Equifax, which compromised sensitive personal information. (2017)  Mirai Botnet (aka Dyn Attack) – infected computers continually search the internet for vulnerable IoT devices and then use known default usernames and passwords to log in, infecting them with malware. These devices were things like digital cameras and DVR players. (2016)  Other attacks related to health and human well-being are:  In Finland cybercriminals shut down the heating of two buildings in the city. (2016)  FDA confirmed that St. Jude Medical’s implantable cardiac devices have vulnerabilities that could allow a hacker to access a device. (2017)
  • 19. IoT security architecture Jing, Q., Vasilakos, A. V., Wan, J., Lu, J., & Qiu, D. (2014). Security of the Internet of Things: perspectives and challenges. Wireless Networks, 20(8), 2481-2501.
  • 20. Application layer characteristics  Application layer – setting and ensuring common grounds - reaching the applications - for communication  Users can access the IoT through the application layer interface using TV, PC, mobile equipment and so forth  It provides personalized services according to the needs of the users  Can be thought of as the layer that users interact with and as the one which defines how process-to-process communications take place
  • 21. Protocols in different layers  Protocols are required in order to identify the spoken language of the IoT devices
        TCP/IP layer      | OSI layer          | Protocols
        Application layer | Application layer  | DNS, DHCP, FTP, HTTP(S), IMAP, NTP, POP3, SMTP, SNMP, TFTP, Telnet, RTP, RTSP, CoAP, MQTT, XMPP, AMQP
        Application layer | Presentation layer | JPEG, MIDI, MPEG, PICT, TIFF
        Application layer | Session layer      | NetBIOS, NFS, PAP, SCP, SQL, ZIP
        Transport layer   | Transport layer    | TCP, UDP
        Internet layer    | Network layer      | ICMP, IGMP, IPsec, IPv4/IPv6, IPX, RIP
        Link layer        | Data Link layer    | ARP, ATM, CDP, FDDI, Frame Relay, HDLC, MPLS, PPP, STP, Token Ring
        Link layer        | Physical layer     | Bluetooth, Ethernet, DSL, ISDN, 802.11 Wi-Fi
    Dizdarevic, J. et. al. (2018). Survey of Communication Protocols for Internet-of-Things and Related Challenges of Fog and Cloud Computing Integration. arXiv preprint arXiv:1804.01747.
  • 22. IoT data characteristics and challenges  Data collection challenges  Smart devices generate data in a continuous manner  Data generation varies for different devices – processing data with different generation rates is a challenge  Heterogeneous sources  Dynamic nature of data – moving sensors like cars  Quality of collected data  Error in measurements or precision of data collection  Devices’ noise in the environment  Discrete observation and measurement  Trustworthy sources Mahdavinejad, M. S. et al. (2017). Machine learning for Internet of Things data analysis: A survey. Digital Communications and Networks.
  • 23. Analytical capabilities  Intelligent processing and analysis of the data is the key to developing smart IoT applications.  Classification of analytical capabilities consisting of five categories:  Descriptive – What happened?  Diagnostic – Why something has happened?  Discovery – What happened that we don’t know about?  Predictive – What is likely to happen?  Prescriptive – What if?  The need to deal with network traffic patterns, large datasets and multidimensional spaces of flow and packet attributes is one of the reasons for using ML in this field. Siow, E., Tiropanis, T., & Hall, W. (2018). Analytics for the Internet of Things: A Survey. ACM Computing Surveys (CSUR), 51(4), 74.
  • 24. What is Data Science (DS)?  DS uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data  DS employs techniques and theories from many fields including machine learning (ML)  ML is the process of finding and describing structural patterns in a supplied dataset  ML takes input in the form of a dataset of instances  Each instance is characterized by the values of its features  Different styles of learning  Descriptive  Discovery  Diagnostic  Predictive
  • 25. ML approach to IoT traffic analysis (1/2)  Essential concepts for determining right algorithms to use 1. IoT application  Privacy of collected data  Security parameters such as network security, and data encryption 2. IoT data characteristics  Volume, velocity, varieties of data, quality of data 3. IoT data analytics algorithms - data-driven vision of ML algorithms  Analyze the data from variety of sources in real time  Algorithms that tolerate noisy data  Algorithms that can work with small amount of labeled data Xiao, L., Wan, X., Lu, X., Zhang, Y., & Wu, D. (2018). IoT Security Techniques Based on Machine Learning. arXiv preprint arXiv:1801.06275. Mahdavinejad, M. S., Rezvan, M., Barekatain, M., Adibi, P., Barnaghi, P., & Sheth, A. P. (2017). Machine learning for Internet of Things data analysis: A survey. Digital Communications and Networks.
  • 26. ML approach to IoT traffic analysis (2/2)  Traffic measurements in operational networks help to  understand traffic characteristics in deployed networks  develop traffic models  evaluate performance of protocols and applications  Traffic analysis  provides information about the user behavior patterns  enables network operators to understand the behavior of network users  Traffic prediction  important to assess future network capacity requirements and to plan future network developments
  • 27. Most used ML approaches for IoT traffic analysis  ML's biggest strength in security is learning what is "baseline" or "normal" for a system, and then flagging anything unusual for human review.  In order to make the right decision for data analysis, it is necessary to determine which of the following tasks applies:  Understanding the data / Pattern detection  Dynamic over time  Communication among devices  Traffic similarity between devices  Identify the features that characterize the importance of the specific device traffic  Destination IP, Number of packets  Classify each device and device type
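    The "learn the baseline, then flag the unusual" idea from this slide can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the packet counts are synthetic, and the helper `flag_unusual` and its three-standard-deviation threshold are hypothetical choices, not part of the tutorial's code.

    ```r
    # Synthetic per-minute packet counts for one device (assumed data)
    set.seed(42)
    baseline <- rpois(200, lambda = 50)   # "normal" traffic used to learn the baseline
    new_obs  <- c(48, 55, 51, 140, 47)    # new observations, one of them clearly unusual

    # Flag observations that lie more than k standard deviations
    # away from the mean of the reference (baseline) traffic
    flag_unusual <- function(x, reference, k = 3) {
      z <- (x - mean(reference)) / sd(reference)
      abs(z) > k
    }

    flags <- flag_unusual(new_obs, baseline)
    print(data.frame(packets = new_obs, unusual = flags))
    ```

    A fixed z-score threshold is about the simplest possible baseline model; it only serves to show where a human review step would be triggered.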
  • 28. Pattern detection  Pattern detection is the automated recognition of patterns and regularities in data  Best fitting distributions are determined by:  Visual inspection of the distribution of the trace and the candidate distributions  Statistical test of potential candidates  Example:  Correlation analysis between variables, checking the distribution, extreme value and so forth. “Characterizing and Classifying IoT Traffic in Smart Cities and Campuses”, A. Sivanathan et al, 2016
  • 29. Feature selection  The quality of the feature set is crucial to the performance of an ML algorithm; the goal is to find the most suitable features 𝒳 = {𝒳1, 𝒳2, 𝒳3, …} that characterize the data set.  Algorithms  Filter method – makes an independent assessment based on general characteristics of the data. It relies on a certain metric to rate and select the best subset before learning. The results should not be biased toward a particular ML algorithm.  Wrapper method – evaluates the performance of different subsets using the ML algorithm that will be employed.  Example: Correlation-based feature selection filter techniques, Regression such as Lasso (L1), PCA, CCA, Greedy, Best-First/Genetic search.
  • 30. Classification  A classification algorithm seeks to classify an object into a finite set of categories.  Supervised classifiers aim to build a concise model of the class label distribution based on features of the classifiable objects.  Example: Random Forest, Support Vector Machines, AdaBoost, Decision Trees, Naïve Bayes and so forth.
  • 31. Classification – Dataset split methodology [Diagram: Dataset → Preprocessing → split into training set (80%) and testing set (20%); the training set goes through feature selection and is split again into training (80%) and validation (20%) for model building and parameter tuning; the testing set is used for model testing]  The training dataset is a set of examples used for learning.  The validation dataset is a set of examples used to tune the parameters of a classifier.  The test set is a set of examples used only to assess the performance.
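    One possible reading of the split on this slide (80% training / 20% testing, with the training part split again 80/20 into training and validation) can be sketched in base R as follows. The toy data frame, its column names, and the seed are assumptions made for illustration only.

    ```r
    # Toy dataset standing in for preprocessed traffic features (assumed data)
    set.seed(1)
    n <- 100
    dataset <- data.frame(
      feature = rnorm(n),
      label   = sample(c("camera", "kettle"), n, replace = TRUE)
    )

    # First split: 80% for training, 20% held out for final testing
    idx_train <- sample(seq_len(n), size = floor(0.8 * n))
    train_all <- dataset[idx_train, ]
    test_set  <- dataset[-idx_train, ]

    # Second split of the training part: 80% to fit the model, 20% to tune parameters
    idx_fit        <- sample(seq_len(nrow(train_all)), size = floor(0.8 * nrow(train_all)))
    train_set      <- train_all[idx_fit, ]
    validation_set <- train_all[-idx_fit, ]

    print(c(train = nrow(train_set), validation = nrow(validation_set), test = nrow(test_set)))
    ```

    The test rows are set aside before any tuning so that they are used only once, to assess the final model.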
  • 32. Hands-on experience - Steps 1. Use case description – Smart Home 2. Description of the dataset and collection process 3. What is R and R Studio? Basic commands and understanding 4. Traffic analysis pipeline: a. Read the dataset and present the general characteristics of home network traffic b. Visualizing the properties of the network files c. Identifying the patterns d. Feature selection techniques e. Device classification dreamstime.com
  • 33. 1. Use case: Smart Home network  A Smart Home network contains a set of devices, and the network traffic captures their behavior from a unique perspective.  Combined, these devices provide a broad picture of home network traffic and, more importantly, reveal interesting traffic activities in home networks. http://telano.info/wp-content/uploads/2018/01/house-security-system-intended-for- my-home-connect-highlights-systems-decor-prices-project-mini-india.jpg
  • 34. 2. Dataset description  Open dataset from Aalto University School of Science, Finland 1  Traffic from 27 smart home IoT devices such as smart camera, light, coffee maker, iKettle … 1 https://research.aalto.fi/en/datasets/iot-devices-captures(285a9b06-de31-4d8b-88e9-5bdba46cc161).html
  • 35. Data collection process  The typical device setup process was repeated 20 times in order to generate sufficient fingerprints for each device.  Typically, a setup procedure for a device involved:  activating the device  connecting to the device directly over Wi-Fi or Ethernet with the help of a vendor- provided application  transmitting Wi-Fi credentials to the user’s network over this connection to the device, after this  the device would typically reset and connect to the user’s network using the provided credentials
  • 36. 3. R & R Studio  R is a programming language for statistical computing and graphics  RStudio is a free and open source integrated development environment (IDE) for R  R packages Download - https://www.rstudio.com/products/rstudio/download/
  • 37. Create your first project in R Studio 1. File > New Project > New Directory > New Project > Create directory name and choose the path to save the project 2. File > New File > R Script 3. File > New File > R Notebook a. Remove what is written after ```{r} and before ``` and write print("Hello there") b. Execute the program: Preview > Preview Notebook
  • 38. R basic commands (1/3)  Variables: x <- 10 or x = 10, also name <- "Camera Front"  Display the type and length with mode(x) and length(x)  List the objects in memory: ls()  Writing your own functions:
        # random integer number generator
        passgen <- function(n, m) {
          pass <- sample(n:m, 1)
          return(pass)
        }
     Creating a condition:
        if (old == pass) pass = passgen(1, 10) else print(paste("New password is", pass))
     Creating loops:
        while (x > 0) {
          for (i in 1:length(x)) {
            ...
          }
        }
    Output (the generated number is random and varies between runs):
        > x = 10
        > name = "Camera Front"
        >
        > mode(x)
        [1] "numeric"
        > length(x)
        [1] 1
        >
        > mode(name)
        [1] "character"
        > length(name)
        [1] 1
        >
        > # random integer number generator
        > passgen <- function(n, m) {
        +   pass <- sample(n:m, 1)
        +   return (pass)
        + }
        > pass = passgen(1, 10)
        >
        > old = 5
        > if (old == pass) pass = passgen(1, 10) else print(paste("New password is", pass))
        [1] "New password is 2"
    Paradis, E. 2002. R for Beginners. Montpellier (F): University of Montpellier URL http://cran.r-project.org/doc/contrib/rdebuts_en.pdf
  • 39. R basic commands (2/3)  Installing packages in R install.packages("curl") or Tools > Install Packages … > [Type the name in the ’Packages’ box]  Call the package functionality library(curl) or require(curl)
  • 40. R working with dataset (3/3)  Data objects: Vector, Factor, Matrix, Data frame, List, Time-series, Expression  Read dataset
        data = read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, ...)
     Get the column names of the data frame object
        names(data)
     The data frame can be accessed in various ways:
        data$[name of the column]        # by field name
        data$[name of the column][1:5]   # select elements of it
        data[1,]                         # first row
        data[1,2]                        # get one element
     Visualize the data
        x <- rnorm(10)
        y <- rnorm(10)
        plot(x, y)
        plot(x, y, xlab="Ten random values", ylab="Ten other values", xlim=c(-2, 2), ylim=c(-2, 2), pch=22, col="red", bg="yellow", bty="l", tcl=0.4, main="How to customize a plot with R", las=1, cex=1.5)
    Rossiter, D. G. (2012). Introduction to the R Project for Statistical Computing for use at ITC. International Institute for Geo-information Science & Earth Observation (ITC), Enschede (NL), 3, 3-6.
  • 41. 4. Traffic analysis pipeline Preprocessing • Check for missing values (several techniques) • Convert IP address to domain names • Remove redundant records Understanding the data • Data exploration techniques to check distribution of the data • Check and deal with outliers: statistical methods, wavelets, Principal Component Analysis or Partial Least Square Feature selection • Statistical analysis • Visual analysis – data exploration • Check correlation between features Classification • Identify destination address based on protocol and source address • Identify mean IAT based on median, min, max, variance of IAT • Identify device type based on the highest probability
  • 42. Read dataset (1/2)  Network traffic is captured in pcap (packet capture) file format  You can open it with Wireshark 1  Extract the fields from the pcap file using the tshark command 2 Open the terminal and type
        tshark -nr [path to pcap file]/[name of file].pcap -T fields -e ip.dst -e ip.src -e ip.proto > length_counts.txt
        sort -n length_counts.txt | uniq -c > table_lengths.csv
     R code
        require(curl)
        data <- read.csv(curl("https://raw.githubusercontent.com/CloudDemo/traffic-analysis/master/dataset_X.csv"))
        head(data)     # look at the first 6 rows of data
        summary(data)  # look at the data properties
    1 Download: https://www.wireshark.org/download.html 2 Tshark synopsis: https://www.wireshark.org/docs/man-pages/tshark.html
  • 44. Traffic analysis pipeline - Preprocessing
  • 45. Preprocessing (1/2)  Check for missing values  Remove records with missing values  Replace the missing fields with the column mean of that feature
        # Handling missing data
        is.na(data)               # TRUE where elements of 'data' are missing
        newdata <- na.omit(data)  # new dataset without the rows that contain missing values
        # Replace missing values with the column mean (numeric columns only);
        # the median can be used instead of the mean
        for (i in seq_len(ncol(data))) {
          if (is.numeric(data[, i])) {
            data[is.na(data[, i]), i] <- mean(data[, i], na.rm = TRUE)
          }
        }
  • 46. Preprocessing (2/2)  Redundant records  How many and what was the case?  Four types of network anomalies can be detected:  invalid TCP flag combinations  large number of TCP resets  UDP and TCP port scans  traffic volume anomalies
        # Removing duplicate rows
        duplicated(data)             # logical vector marking rows that repeat an earlier row
        data[!duplicated(data), ]    # keep only the first occurrence of each row
        # Another method for removing duplicate rows using the 'dplyr' package
        install.packages("dplyr")    # install
        library("dplyr")             # load
        distinct(data)               # remove duplicates based on all columns
        distinct(data, [name of the column], .keep_all = TRUE)  # remove duplicates based on one column
  • 47. Traffic analysis pipeline – Understanding the data
  • 48. Understanding the data  Visualization library – ggplot2  Types of visualizations (most used):  Scatter Plot - relationship between two continuous variables  Histogram - where and how the data points are distributed  Line Chart – progress of the data points over a period  Bar Chart – comparison between categorical variables  Tree Map – displaying hierarchical data  Area Chart - how a particular metric performed compared to a certain baseline Flow chart produced by Dr. Andrea Abela, Chairman of the Department of Business & Economics at the Catholic University of America in Washington, DC
  • 49. Understanding the data (1/3)  Network traffic direction can be categorized in the following groups:  Flow or uni-directional flow:  A series of packets sharing the same five-tuple: source IP, destination IP, source port, destination port, protocol  Bi-directional flow:  A pair of unidirectional flows going in opposite directions between the same source and destination IP.  Full-flow:  A bi-directional flow captured over its entire lifetime, from establishment to the end of the communication connection.
        library(plyr)
        library(ggplot2)
        # count the packets per combination of protocol, destination and source (ddply stores the count in column V1)
        heatmap_data = ddply(data, .(App.protocol, Destination.adr, Destination.port, Source.adr, Source.port), nrow)
        head(heatmap_data)
        heatmap_data$App.protocol = factor(heatmap_data$App.protocol)
        # weight = V1 makes each bar show the number of packets rather than the number of rows
        ggplot(heatmap_data, aes(x = App.protocol, weight = V1)) + geom_bar() + xlab('Protocol type') + ylab('Total number of packets')
    Bar chart visualization
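    The flow definitions on this slide can be made concrete with a small base R sketch that groups packets into unidirectional flows by their five-tuple (source IP, destination IP, source port, destination port, protocol) and into bi-directional flows by ignoring direction. The `packets` data frame and its column names are invented for illustration and do not come from the tutorial's dataset.

    ```r
    # Five toy packets (assumed data); packet 4 is a reply to packets 1, 2 and 5
    packets <- data.frame(
      src_ip   = c("10.0.0.2", "10.0.0.2", "10.0.0.3", "8.8.8.8", "10.0.0.2"),
      dst_ip   = c("8.8.8.8", "8.8.8.8", "1.1.1.1", "10.0.0.2", "8.8.8.8"),
      src_port = c(5000, 5000, 6000, 53, 5000),
      dst_port = c(53, 53, 443, 5000, 53),
      proto    = c("UDP", "UDP", "TCP", "UDP", "UDP"),
      stringsAsFactors = FALSE
    )

    # Unidirectional flow: all packets that share the same five-tuple
    uni_key   <- with(packets, paste(src_ip, src_port, dst_ip, dst_port, proto, sep = "|"))
    uni_flows <- table(uni_key)

    # Bi-directional flow: sort the two endpoints so both directions get the same key
    a <- with(packets, paste(src_ip, src_port, sep = ":"))
    b <- with(packets, paste(dst_ip, dst_port, sep = ":"))
    bi_key   <- paste(ifelse(a < b, a, b), ifelse(a < b, b, a), packets$proto, sep = "|")
    bi_flows <- table(bi_key)

    print(uni_flows)  # packets per unidirectional flow
    print(bi_flows)   # packets per bi-directional flow
    ```

    Building a string key per packet keeps the sketch dependency-free; on large captures a grouped aggregation would likely be the more practical choice.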
  • 50. Understanding the data (2/3)
        # List all the devices and their protocol usage
        heatmap_port = ddply(heatmap_data, .(Source.adr, App.protocol), nrow)
        names(heatmap_port) <- c('SourceAdr', 'Protocol', 'Freq')
        ggplot(data = heatmap_port, aes(x = Protocol, y = SourceAdr)) + geom_point() + xlab('Protocol') + ylab('Source address')
        # List selected devices by MAC address
        macs = c("74:da:38:80:79:fc", "74:da:38:80:7a:08", "b0:c5:54:1c:71:85", "b0:c5:54:25:5b:0e", "5c:cf:7f:07:ae:fb", "5c:cf:7f:06:d9:02")
        selected = heatmap_data[heatmap_data$Source.adr %in% macs, ]
        heatmap_port = ddply(selected, .(Source.adr, App.protocol), nrow)
        names(heatmap_port) <- c('SourceAdr', 'Protocol', 'Freq')
        ggplot(data = heatmap_port, aes(x = Protocol, y = SourceAdr)) + geom_point() + xlab('Protocol') + ylab('Source address')
    Scatter plot visualization [Figure: protocols used per device; devices shown: D-LinkCam, EdimaxCam, SmarterCoffee, iKettle2]
  • 51. Understanding the data (3/3)  IAT – Inter Arrival Time  The time between the “start” of two events.  In the network context, the time between the arrival of two consecutive packets.
        # column 4 of 'data' holds the packet arrival time
        IA_times = diff(data[, 4])   # time between each packet and the previous one
        IAT_times <- data.frame(packet = seq_along(IA_times), IA_times = IA_times)
        ggplot(IAT_times, aes(x = packet, y = IA_times)) + geom_line(color = 'steelblue') + xlab('Number of packets') + ylab('Inter Arrival Time')
    Line chart visualization
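    The pipeline slide mentions the mean, median, min, max, and variance of the IAT as features. A minimal base R sketch of how these could be derived from arrival timestamps is given below; the `arrival` vector is made-up data in seconds, not taken from the dataset.

    ```r
    # Assumed packet arrival times in seconds
    arrival <- c(0.00, 0.10, 0.25, 0.30, 1.30)

    # Inter-arrival time: difference between consecutive arrival times
    iat <- diff(arrival)

    # Summary statistics that can serve as per-device features
    iat_stats <- c(
      mean     = mean(iat),
      median   = median(iat),
      min      = min(iat),
      max      = max(iat),
      variance = var(iat)
    )
    print(round(iat_stats, 3))
    ```

    Note that n packets yield n - 1 inter-arrival times, so a device needs at least three packets before the variance is defined.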
  • 52. Traffic analysis pipeline – Feature selection
  • 53. Feature selection (1/5)  Pearson Correlation - a number between -1 and 1 that indicates the extent to which two variables are linearly related.
        $$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
     Chi-Square - used to determine the relationship between two variables.
        $$\chi^2 = \sum \frac{(\mathit{Observed} - \mathit{Expected})^2}{\mathit{Expected}}$$
    The p-value of the test is the probability of obtaining such results if the null hypothesis (the variables are not associated) is true.
    [Diagram: traffic observations (packet/flow data) → post-processing (feature vector) → traffic analysis → results]
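    Both statistics on this slide can be checked by hand against R's built-in functions. The sketch below uses small invented vectors and an invented 2x2 contingency table; the variable names and counts are assumptions for illustration.

    ```r
    # Pearson correlation: covariance divided by the product of standard deviations
    x <- c(1, 2, 3, 4, 5)
    y <- c(2, 4, 5, 4, 5)
    rho_manual  <- cov(x, y) / (sd(x) * sd(y))
    rho_builtin <- cor(x, y)

    # Chi-square statistic on a 2x2 contingency table (assumed counts)
    observed <- matrix(c(30, 10, 20, 40), nrow = 2,
                       dimnames = list(protocol = c("TCP", "UDP"),
                                       device   = c("camera", "kettle")))
    # Expected counts under independence: row total * column total / grand total
    expected     <- outer(rowSums(observed), colSums(observed)) / sum(observed)
    chi2_manual  <- sum((observed - expected)^2 / expected)
    # correct = FALSE turns off the continuity correction so the values match the formula
    chi2_builtin <- unname(chisq.test(observed, correct = FALSE)$statistic)

    print(c(rho_manual = rho_manual, rho_builtin = rho_builtin))
    print(c(chi2_manual = chi2_manual, chi2_builtin = chi2_builtin))
    ```

    For 2x2 tables `chisq.test()` applies Yates' continuity correction by default, which is why it is switched off here for the comparison.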
  • 54. Feature selection (2/5)  Filter method analysis using Pearson’s correlation with package ‘caret’
        library(caret)
        correlationMatrix <- cor(basic.features[, 5:7])  # calculate the correlation matrix (columns must be numeric)
        print(correlationMatrix)                         # summarize the correlation matrix
        # find attributes that are highly correlated (ideally > 0.75; a looser cutoff of 0.5 is used here)
        highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.5)
        print(highlyCorrelated)                          # print indexes of highly correlated attributes
  • 55. Feature selection (3/5)  Filter method analysis using the Chi-Square test (chisq.test() is part of R's built-in ‘stats’ package, so no extra package is needed)
        tbl = table(selected$V1, selected$V2)
        print(tbl)
        chisq.test(tbl)  # is the destination IP independent of the protocol at the .05 significance level?
  • 56. Feature selection (4/5)  Wrapper analysis using a regression technique with the glm() function (package ‘stats’)  We used features from TCP/IP Header, packet time, and IAT value
        # Fit a logistic regression model
        fit_glm = glm(V1 ~ V5, family = "binomial", data = selected)
        summary(fit_glm)  # generate summary
    Fonti, V., & Belitser, E. (2017). Feature selection using lasso. VU Amsterdam Research Paper in Business Analytics.
  • 57. Feature selection (5/5)  A for loop goes through all the variables to find which features are most relevant
        # Candidate predictors: every column except the response V1
        predictors <- setdiff(names(selected), "V1")
        # Initialize the vector that will contain the p-value of each variable
        pvalues <- numeric(length(predictors))
        names(pvalues) <- predictors
        # Get the p-values
        for (p in predictors) {
          fit <- glm(selected$V1 ~ selected[, p], family = "binomial")
          pvalues[p] <- summary(fit)$coefficients[2, 4]
        }
        ord <- order(pvalues)                # most significant first
        x10 <- selected[, predictors[ord]]
        names(x10)
  • 58. Traffic analysis pipeline - Classification
  • 59. Classification (1/4)  A common way to characterize a classifier’s accuracy is through metrics known as False Positives (FP), False Negatives (FN), True Positives (TP), and True Negatives (TN), obtained from the confusion matrix.
        Confusion matrix
        Classified as -> | X  | Y
        X                | TP | FN
        Y                | FP | TN
        $$\mathit{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$
     ML literature often utilizes three additional metrics known as Recall, Precision, and F-measure.
        $$\mathit{Recall} = \frac{TP}{TP + FN} \qquad \mathit{Precision} = \frac{TP}{TP + FP}$$
        $$F\text{-}\mathit{measure} = \frac{2 \cdot \mathit{Precision} \cdot \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}}$$
     Another commonly used metric is the Root Mean Square Error (RMSE), which measures the differences between the values predicted by the model, P_i, and the observed values, O_i.
        $$\mathit{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (P_i - O_i)^2}{n}}$$
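    The metrics on this slide can be computed directly from a confusion matrix in base R. The sketch below uses invented labels with class X treated as the positive class; all vectors are assumptions for illustration.

    ```r
    # Assumed true and predicted labels; X is the positive class
    actual    <- factor(c("X", "X", "X", "Y", "Y", "Y", "Y", "X"), levels = c("X", "Y"))
    predicted <- factor(c("X", "X", "Y", "Y", "Y", "X", "Y", "X"), levels = c("X", "Y"))

    # Confusion matrix: rows are the true class, columns the predicted class
    cm <- table(actual = actual, predicted = predicted)
    TP <- cm["X", "X"]; FN <- cm["X", "Y"]
    FP <- cm["Y", "X"]; TN <- cm["Y", "Y"]

    accuracy  <- (TP + TN) / (TP + FN + FP + TN)
    recall    <- TP / (TP + FN)
    precision <- TP / (TP + FP)
    f_measure <- 2 * precision * recall / (precision + recall)

    # RMSE between predicted values P and observed values O (assumed numbers)
    P <- c(2, 4, 6)
    O <- c(1, 4, 8)
    rmse_value <- sqrt(mean((P - O)^2))

    print(cm)
    print(c(accuracy = accuracy, recall = recall, precision = precision,
            f_measure = f_measure, rmse = rmse_value))
    ```

    Which class counts as "positive" is a convention; swapping X and Y swaps TP with TN and FP with FN, which changes recall and precision but not accuracy.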
  • 60. Classification (2/4)  Identify the protocol used based on the destination address using the decision tree algorithm
        # load the required libraries
        library(rpart)
        suppressPackageStartupMessages(library(Metrics))     # error metrics such as rmse()
        suppressPackageStartupMessages(library(caret))       # for data partition
        suppressPackageStartupMessages(library(rpart.plot))  # for plotting the model
        selected$V1 = as.integer(selected$V1)
        train_set = selected[1:5917, ]
        test_set = selected[5918:6574, ]                     # starts at 5918 so the two sets do not overlap
        model_dt <- rpart(V1 ~ V2, data = train_set)         # fit the decision tree on the training set
        prediction_set <- predict(model_dt, test_set)        # predict on the held-out test set
  • 61. Classification (3/4)  Calculate the error
        library(Metrics)
        rmse_value <- sqrt(mean((test_set$V1 - prediction_set)^2))  # computed by hand
        rmse_value <- rmse(test_set$V1, prediction_set)             # or with rmse() from 'Metrics'
        paste("The root mean square error is", rmse_value, sep = " ")
        rpart.plot(model_dt)
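  • 62. Classification (4/4)  Identify the protocol used based on the destination address using the Random Forest algorithm
        # Random Forest classifier
        library(randomForest)
        library(Metrics)                                   # provides accuracy()
        # the response must be a factor for randomForest to perform classification
        train_set$V1 <- as.factor(train_set$V1)
        test_set$V1 <- factor(test_set$V1, levels = levels(train_set$V1))
        model_rf <- randomForest(V1 ~ V2, data = train_set)  # applying Random Forest
        preds <- predict(model_rf, test_set)
        table(preds)
        accuracy(test_set$V1, preds)                       # checking accuracy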
  • 63. Takeaways from Theoretical Part – Things to remember  IoT systems are more and more present in our daily life, and we need to be informed about how to use them and what the security and privacy risks are  There are methods that can help to protect our privacy and security  ML techniques can help us to develop tools that protect our privacy and security
  • 64. Takeaways from Practical Part – Things to remember  This work presented several steps that should be considered when pre-processing raw network traffic data for data mining.  Some of these steps are essential, others are optional.  We provided a case study using a freely available dataset.  We showed the basics of using the R and R Studio environment, a platform mostly used by statisticians  We showed how to use one of the most widely used ML algorithms, Random Forest  Results show the steps are indeed key to obtaining reliable results.
  • 65. References  Wired  https://www.wired.com/story/firewalls-dont-stop-hackers-ai-might/  https://www.wired.com/story/ai-machine-learning-cybersecurity/  RStudio videos and examples  https://github.com/rstudio/webinars  Visualizations in R  Top 50 ggplot2 Visualizations - http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html  Smart Home  Amar, Y., Haddadi, H., Mortier, R., Brown, A., Colley, J., & Crabtree, A. (2018). An Analysis of Home IoT Network Traffic and Behaviour. arXiv preprint arXiv:1803.05368.  Network traffic analysis  Nguyen, T. T., & Armitage, G. (2008). A survey of techniques for internet traffic classification using machine learning. IEEE Communications Surveys & Tutorials, 10(4), 56-76.  Mohammadi, M., Al-Fuqaha, A., Sorour, S., & Guizani, M. (2018). Deep Learning for IoT Big Data and Streaming Analytics: A Survey. IEEE Communications Surveys & Tutorials.
  • 66. Thank you a lot!  Code and dataset are available - https://github.com/CloudDemo/traffic-analysis  Email: kolivera@ieee.org

Editor's Notes

  1. Hello everyone and welcome to this tutorial. My name is Olivera Kotevska and I am an International Postdoctoral Guest Researcher at the National Institute of Standards and Technology. Today I am going to present the tutorial ‘Data Science for IoT’. How many of you are familiar with the terminology of IoT and Data Science or Machine Learning?
  2. The purpose of this tutorial is educational and we are going to focus on two parts Theoretical and Practical.
  3. 5 or more min after this slide. Half of the presentation would be for the Theoretical part and another half for the Practical part
  4. Ashton wanted to attract attention to an exciting new technology called RFID. Because the Internet was the hottest new trend in 1999, he called his presentation “Internet of Things”. The term did not get attention for the next 10 years. “Anything that can be connected, will be connected.” IoT is represented as a global network that intelligently connects all objects and has self-configuring capabilities based on standard, interoperable protocols and formats. IoT has been called the Third Wave in the information industry, following the computer and the Internet. The Internet of Things (IoT) is the network of physical devices, vehicles, home appliances, and other items embedded with electronics, software, sensors, actuators, and connectivity, which enables these things to connect and exchange data, creating opportunities for more direct integration of the physical world into computer-based systems and resulting in efficiency improvements, economic benefits, and reduced human exertion. Kevin Ashton cofounded the Auto-ID Center at MIT, which created a global standard system for RFID and other sensors.
  5. IoT plays a remarkable role in all aspects of our lives. For instance, fitness trackers tell us our heart rate and the number of steps we have taken. Our car shows us that the gas level is low and recommends the nearest gas station on the road. We are collectively starting to love these devices, as they can keep us in order and bring us closer to a future that exists only in our dreams.
  6. Let’s look more closely at the scenario of the Smart Home, or Home Automation as some groups call it. This is an illustration from the EU IoT Initiative. The idea of the IoT comic book is to have an easy-to-read and easy-to-understand publication about the IoT. The aim is to be serious and fun at the same time. This is one of their showcases.
  7. 10 min after this slide. As we saw earlier, IoT is present in many areas, and it is becoming more and more popular. People are trying to automate things and processes in order to maximize safety and convenience. This shows that we need to pay attention to these things in general.
  8. Let’s now look at the IoT ecosystem to understand how it works. If we look at the image as one type of architecture model, we have a user with a phone connected to the Internet. It is a life cycle of sensing, computing, delivering, and presenting data. Let’s look at each component individually.
  9. IoT sensors can be something like a motion sensor that reacts when there is movement in its area of influence, or a door and window sensor that reacts when they are opened. It can also be something like a Raspberry Pi, which we can build on our own and integrate with any sensor, such as temperature, light, and so on. There are even sensors that we can wear, such as ECG and blood pressure sensors.
  10. From the Network aspect … Users use smart phones, tablets, laptops to interact with other IoT devices indirectly through either a cloud backend or a gateway. Most used topologies – star, mesh, point-to-point.
  11. For instance, we can have something like what is illustrated in this image: a set of protocols and standards in one ecosystem.
  12. We can use Chromecast to stream video on our TV.
  13. Analytics can be on the device, edge level, or cloud level. It depends on the capabilities, architecture, type of analysis, and other factors.
  14. Network traffic between IoT devices can be very diverse. Some of the properties that differ are: network size, … Network traffic is an essential component of communication between intelligent devices.
  15. Life cycle of sensing, computing, delivering, and presenting data. The idea is that this system of constantly sensing, analyzing, and acting gradually improves our life, production, spending … Between 15 – 20 min
  16. Security of information is the key point of any connected network, so our research moves towards making systems more secure than the existing ones, which plays a vital role in the digital world.
  17. Node capture and eavesdropping (passive attack): the attacker physically captures a node or listens in on its environment and stores the information for later use. Controlling the data (active attack): the attacker gains partial or full control over the IoT device. Access attacks and privacy attacks: an intruder or unauthorized entity gains access to hardware components or the network without the privilege to do so, for example through unauthorized physical access to a device or through remote access to an IP-connected device. Complexity of vulnerability: an attacker can reduce system information assurance because of a vulnerability in the device or network. A vulnerability involves three components: (a) a system defect, (b) attacker access to that flaw, and (c) attacker ability to use the flaw. Bandwidth constraint: on top of signal transmission, network traffic has jumped by 700%. As more devices are connected to the web, organizations need to expand bandwidth and control the traffic in the network. Resource allocation is important under this bandwidth constraint. The security model of IoT can be explained by the 3C's: Computation, Communication, and Control.
  18. Based on those risks, some of the famous cybersecurity attacks are, for instance, Mirai .. In 2016, IBM estimated that an average organization deals with over 200,000 security events per day. Such as: preventing violent images, scanning comments, and detecting malware, fraudulent payments, and compromised computers and devices. It’s everywhere.
  19. We can look at the network stack and identify potential security issues. Currently our interest is in IoT application security, or more concretely Smart Home security, which belongs to the Application layer.
  20. The Application layer is important because that is the layer that is used.
  21. Protocols are required in order to identify the spoken language of the IoT devices in terms of the format of exchanged messages, and to select the correct boundaries that comply with the various functionality of each device. Protocols run on different layers and provide end-to-end communication. We are going to use some of those protocols later in our analysis. Protocols (https://www.postscapes.com/internet-of-things-protocols/): Infrastructure (ex: 6LoWPAN, IPv4/IPv6, RPL); Identification (ex: EPC, uCode, IPv6, URIs); Communication / Transport (ex: Wi-Fi, Bluetooth, Z-Wave, ZigBee, KNX, Thread, LPWAN); Discovery (ex: Physical Web, mDNS, DNS-SD); Data Protocols (ex: MQTT, CoAP, AMQP, WebSocket, Node); Device Management (ex: TR-069, OMA-DM); Semantic (ex: JSON-LD, Web Thing Model); Multi-layer Frameworks (ex: AllJoyn, IoTivity, Weave, HomeKit).
  22. Before we start analyzing the data, we need to collect it. These are some of the challenges. Quality data collection plays a huge part in data analysis.
  23. 25 min or more by this time. Pattern discovery -> Discovery, Predictive; Classification and Clustering -> Descriptive, Discovery, Predictive; Association and Correlation -> Diagnostic, Discovery, Predictive; Anomaly detection -> Discovery.
  24. The process of applying data analytics methods to particular areas involves defining data types such as volume, variety, velocity, and data models, and applying efficient algorithms that match the data characteristics. The goal is to understand which algorithm is appropriate for processing and decision-making on the smart data generated by the things in IoT. https://software.intel.com/en-us/articles/change-and-anomaly-detection-framework-for-internet-of-things-data-streams
  25. In the network traffic field, consecutive packets from the same flow might form an instance, while the set of features might include the median inter-arrival time (IAT) or the standard deviation.
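
The idea of treating a flow as an instance and summarizing it with features can be sketched as follows, here in Python using only the standard library. The packet records, the device names, and the choice of (source, destination, protocol) as the flow key are hypothetical, and the standard deviation is assumed to be taken over the inter-arrival times; the note itself does not say which quantity it refers to.

```python
from collections import defaultdict
from statistics import median, pstdev

# Hypothetical packet records: (timestamp in seconds, source, destination, protocol).
packets = [
    (0.00, "cam", "cloud", "TCP"),
    (0.10, "cam", "cloud", "TCP"),
    (0.30, "cam", "cloud", "TCP"),
    (0.60, "cam", "cloud", "TCP"),
    (1.00, "plug", "hub", "UDP"),
    (3.00, "plug", "hub", "UDP"),
    (5.00, "plug", "hub", "UDP"),
]


def flow_features(packets):
    """Group packets into flows and compute inter-arrival time (IAT) features."""
    flows = defaultdict(list)
    for timestamp, src, dst, proto in packets:
        flows[(src, dst, proto)].append(timestamp)

    features = {}
    for key, stamps in flows.items():
        stamps.sort()
        # IAT = gap between consecutive packets of the same flow.
        iats = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        if not iats:
            continue  # a single-packet flow has no inter-arrival time
        features[key] = {
            "packets": len(stamps),
            "median_iat": median(iats),
            "sd_iat": pstdev(iats),  # population SD; an assumption of this sketch
        }
    return features


features = flow_features(packets)
for flow, values in features.items():
    print(flow, values)
```

Each dictionary entry is one instance that a classifier could consume; other per-flow features such as packet sizes could be added in the same loop.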
  26. What has been used so far in IoT network traffic analysis.
  27. Other distributions are: exponential, Weibull, gamma, normal, lognormal, logistic, log-logistic, Nakagami, Rayleigh, Rician, t-location scale, Birnbaum-Saunders, inverse Gaussian.
  28. Using irrelevant or redundant features often leads to negative impacts on the accuracy of most ML algorithms. Select a subset of features that is small in size yet retains essential and useful information about the classes of interest.
  29. Supervised machine learning is the term for all algorithms that reason from externally supplied instances to produce general hypotheses, which then make educated conjectures about previously unseen instances.
  30. This goal is accomplished via training data for which the true classes are known. The resulting classifier is then used to predict class labels for instances of unknown class, and evaluated by various metrics of efficacy in this task.
  31. Explanation of Smart Home
  32. Making sense of this traffic could not only assist home users in understanding what is happening in home networks, but also help detect anomalous traffic toward home networks or originating from compromised home devices. The availability of the traffic monitoring platform makes it possible for us to analyze data traffic exchanged between home devices and Internet end hosts, as well as data traffic exchanged among home network devices.
  33. After each testing round, a hard reset of the tested device was performed to return it to its default factory settings.
  34. You can try on your own to create while and for loops.
  35. 40 – 50 min
  36. https://www.c-mric.com/wp-content/uploads/2018/06/Basil_Cyberincident2018.pdf Statistical distribution and the autocorrelation function of the traffic traces: Kolmogorov-Smirnov goodness-of-fit test, autocorrelation functions, wavelet-based estimation of the Hurst parameter.
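
Of the three techniques listed in this note, the autocorrelation function is the simplest to illustrate. The sketch below, in standard-library Python, computes the sample autocorrelation of a traffic trace at several lags. The trace is made-up packets-per-second data with a 4-second period, and the function name is hypothetical; the Kolmogorov-Smirnov test and the Hurst parameter estimate are not covered here.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at a given non-negative lag."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must be between 0 and len(series) - 1")
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    if denominator == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    numerator = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator


# Hypothetical packets-per-second counts: a burst every 4 seconds.
traffic = [10, 2, 1, 2] * 6

acf = [autocorrelation(traffic, lag) for lag in range(9)]
for lag, value in enumerate(acf):
    print(f"lag {lag}: {value:+.3f}")
```

For a periodic trace like this one, the values peak at lags that are multiples of the period, which is one way regular device behavior, such as a heartbeat message, shows up in traffic.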
  37. https://www.c-mric.com/wp-content/uploads/2018/06/Basil_Cyberincident2018.pdf Statistical distribution and the autocorrelation function of the traffic traces: Kolmogorov-Smirnov goodness-of-fit test, autocorrelation functions, wavelet-based estimation of the Hurst parameter.
  38. https://www.c-mric.com/wp-content/uploads/2018/06/Basil_Cyberincident2018.pdf Statistical distribution and the autocorrelation function of the traffic traces: Kolmogorov-Smirnov goodness-of-fit test, autocorrelation functions, wavelet-based estimation of the Hurst parameter.
  39. There are many ways to help understand the data and find some of the patterns. Other option for the image - https://www.youtube.com/watch?v=00zjDdXUcy4
  41. https://www.c-mric.com/wp-content/uploads/2018/06/Basil_Cyberincident2018.pdf Statistical distribution and the autocorrelation function of the traffic traces: Kolmogorov-Smirnov goodness-of-fit test, autocorrelation functions, wavelet-based estimation of the Hurst parameter.
  42. Covariance is a measure of the joint variability of two random variables. Standard deviation (SD) is a measure used to quantify the amount of variation or dispersion of a set of data values.
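
Both measures can be computed by hand in a few lines. The following is a minimal Python sketch using the sample (n - 1) versions of the formulas; the two variables, packet size and inter-arrival time, are made-up values chosen so that larger packets arrive closer together.

```python
from math import sqrt


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def sample_covariance(xs, ys):
    """Sample covariance of two equally long sequences (divides by n - 1)."""
    if len(xs) != len(ys):
        raise ValueError("both sequences must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical measurements for five packets.
packet_size = [60, 120, 180, 240, 300]      # bytes
inter_arrival = [0.5, 0.4, 0.3, 0.2, 0.1]   # seconds

sd_size = sample_sd(packet_size)
cov = sample_covariance(packet_size, inter_arrival)
print(f"SD of packet size: {sd_size:.2f} bytes")
print(f"Covariance of size and inter-arrival time: {cov:.2f}")
```

The negative covariance reflects that the two variables move in opposite directions in this toy data; its magnitude depends on the units, which is why correlation is usually preferred for comparing features.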
  43. Machine learning works on a simple rule: if you put garbage in, you will only get garbage out. By garbage here, I mean noise in the data. This becomes even more important when the number of features is very large. You need not use every feature at your disposal for creating an algorithm. You can assist your algorithm by feeding in only those features that are really important. You not only reduce the training time and the evaluation time, you also have fewer things to worry about! Top reasons to use feature selection are: it enables the machine learning algorithm to train faster; it reduces the complexity of a model and makes it easier to interpret; it improves the accuracy of a model if the right subset is chosen; it reduces overfitting. Pearson’s correlation is used as a measure for quantifying linear dependence between two continuous variables X and Y. Its value varies from -1 to +1. If you are working with a model which assumes a linear relationship between the dependent variables, correlation can help you come up with an initial list of importance. It also works as a rough list for nonlinear models. The idea is that those features which have a high correlation with the dependent variable are strong predictors when used in a model.
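
A correlation-based filter of this kind can be sketched as follows, in standard-library Python. The feature names, their values, and the numeric target are all hypothetical; the sketch only ranks features by the absolute value of Pearson's correlation with the target and does not claim to reproduce the selection procedure used in the tutorial.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical per-flow features and a numeric dependent variable.
candidates = {
    "mean_packet_size": [100, 200, 300, 400, 500],
    "flow_duration": [5, 3, 6, 2, 4],
    "packet_count": [50, 40, 35, 20, 10],
}
target = [1, 2, 3, 4, 5]

# Rank by |r|: a strong negative correlation is as informative as a positive one.
scores = {name: pearson(values, target) for name, values in candidates.items()}
ranking = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)

for name in ranking:
    print(f"{name}: r = {scores[name]:+.3f}")
```

A threshold on |r|, or keeping only the top few names in `ranking`, then gives the initial list of features mentioned above.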
  44. http://dataaspirant.com/2018/01/15/feature-selection-techniques-r/
  45. https://www.kaggle.com/nikhilesh87/easy-feature-selection-for-beginners-in-r
  46. https://www.c-mric.com/wp-content/uploads/2018/06/Basil_Cyberincident2018.pdf Statistical distribution and the autocorrelation function of the traffic traces: Kolmogorov-Smirnov goodness-of-fit test, autocorrelation functions, wavelet-based estimation of the Hurst parameter.
  47. These metrics are defined as follows: • False Negatives (FN): Percentage of members of class X incorrectly classified as not belonging to class X. • False Positives (FP): Percentage of members of other classes incorrectly classified as belonging to class X. • True Positives (TP): Percentage of members of class X correctly classified as belonging to class X (equivalent to 100% - FN). • True Negatives (TN): Percentage of members of other classes correctly classified as not belonging to class X (equivalent to 100% - FP).
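
The four definitions translate directly into code. The sketch below, in standard-library Python, computes them as percentages for one class X from lists of actual and predicted labels. The device labels and the predictions are made up for illustration, and the function name is hypothetical.

```python
def class_rates(actual, predicted, cls):
    """Percentages of TP, FN, FP and TN for one class, as defined above."""
    # Predictions for instances that truly belong to the class, and for the rest.
    members = [p for a, p in zip(actual, predicted) if a == cls]
    others = [p for a, p in zip(actual, predicted) if a != cls]
    if not members or not others:
        raise ValueError("need at least one instance inside and outside the class")

    tp = 100.0 * sum(p == cls for p in members) / len(members)
    fp = 100.0 * sum(p == cls for p in others) / len(others)
    return {"TP": tp, "FN": 100.0 - tp, "FP": fp, "TN": 100.0 - fp}


# Hypothetical device labels for ten flows and a classifier's predictions.
actual = ["camera"] * 4 + ["plug"] * 4 + ["hub"] * 2
predicted = [
    "camera", "camera", "camera", "plug",
    "plug", "plug", "camera", "plug",
    "hub", "hub",
]

rates = class_rates(actual, predicted, "camera")
for metric, value in rates.items():
    print(f"{metric}: {value:.1f}%")
```

Note that TP and FN are percentages of the class's own members, while FP and TN are percentages of the members of all other classes, so the two pairs each sum to 100%.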
  48. https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/
  49. 50 – 70/80
  50. 70/80 – 75/85
  51. Here are a few references I have selected that are interesting if you want to learn and explore more.
  52. 12:10 – 12:20