MODULE: 6 DATA ANALYTICS FOR IOT SOLUTIONS
• Data generation, Data gathering, Data Pre-processing, Data analysis, application of analytics,
Exploratory Data Analysis, vertical-specific algorithms.
LIFECYCLE OF THE DATA SCIENCE PROJECT
APPLICATION OF EXPLORATORY DATA
ANALYSIS (EDA)
• What is EDA ?
• Aim of EDA
• EDA Tools
• EDA Techniques
• EDA vs CDA
• Steps of EDA
• Application of EDA with personal email
WHAT IS EXPLORATORY DATA ANALYSIS
REF:HTTPS://WWW.IBM.COM/CLOUD/LEARN/EXPLORATORY-DATA-ANALYSIS#TOC-TYPES-OF-E-64HSTW2A
• Exploratory data analysis (EDA) is used by data scientists to analyze and
investigate data sets and summarize their main characteristics, often
employing data visualization methods
• It helps determine how best to manipulate data sources to get the answers
you need, making it easier for data scientists to discover patterns, spot
anomalies, test a hypothesis, or check assumptions
WHAT IS EXPLORATORY DATA ANALYSIS
• EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis
testing task and provides a better understanding of data set variables and the relationships
between them
• It can also help determine if the statistical techniques you are considering for data analysis
are appropriate
• Originally developed by American mathematician John Tukey in the 1970s, EDA techniques
continue to be a widely used method in the data discovery process today.
AIM OF EDA
• Maximize insight into a dataset
• Uncover underlying structure (relationship)
• Extract important variables
• Detect outliers and anomalies
• Test underlying assumptions
• Develop valid models
• Determine optimal factor settings (Xs)
VISUAL AIDS FOR EDA
• Line chart
• Bar chart
• Scatter plot
• Area plot and stacked plot
• Pie chart
• Table chart
• Polar chart
• Histogram
• Lollipop chart
LINE CHART
A line chart is used to illustrate the relationship between two or more continuous variables.
SCATTER PLOT
Scatter plots can be constructed in the following two situations:
1) When one continuous variable is dependent on another variable, which is under
the control of the observer
2) When both continuous variables are independent
There are two important concepts—independent variable and dependent variable. In
statistical modeling or mathematical modeling, the values of dependent variables rely
on the values of independent variables. The dependent variable is the outcome
variable being studied. The independent variables are also referred to as regressors.
The takeaway message here is that scatter plots are used when we need to show the
relationship between two variables, and hence are sometimes referred to as
correlation plots.
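As a small sketch of the "correlation plot" idea above, the strength of the linear relationship a scatter plot displays can also be quantified numerically. The variables below are synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)            # independent variable (regressor)
y = 2.5 * x + rng.normal(0, 1, 200)    # dependent variable with some noise

# Pearson correlation summarizes what the scatter plot would show visually
r = np.corrcoef(x, y)[0, 1]
print(round(r, 2))
```

A value of r close to 1 corresponds to the tight upward-sloping point cloud one would see in the scatter plot.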
A bubble plot is a manifestation of the scatter plot where each data point on the graph is shown as a
bubble. Each bubble can be illustrated with a different colour, size, and appearance.
• The stacked plot can be useful when we want to visualize the cumulative effect of multiple variables
being plotted on the y axis.
• The purpose of the pie chart is to communicate proportions, and it is widely accepted.
• Histogram plots are used to depict the distribution of any continuous variable. These types of plots are
very popular in statistical analysis
• A lollipop chart can be used to display ranking in the data. It is similar to an ordered bar chart.
CHOOSING THE BEST CHART
EXPLORATORY VS CONFIRMATORY
DATA ANALYSIS
EDA vs CDA:
• EDA: No hypothesis at first. CDA: Starts with a hypothesis.
• EDA: Generates hypotheses. CDA: Tests the null hypothesis.
• EDA: Uses (mostly) graphical methods. CDA: Uses statistical models.
Missing Values
• If there are missing values in the Dataset before doing any statistical analysis, we
need to handle those missing values.
There are mainly three types of missing values.
• MCAR(Missing completely at random): These values do not depend on any other
features.
• MAR(Missing at random): These values may be dependent on some other features.
• MNAR(Missing not at random): These missing values have some reason for why
they are missing.
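The missing-value types above matter because they guide the handling strategy. A minimal pandas sketch with made-up data, assuming simple strategies (appropriate mainly for MCAR values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 40, 33],
                   'income': [50000, 60000, np.nan, 45000]})

# First step: count missing values per column
print(df.isna().sum())

# For MCAR values, simple strategies are often acceptable:
filled = df.fillna(df.mean())   # impute with the column mean
dropped = df.dropna()           # or drop incomplete rows
```

For MAR and MNAR values, more careful imputation based on the related features (or on the missingness mechanism itself) is usually needed.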
HANDLING OUTLIERS
• Outliers are the values that are far beyond the next nearest data points.
• There are two types of outliers:
• Univariate outliers: Univariate outliers are the data points whose values lie
beyond the range of expected values based on one variable.
• Multivariate outliers: While plotting data, some values of one variable may not
lie beyond the expected range, but when you plot the data with some other
variable, these values may lie far from the expected value
Box plot: In descriptive statistics, a box plot is a method for graphically depicting groups of
numerical data through their quartiles. Box plots may also have lines extending vertically
from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence
the terms box-and-whisker plot and box-and-whisker diagram. Outliers will appear separate
from the plot.
Outliers can be dropped only if they are garbage values. Example: height of an adult = 0 ft. This
cannot be true, as height cannot be zero. In this case, the outlier can be removed.
If the outliers have extreme values, they can be removed. For example, if all the data points
are clustered between zero to 10, but one point lies at 100, then we can remove this point.
If you cannot drop outliers, you can normalize the data. This way, the extreme data points
are pulled to a similar range.
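The box plot whisker rule mentioned above (values beyond 1.5 × IQR from the quartiles appear separate from the plot) can be applied directly in code; the data here are illustrative:

```python
import pandas as pd

s = pd.Series([2, 3, 4, 5, 6, 7, 100])  # 100 is an extreme value

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # box plot whisker bounds

outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [100]
```

Points flagged this way are exactly the ones a box plot would draw outside the whiskers.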
• Normal Distribution or Symmetric Distribution : If a box plot has equal
proportions around the median, we can say distribution is symmetric or
normal.
• Positively Skewed : For a distribution that is positively skewed, the box plot will
show the median closer to the lower or bottom quartile.
A distribution is considered "Positively Skewed" when mean > median. It means
the data contain a higher frequency of high-valued scores.
• Negatively Skewed : For a distribution that is negatively skewed, the box plot
will show the median closer to the upper or top quartile.
A distribution is considered "Negatively Skewed" when mean < median. It means
the data contain a higher frequency of low-valued scores.
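The mean-versus-median rule above can be checked directly in code; the data values below are illustrative:

```python
from statistics import mean, median

data = [1, 2, 2, 3, 3, 3, 4, 20]  # one long tail of high values

# mean > median indicates positive (right) skew
print(mean(data), median(data))
print(mean(data) > median(data))  # True -> positively skewed
```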
DIFFERENTIATE BETWEEN UNIVARIATE, BIVARIATE, AND MULTIVARIATE ANALYSIS
•Univariate – When we analyze one variable at a time, it is called univariate data analysis. This
analysis aims to describe the variable in question and find patterns that exist within it.
Example: height of students
•Bivariate – Bivariate data involves two different variables. The analysis of this type of data
deals with causes and relationships. The investigation determines the relationship between the
two variables, where one of the variables is the target variable. Example: temperature and ice
cream sales in the summer season.
•Multivariate – Analyzing three or more variables together is categorized under multivariate
data analysis. It is similar to a bivariate but contains more than one dependent variable.
Example: data for house price prediction
MENTION THE TWO KINDS OF TARGET VARIABLES FOR
PREDICTIVE MODELING
• Numerical/Continuous variable – Variables whose values lie within a range and could be any value
in that range; at the time of prediction, values are not bound to come from the same range either.
• For example: Height of students – 5; 5.1; 6; 6.7; 7; 4.5; 5.11
Here the range of the values is (4, 7)
And the height of a new student may or may not be a value from this range.
• Categorical variable – Variables that can take on one of a limited, and usually fixed, number of
possible values, assigning each individual or other unit of observation to a particular group on the
basis of some qualitative property.
A categorical variable that can take on exactly two values is termed a binary variable or a
dichotomous variable. Categorical variables with more than two possible values are called
polytomous variables
• For example: Exam Result: Pass, Fail (Binary categorical variable)
The blood-type of a person: A, B, O, AB (polytomous categorical variable)
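The binary and polytomous examples above can be represented with pandas' categorical dtype; a small sketch with the same example values:

```python
import pandas as pd

# Binary (dichotomous) categorical variable: exactly two possible values
result = pd.Series(['Pass', 'Fail', 'Pass'], dtype='category')

# Polytomous categorical variable: more than two possible values
blood = pd.Series(['A', 'B', 'O', 'AB'], dtype='category')

print(result.nunique())  # 2
print(blood.nunique())   # 4
```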
UNIVARIATE ANALYSIS FOR NUMERICAL DATA USING BOX PLOT AND KDE (KERNEL DENSITY ESTIMATE)
Box plot and KDE both show that the average population age lies roughly between 25 and 50
years, and the mean of the population is 38 years. The right skewness (positive skew) in the KDE
plot shows that more of the population was between 20 and 30 years and very few aged people
were in the sample, which can be verified from the box plot too, as the box is aligned more
towards Q1 and not evenly distributed
UNIVARIATE ANALYSIS FOR
CATEGORICAL VARIABLES
Bar plots and Pie Charts are a great way to analyze categorical variables to
understand the categorical data.
Here, the bar plot and pie chart show that the "Course" Course_Type was highest in
number, with 51.3% of people subscribing to such courses, followed by the "Program"
Course_Type, and the least number for the "Degree" Course_Type, with only 0.3%
subscribing to such courses
BIVARIATE ANALYSIS FOR NUMERICAL-NUMERICAL
• Bivariate analysis can be performed for any two sets of variables; typically, it is
performed using an independent variable and the dependent variable.
• Numerical-Numerical – Here, one of the numerical variables is the target variable and the
other one is any other independent numerical variable. A Scatter plot is a great way for
understanding numerical-numerical variable data relationships. In the example shown, sales is
the target numerical variable plotted on the y-axis against user-traffic numerical variables on
the x-axis
Types:
Scatter Plot
Pair Plot
Correlation Matrix
The scatter plot helps us in
understanding that Sales
increases linearly as
User_Traffic goes up
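The correlation matrix listed among the types above can be computed directly with pandas. The User_Traffic and Sales data here are synthetic, generated to mimic the linear relationship the slide describes:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
traffic = rng.uniform(100, 1000, 50)
sales = 0.8 * traffic + rng.normal(0, 20, 50)  # sales grows with traffic

df = pd.DataFrame({'User_Traffic': traffic, 'Sales': sales})

# A correlation near 1 confirms the linear trend seen in the scatter plot
print(df.corr().round(2))
```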
BIVARIATE CATEGORICAL —
CATEGORICAL ANALYSIS
One of the Categorical variables is the target variable and another one can be
an independent categorical variable. In the example below, the target variable
is about default next month represented by either 0 or 1 against the
education categorical independent variable.
Such relationships can be visualized using double bar or stacked bar charts. Here, we can
see how defaulters (represented by 1: orange color) are highest in number for High
School, then University, and then for the Others category, even though they are so
few in number.
Numerical-Categorical – Here, one variable is numerical (typically the target) and the other is
categorical; in such a case, bar plots or strip plots are a great way of understanding the data. Below is
an example of a bar and strip plot where sales, the numerical (target) variable, is on the
y-axis and course_domain is the categorical variable represented on the x-axis.
The bar plot helps in understanding that the "Business" course_domain gives the highest
sales, followed by Finance, then Development, with the least sales from the Software
course_domain. Business gives the highest sales, and the corresponding strip plot
helps in understanding that the minimum sale value for this category is quite high
compared with the others, and its maximum sales value is lower than the others, but in
the end it gives the most sales
HOW CAN THE DATA BE NORMALIZED/
FEATURE SCALING?
• Data can be normalized by either transforming the data or by
scaling the data down in a particular range.
1. Transformation – If the data is right-skewed (positive
skew), log transformation is the best way to make it behave
like a normal distribution, and if the data is left-skewed
(negative skew), exponential transformation helps in
transforming it into a normal distribution.
2. Scaling – There are two scalers used on a wide base
• Normalization (Min-Max Scaler): This scales the data down to the range 0 to 1, where
the minimum value corresponds to 0 and the maximum to 1.
A value is normalized as follows: y = (x – min) / (max – min), where the minimum and
maximum values pertain to the value x being normalized
• Standardization (Standard Scaler): This scaler helps in converting a normal distribution into
a standard normal distribution, where the mean is represented by 0 and the standard
deviation is represented by 1. A value is standardized as follows: y = (x – mean) /
standard_deviation
Note: If the distribution of the quantity is normal, then it should be standardized, otherwise,
the data should be normalized. Standardization can give values that are both positive and
negative centered around zero. It may be desirable to normalize data after it has been
standardized.
• Normalization is used when the data values
are skewed and do not follow gaussian distribution.
• The data values get converted between a range of 0 and 1.
• Normalization makes the data scale free.
• Standardization is used on the data values that are normally
distributed. Further, by applying standardization, we tend to
make the mean of the dataset as 0 and the standard deviation
equivalent to 1.
• That is, by standardizing the values, we get the following
statistics of the data distribution
• mean = 0
• standard deviation = 1
• the data set becomes easier to analyze, as the mean becomes 0 and
the data have unit variance.
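The two formulas above, y = (x – min) / (max – min) and y = (x – mean) / standard_deviation, can be applied directly with NumPy; the data values are illustrative:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-Max normalization: y = (x - min) / (max - min) -> range [0, 1]
normalized = (x - x.min()) / (x.max() - x.min())
print(normalized.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]

# Standardization: y = (x - mean) / std -> mean 0, standard deviation 1
standardized = (x - x.mean()) / x.std()
```

After standardization, the mean is 0 and the standard deviation is 1, matching the note above.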
EXPLORATORY DATA ANALYSIS TECHNIQUES
• There are four exploratory data analysis techniques that data experts use, which include:
• Univariate Non-Graphical
• This is the simplest type of EDA, where data has a single variable. Since there is only one variable, data professionals do not have to
deal with relationships.
• Univariate Graphical
• Non-graphical techniques do not present the complete picture of data. Therefore, for comprehensive EDA, data specialists
implement graphical methods, such as stem-and-leaf plots, box plots, and histograms.
EXPLORATORY DATA ANALYSIS TECHNIQUES
• Multivariate Non-Graphical
• Multivariate data consists of several variables. Non-graphic multivariate EDA methods illustrate relationships between 2 or more
data variables using statistics or cross-tabulation
• Multivariate Graphical
• This EDA technique makes use of graphics to show relationships between 2 or more datasets. The widely-used multivariate graphics
include bar chart, bar plot, heat map, bubble chart, run chart, multivariate chart, and scatter plot.
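The cross-tabulation mentioned under multivariate non-graphical EDA can be produced with pandas; the course and subscription data below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'course': ['Course', 'Program', 'Course', 'Degree'],
                   'subscribed': ['yes', 'no', 'yes', 'no']})

# Cross-tabulation: a non-graphical way to relate two categorical variables
ct = pd.crosstab(df['course'], df['subscribed'])
print(ct)
```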
STEPS OF EDA
• Generate good research questions
• Data restructuring: You may need to make new variables from the existing ones
• Instead of using two variables, obtaining rates or percentages of them
• Creating dummy variables for categorical variables
• Based on the research questions, use appropriate graphical tools and obtain descriptive statistics. Try to understand
the data structure, relationships, anomalies, unexpected behaviors
STEPS OF EDA
• Try to identify confounding variables, interaction relations, and multicollinearity, if any.
• Handle missing observations
• Decide on the need of transformation (on response and/or explanatory variables)
• Decide on the hypothesis based on your research questions
AFTER EDA
• Confirmatory Data Analysis: Verify the hypothesis by statistical analysis
• Get conclusions and present the results using various graphical representation
APPLICATION OF
EDA
APPLICATION OF EDA
HANDS-ON
TOPIC COVERED:
• LOADING THE DATASET
• DATA TRANSFORMATION
• DATA ANALYSIS
OUTCOME:
YOU WILL LEARN HOW TO EXPORT ALL YOUR EMAILS AS A DATASET, HOW TO IMPORT THEM INTO A
PANDAS DATAFRAME, HOW TO VISUALIZE THEM, AND THE DIFFERENT TYPES OF INSIGHTS YOU CAN GAIN.
REF: BOOK: HANDS-ON EXPLORATORY DATA ANALYSIS WITH PYTHON BY SURESH KUMAR MUKHIYA AND USMAN AHMED.
CHAPTER 3
EDA WITH PERSONAL EMAIL- STEP 1
1. Here are the steps to follow (data generation and collection):
a) Log in to your personal Gmail account.
b) Go to the following link: https://takeout.google.com/settings/takeout
c) Deselect all the items but Gmail, as shown in the following screenshot:
EDA WITH PERSONAL EMAIL-STEP 1
d. Select the archive format, as shown in the following screenshot
• Note that I selected Send download link by email, One-time archive, .zip, and the
maximum allowed size.
• You can customize the format. Once done, hit Create archive
• You will get an email archive that is ready for download.
You can use the path to the mbox file for further
analysis, which will be discussed further.
EDA WITH PERSONAL EMAIL-STEP 2
Loading the dataset
• I loaded my own personal email from Google Mail. For privacy reasons, you shouldn't share the
dataset. However, I will show you different EDA operations that you can perform to analyze several
aspects of your email behavior:
1. Let's load the required libraries:
• import numpy as np
• import pandas as pd
• import matplotlib.pyplot as plt
• Note that for this analysis, we need the mailbox package, which is part of the Python standard library, so no
separate installation is required.
EDA WITH PERSONAL EMAIL-STEP 2
Loading the dataset
2. When you have loaded the libraries, load the dataset:
• import mailbox
• mboxfile = "PATH TO DOWNLOADED MBOX FILE"
• mbox = mailbox.mbox(mboxfile)
• mbox
• Note that it is essential that you replace the mbox file path with your own path.
• The output of the preceding code is as follows:
<mailbox.mbox at 0x7f124763f5c0>
• The output indicates that the mailbox has been successfully created.
EDA WITH PERSONAL EMAIL-STEP 2
Loading the dataset
3. Next, let's see the list of available keys:
for key in mbox[0].keys():
print(key)
• The output of the preceding code is as follows:
• The preceding output shows the list of keys that are present in
the extracted dataset.
EDA WITH PERSONAL EMAIL-STEP 3
A. Data Transformation
• Although there are a lot of
objects returned by the
extracted data, we do not need
all the items. We will only
extract the required fields.
• Data cleansing is one of the
essential steps in the data
analysis phase.
• For our analysis, all we need is
data for the following: subject,
from, date, to, label, and
thread.
B. Data cleansing
Let's create a CSV file with only the required fields. Let's start with the
following steps
1. Import the csv package:
• import csv
2. Create a CSV file with only the required attributes:
with open('mailbox.csv', 'w') as outputfile:
    writer = csv.writer(outputfile)
    writer.writerow(['subject', 'from', 'date', 'to', 'label', 'thread'])
    for message in mbox:
        writer.writerow([
            message['subject'],
            message['from'],
            message['date'],
            message['to'],
            message['X-Gmail-Labels'],
            message['X-GM-THRID']
        ])
• The preceding code produces a CSV file named mailbox.csv. Next, instead of loading the
mbox file, we can use the CSV file for loading, which will be smaller than the
original dataset.
EDA WITH PERSONAL EMAIL-STEP 3
C. Loading the CSV file
• We will load the CSV file.
Refer to the following code
block:
dfs = pd.read_csv('mailbox.csv',
                  names=['subject', 'from', 'date',
                         'to', 'label', 'thread'])
• The preceding code will
generate a pandas data frame
with only the required fields
stored in the CSV file
D. Converting the date
• Next, we will convert the date.
• Check the datatypes of each column as shown here:
• dfs.dtypes
• The output of the preceding code is as follows:
• Note that a date field is an object. So, we need to convert it into a
DateTime argument.
• In the next step, we are going to convert the date field into an
actual DateTime argument. We can do this by using the pandas
to_datetime() method. See the following code:
dfs['date'] = dfs['date'].apply(lambda x: pd.to_datetime(x,
errors='coerce', utc=True))
EDA WITH PERSONAL EMAIL-STEP 3
E. Removing NaN values
• Next, we are going to remove NaN values from the field.
• We can do this as follows:
• dfs = dfs[dfs['date'].notna()]
• Next, it is good to save the preprocessed file into a separate CSV file in case we need it again.
• We can save the data frame into a separate CSV file as follows:
• dfs.to_csv('gmail.csv')
EDA WITH PERSONAL EMAIL-STEP 4
• Applying descriptive statistics
• Having preprocessed the dataset, let's
do some sanity checking using
descriptive statistics techniques
• We can implement this as shown here:
dfs.info()
• The output of the preceding code is as
follows:
• Let's check the first few entries of the email dataset:
dfs.head(10)
• The output of the preceding code is as follows:
• Note that our data frame so far contains six different columns. Take a look at
the from field:
• It contains both the name and the email. For our analysis, we only need an
email address. We can use a regular expression to refactor the column.
EDA WITH PERSONAL EMAIL-STEP 5
• Data refactoring
1. First of all, import the regular
expression package:
import re
2. Next, let's create a function that takes
an entire string from any column and
extracts an email address:
def extract_email_ID(string):
    email = re.findall(r'<(.+?)>', string)
    if not email:
        email = list(filter(lambda y: '@' in y,
                            string.split()))
    return email[0] if email else np.nan
3. Next, let's apply the function to the from column:
dfs['from'] = dfs['from'].apply(lambda x: extract_email_ID(x))
4. Next, we are going to refactor the label field. The logic is
simple. If an email is from your email address, then it is the sent
email. Otherwise, it is a received email, that is, an inbox email:
myemail = 'itsmeskm99@gmail.com'
dfs['label'] = dfs['from'].apply(lambda x: 'sent' if x==myemail
else 'inbox')
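A quick sanity check of the extraction logic on sample strings (the addresses below are made up for illustration):

```python
import re
import numpy as np

def extract_email_ID(string):
    # Prefer an address enclosed in angle brackets, e.g. "Name <addr>"
    email = re.findall(r'<(.+?)>', string)
    if not email:
        # Otherwise, fall back to any whitespace-separated token with '@'
        email = list(filter(lambda y: '@' in y, string.split()))
    return email[0] if email else np.nan

print(extract_email_ID('John Doe <john@example.com>'))  # john@example.com
print(extract_email_ID('jane@example.com'))             # jane@example.com
```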
EDA WITH PERSONAL EMAIL-STEP 6
• Dropping columns
1. Note that the to column only contains your own email. So, we can drop this column:
dfs.drop(columns='to', inplace=True)
2. This drops the to column from the data frame. Let's display the first 10 entries now:
dfs.head(10)
The output of the preceding code is as follows: Check the preceding output. The fields are cleaned. The data is transformed into
the correct format.
EDA WITH PERSONAL EMAIL-STEP 7
• Refactoring timezones
1. We can refactor timezones by using the method given here:
import datetime
import pytz
def refactor_timezone(x):
    est = pytz.timezone('US/Eastern')
    return x.astimezone(est)
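The later analysis groups counts by dfs.dayofweek, which is not a column in the original data frame. A hedged sketch of how such helper columns could be derived after the timezone conversion, using pandas' dt accessor on made-up timestamps (the column names dayofweek and hour are assumptions to match the later slides):

```python
import pandas as pd

dfs = pd.DataFrame({'date': pd.to_datetime(
    ['2021-06-14 09:00+00:00', '2021-06-15 18:30+00:00'], utc=True)})

# Convert to US/Eastern (as refactor_timezone does), then derive helper columns
dfs['date'] = dfs['date'].dt.tz_convert('US/Eastern')
dfs['dayofweek'] = dfs['date'].dt.day_name()
dfs['hour'] = dfs['date'].dt.hour

print(dfs[['dayofweek', 'hour']])
```

2021-06-14 09:00 UTC becomes 05:00 Monday in US/Eastern (EDT, UTC-4), which is the kind of local-time view the day-of-week and time-of-day plots rely on.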
EDA WITH PERSONAL EMAIL-STEP 7
DATA ANALYSIS
• This is the most important part of EDA. This is the part where we gain insights from the data that we have.
• Let's answer the following questions one by one:
1. How many emails did I send during a given timeframe?
2. At what times of the day do I send and receive emails with Gmail?
3. What is the average number of emails per day?
4. What is the average number of emails per hour?
5. Whom do I communicate with most frequently?
6. What are the most active emailing days?
7. What am I mostly emailing about?
• In the following sections, we will answer the preceding questions.
1. NUMBER OF EMAILS
TIME OF DAY
AVERAGE EMAILS PER DAY AND HOUR
• The average emails per hour and per day are illustrated by the preceding graphs.
• In my case, most email
communication happened
between 2018 and 2020.
NUMBER OF EMAILS PER DAY
• Let's find the busiest day of the week in terms of emails:
• counts = dfs.dayofweek.value_counts(sort=False)
• counts.plot(kind='bar')
• The output of the preceding code is as follows:
SUMMARY
• We imported data from our own Gmail accounts in mbox format.
• We loaded the dataset and performed some primitive EDA techniques, including data loading, data
transformation, and data analysis.
• We also tried to answer some basic questions about email communication.
VERTICAL VS. HORIZONTAL DATA SCIENTISTS
Vertical data scientists have very deep knowledge in some narrow field.
They might be computer scientists very familiar with computational complexity of all sorting algorithms
Or
Software engineer with years of experience writing Python code (including graphic libraries) applied to API
development and web crawling technology
OR
Database guy with strong data modeling, data warehousing, graph databases, Hadoop and NoSQL
expertise. Or a predictive modeler expert in Bayesian networks, SAS and SVM.
HORIZONTAL DATA SCIENTISTS
• They are a blend of business analysts, statisticians, computer scientists and domain experts. They
combine vision with technical knowledge
• They know about more modern, data-driven techniques applicable to unstructured, streaming, and big
data
• They can design robust, efficient, simple, replicable and scalable code and algorithms.
Horizontal data scientists also come with the following features:
• They have some familiarity with six sigma concepts. In essence, speed is more important than perfection, for these analytic practitioners.
• They have experience in producing success stories out of large, complicated, messy data sets - including in measuring the success.
• Experience in identifying the real problem to be solved, the data sets (external and internal) they need, the data base structures they need, the
metrics they need, rather than being passive consumers of data sets produced or gathered by third parties lacking the skills to collect / create
the right data.
• They know rules of thumb and pitfalls to avoid, more than theoretical concepts. However they have a bit more than just basic knowledge of
computational complexity, good sampling and design of experiment, robust statistics and cross-validation, modern data base design and
programming languages (R, scripting languages, Map Reduce concepts, SQL)
• Advanced Excel and visualization skills.
• They can help produce useful dashboards (the ones that people really use on a daily basis to make decisions) or alternate tools to
communicate insights found in data (orally, by email or automatically - and sometimes in real time machine-to-machine mode).
• They think outside the box.
• They are innovators who create truly useful stuff
Vertical data scientists are the by-product of our rigid university system, which trains people to become either
a computer scientist, a statistician, an operations researcher, or an MBA, but not all four at the same time.
This is one of the reasons for offering a data science program, and why recruiters can't find data scientists.
Mostly they find and recruit vertical data scientists. Companies are not yet used to identifying horizontal data
scientists, the true money makers and ROI generators among analytic professionals.
VERTICAL-SPECIFIC ALGORITHMS (ML
WORKFLOW)
DETAILED CLASSIFICATION OF ML
TECHNIQUES
• Part 1: Data Pre-processing (before ML)
• Part 2: Regression
 Simple Linear Regression
 Multiple Linear Regression
 Polynomial Regression
 Support Vector Regression (SVR)
 Decision Tree Regression
 Random Forest Regression
 Evaluating Regression Model Performance (KPIs): R^2, RMSE, k-fold cross-validation, and score
06/17/2021
CONT..
Part 3: Classification
 Logistic Regression
 K-Nearest Neighbors (K-NN)
 Support Vector Machine (SVM)
 Kernel SVM
 Naive Bayes
 Decision Tree Classification
 Random Forest Classification
 Evaluating Classification Models Performance: Confusion matrix, Recall, Precision, Sensitivity / specificity
CONT..
Part 4: Clustering
 K-Means Clustering
 Hierarchical Clustering
Part 5: Association Rule Learning
Part 6: Reinforcement Learning
Part 7: Natural Language Processing
Part 8: Deep Learning:
 Artificial Neural Networks (ANN)
 Convolutional Neural Networks (CNN)
CONT..
Part 9: Dimensionality Reduction
 Principal Component Analysis (PCA)
 Linear Discriminant Analysis (LDA)
 Kernel PCA
Part 10: Model Selection and Deployment
REFERENCE
• Book: Hands-On Exploratory Data Analysis with Python by Suresh Kumar Mukhiya and Usman Ahmed. Chapters 3 & 11.
• https://www.datasciencecentral.com/profiles/blogs/vertical-vs-horizontal-data-scientists.
• https://towardsdatascience.com/vertical-vs-horizontal-ai-startups-e2bdec23aa16
Data_Analytics_for_IoT_Solutions.pptx.pdf

  • 1.
    MODULE: 6 DATA ANALYTICSFOR IOT SOLUTIONS
  • 2.
    MODULE: 6 DATAANALYTICS FOR IOT SOLUTIONS • Data generation, Data gathering, Data Pre-processing, data analyzation, application of analytics, Exploratory Data Analysis, vertical-specific algorithms.
  • 3.
    LIFECYCLE OF THEDATA SCIENCE PROJECT
  • 4.
    APPLICATION OF EXPLORATORYDATA ANALYSIS (EDA) • What is EDA ? • Aim of EDA • EDA Tools • EDA Techniques • EDA vs CDA • Steps of EDA • Application of EDA with personal email
  • 5.
    WHAT IS EXPLORATORYDATA ANALYSIS REF:HTTPS://WWW.IBM.COM/CLOUD/LEARN/EXPLORATORY-DATA-ANALYSIS#TOC-TYPES-OF-E-64HSTW2A • Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods • It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions
  • 6.
    WHAT IS EXPLORATORYDATA ANALYSIS • EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task and provides a better understanding of data set variables and the relationships between them • It can also help determine if the statistical techniques you are considering for data analysis are appropriate • Originally developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be a widely used method in the data discovery process today.
  • 7.
    AIM OF EDA •Maximize insight into a dataset • Uncover underlying structure (relationship) • Extract important variables • Detect outliers and anomalies • Test underlying assumptions • Develop valid models • Determine optimal factor settings (Xs)
  • 8.
    VISUAL AIDS FOREDA • Line chart • Bar chart • Scatter plot • Area plot and • stacked plot • Pie chart • Table chart • Polar chart • Histogram • Lollipop chart
  • 9.
    LINE CHART A linechart is used to illustrate the relationship between two or more continuous variables.
  • 11.
    Scatter plots canbe constructed in the following two situations: 1) When one continuous variable is dependent on another variable, which is under the control of the observer 2) When both continuous variables are independent There are two important concepts—independent variable and dependent variable. In statistical modeling or mathematical modeling, the values of dependent variables rely on the values of independent variables. The dependent variable is the outcome variable being studied. The independent variables are also referred to as regressors. The takeaway message here is that scatter plots are used when we need to show the relationship between two variables, and hence are sometimes referred to as correlation plots.
  • 12.
    A bubble plotis a manifestation of the scatter plot where each data point on the graph is shown as a bubble. Each bubble can be illustrated with a different colour, size, and appearance.
  • 13.
    • The stackedplot can be useful when we want to visualize the cumulative effect of multiple variables being plotted on the y axis. • The purpose of the pie chart is to communicate proportions and it is • widely accepted. • Histogram plots are used to depict the distribution of any continuous variable. These types of plots are very popular in statistical analysis • A lollipop chart can be used to display ranking in the data. It is similar to an ordered bar chart.
  • 14.
  • 15.
    EXPLORATORY VS CONFIRMATORY DATAANALYSIS EDA CDA No hypothesis at first Start with hypothesis Generate hypothesis Test the null hypothesis Uses graphical methods (mostly) Uses statistical models
  • 16.
    Missing Values • Ifthere are missing values in the Dataset before doing any statistical analysis, we need to handle those missing values. There are mainly three types of missing values. • MCAR(Missing completely at random): These values do not depend on any other features. • MAR(Missing at random): These values may be dependent on some other features. • MNAR(Missing not at random): These missing values have some reason for why they are missing.
  • 17.
    HANDLING OUTLIERS • Outliersare the values that are far beyond the next nearest data points. • There are two types of outliers: • Univariate outliers: Univariate outliers are the data points whose values lie beyond the range of expected values based on one variable. • Multivariate outliers: While plotting data, some values of one variable may not lie beyond the expected range, but when you plot the data with some other variable, these values may lie far from the expected value
Box plot: In descriptive statistics, a box plot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers appear separate from the plot. Outliers can be dropped if they are garbage values. Example: height of an adult = 0 ft. This cannot be true, as height cannot be zero, so in this case the outlier can be removed. Outliers with extreme values can also be removed: for example, if all the data points are clustered between zero and 10 but one point lies at 100, we can remove that point. If you cannot drop outliers, you can normalize the data, so the extreme data points are pulled into a similar range.
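The box-plot rule can be sketched numerically: values beyond 1.5×IQR from the quartiles (the standard whisker limits) are flagged as outliers. The sample below is made up, echoing the "clustered between zero and 10, one point at 100" example:

```python
import numpy as np

# Made-up sample: values clustered near 0-10, with one extreme point
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Whisker limits used by a standard box plot
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Everything outside the whiskers is flagged as an outlier
outliers = data[(data < lower) | (data > upper)]
```

Here only the point at 100 falls outside the whiskers.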
• Normal Distribution or Symmetric Distribution: if a box plot has equal proportions around the median, we can say the distribution is symmetric or normal.
• Positively Skewed: for a distribution that is positively skewed, the box plot will show the median closer to the lower or bottom quartile. A distribution is considered "positively skewed" when mean > median. It means the data contain a higher frequency of high-valued scores.
• Negatively Skewed: for a distribution that is negatively skewed, the box plot will show the median closer to the upper or top quartile. A distribution is considered "negatively skewed" when mean < median. It means the data contain a higher frequency of low-valued scores.
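These mean-versus-median rules of thumb can be checked directly; the two samples below are made up:

```python
import statistics

# Positively skewed (right-skewed) sample: a long tail of large values
right_skewed = [1, 2, 2, 3, 3, 4, 20]
mean_r = statistics.mean(right_skewed)
median_r = statistics.median(right_skewed)

# Negatively skewed (left-skewed) sample: a long tail of small values
left_skewed = [-20, 1, 2, 2, 3, 3, 4]
mean_l = statistics.mean(left_skewed)
median_l = statistics.median(left_skewed)

# The tail pulls the mean away from the median in its own direction
print(mean_r, median_r)  # mean > median for positive skew
print(mean_l, median_l)  # mean < median for negative skew
```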
DIFFERENTIATE BETWEEN UNIVARIATE, BIVARIATE, AND MULTIVARIATE ANALYSIS
• Univariate – When we analyze one variable at a time, it is called univariate data analysis. This analysis aims to describe the variable in question and find patterns that exist within it. Example: heights of students.
• Bivariate – Bivariate data involves two different variables. The analysis of this type of data deals with causes and relationships; the investigation determines the relationship between the two variables, where one of them is the target variable. Example: temperature and ice cream sales in the summer season.
• Multivariate – Analyzing three or more variables together is categorized as multivariate data analysis. It is similar to bivariate analysis but can contain more than one dependent variable. Example: data for house price prediction.
THE TWO KINDS OF TARGET VARIABLES FOR PREDICTIVE MODELING
• Numerical/continuous variable – a variable whose values lie within a range and can take any value in that range; at prediction time, new values are not bound to come from the same range either.
For example, heights of students: 5; 5.1; 6; 6.7; 7; 4.5; 5.11. Here the range of the values is (4, 7), and the height of a new student may or may not fall within this range.
• Categorical variable – a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group on the basis of some qualitative property. A categorical variable that can take on exactly two values is termed a binary (or dichotomous) variable; categorical variables with more than two possible values are called polytomous variables.
For example, an exam result: Pass, Fail (binary categorical variable); the blood type of a person: A, B, O, AB (polytomous categorical variable).
UNIVARIATE ANALYSIS FOR NUMERICAL DATA USING BOX PLOT AND KDE (KERNEL DENSITY ESTIMATE)
The box plot and KDE both show that the population's age lies roughly between 25 and 50 years, with a mean of 38 years. The skewness in the KDE plot shows that more of the population was between 20 and 30 years old and that very few aged people were in the sample, which can be verified from the box plot too, as the box is aligned more towards Q1 and not evenly distributed.
UNIVARIATE ANALYSIS FOR CATEGORICAL VARIABLES
Bar plots and pie charts are a great way to analyze categorical variables and understand categorical data. Here, the bar plot and pie chart show that the “Course” Course_Type was highest in number, with 51.3% of people subscribing to such courses, followed by the “Program” Course_Type, with the “Degree” Course_Type last at only 0.3%.
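A sketch of the same analysis with pandas `value_counts`; the labels and counts below are invented to mimic the slide's proportions, not taken from its actual dataset:

```python
import pandas as pd

# Hypothetical subscription data: 1000 subscribers across three course types
courses = pd.Series(
    ["Course"] * 513 + ["Program"] * 484 + ["Degree"] * 3,
    name="Course_Type",
)

counts = courses.value_counts()                # absolute frequencies (bar plot)
shares = courses.value_counts(normalize=True)  # proportions (pie chart)
```

`shares` gives exactly the kind of percentage breakdown (0.513, 0.484, 0.003) described above; plotting it is then a one-liner with `shares.plot(kind='pie')` or `counts.plot(kind='bar')`.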
BIVARIATE ANALYSIS FOR NUMERICAL-NUMERICAL DATA
• Bivariate analysis can be performed for any two sets of variables; it is performed using an independent variable and the dependent variable.
• Numerical-Numerical – Here, one of the numerical variables is the target variable and the other is an independent numerical variable. A scatter plot is a great way of understanding numerical-numerical relationships. In the example shown, sales is the numerical target variable plotted on the y-axis against the user-traffic numerical variable on the x-axis.
Types: scatter plot, pair plot, correlation matrix.
The scatter plot helps us understand that User_Traffic increases linearly as Sales goes up.
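The linear relationship read off the scatter plot can also be quantified with a correlation matrix; the sales/traffic numbers below are invented for illustration:

```python
import pandas as pd

# Invented data: sales grows roughly linearly with user traffic
df = pd.DataFrame({
    "user_traffic": [100, 250, 400, 600, 900],
    "sales":        [12,  30,  47,  70,  105],
})

# Pearson correlation matrix: values near +1 confirm a strong linear relationship
corr = df.corr()
```

For this near-linear data, `corr.loc["user_traffic", "sales"]` is very close to 1, matching what the scatter plot shows visually.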
BIVARIATE CATEGORICAL-CATEGORICAL ANALYSIS
One of the categorical variables is the target variable, and the other can be an independent categorical variable. In the example below, the target variable indicates default next month, represented by either 0 or 1, plotted against the education categorical independent variable using double bar or stacked bar charts. Here, we can see that defaulters (represented by 1, in orange) are highest in number for High School, then University, and then the Others category, even though they are so few in number.
Numerical-Categorical – Here, one variable is numerical and the other categorical (the target can be either), and in such cases, bar plots or strip plots are a great way of understanding the data. Below is an example of a bar and strip plot where sales, the numerical target variable, is on the y-axis and course_domain, the categorical variable, is on the x-axis. The bar plot helps in understanding that the “Business” course_domain gives the highest sales, followed by Finance, then Development, with the least sales from the Software course_domain. The corresponding strip plot helps in understanding that the minimum sale value for the Business category is quite high compared with the others, and its maximum sale value is lower than the others, yet in the end it gives the most sales.
A LOLLIPOP CHART CAN BE USED TO DISPLAY RANKING IN THE DATA. IT IS SIMILAR TO AN ORDERED BAR CHART.
HOW CAN THE DATA BE NORMALIZED (FEATURE SCALING)?
• Data can be normalized either by transforming the data or by scaling it down into a particular range.
1. Transformation – If the data is right-skewed (positive skew), a log transformation is the best way to make it behave like a normal distribution; if the data is left-skewed (negative skew), an exponential (power) transformation helps transform it towards a normal distribution.
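A sketch of the log transformation on a right-skewed sample (the values are made up). `log1p` computes log(1 + x), which stays safe if the data contains zeros:

```python
import numpy as np

# Right-skewed (positively skewed) made-up sample: a long tail of large values
x = np.array([1, 2, 2, 3, 3, 4, 5, 8, 20, 100], dtype=float)

# log1p = log(1 + x); compresses the tail far more than the bulk
x_log = np.log1p(x)

# Before: the maximum is ~29x the median; after: only ~3x
ratio_raw = x.max() / np.median(x)
ratio_log = x_log.max() / np.median(x_log)
```

The transform pulls the extreme tail values towards the bulk of the distribution, which is exactly why it reduces positive skew.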
2. Scaling – There are two widely used scalers:
• Normalization (min-max scaler): this scales the data down into the 0-1 range, where the minimum value corresponds to 0 and the maximum to 1. A value is normalized as follows: y = (x – min) / (max – min), where the minimum and maximum pertain to the range of the value x being normalized.
• Standardization (standard scaler): this scaler transforms a normal distribution into a standard normal distribution, where the mean becomes 0 and the standard deviation becomes 1. A value is standardized as follows: y = (x – mean) / standard_deviation.
Note: if the distribution of the quantity is normal, it should be standardized; otherwise, the data should be normalized. Standardization can give values that are both positive and negative, centered around zero. It may be desirable to normalize data after it has been standardized.
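The two formulas above can be sketched directly in NumPy; the sample values are invented:

```python
import numpy as np

x = np.array([20.0, 30.0, 40.0, 50.0, 60.0])

# Normalization (min-max scaling): y = (x - min) / (max - min)
y_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (standard scaling): y = (x - mean) / standard_deviation
y_std = (x - x.mean()) / x.std()
```

After min-max scaling the values span exactly [0, 1]; after standardization the sample has mean 0 and standard deviation 1 (and, as noted above, both positive and negative values centered around zero).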
• Normalization is used when the data values are skewed and do not follow a Gaussian distribution.
• The data values get converted into a range between 0 and 1.
• Normalization makes the data scale-free.
• Standardization is used on data values that are normally distributed. By applying standardization, we make the mean of the dataset 0 and the standard deviation 1.
• That is, by standardizing the values, we get the following statistics for the data distribution:
• mean = 0
• standard deviation = 1
• The dataset becomes self-explanatory and easy to analyze, as the mean turns to 0 and it has unit variance.
EXPLORATORY DATA ANALYSIS TECHNIQUES
• There are four exploratory data analysis techniques that data experts use:
• Univariate non-graphical: this is the simplest type of EDA, where the data has a single variable. Since there is only one variable, data professionals do not have to deal with relationships.
• Univariate graphical: non-graphical techniques do not present the complete picture of the data. Therefore, for comprehensive EDA, data specialists use graphical methods, such as stem-and-leaf plots, box plots, and histograms.
EXPLORATORY DATA ANALYSIS TECHNIQUES
• Multivariate non-graphical: multivariate data consists of several variables. Non-graphical multivariate EDA methods illustrate relationships between two or more variables using statistics or cross-tabulation.
• Multivariate graphical: this EDA technique uses graphics to show relationships between two or more variables. Widely used multivariate graphics include the bar chart, heat map, bubble chart, run chart, multivariate chart, and scatter plot.
STEPS OF EDA
• Generate good research questions.
• Data restructuring: you may need to make new variables from the existing ones, for example:
• instead of using two variables, obtaining rates or percentages from them;
• creating dummy variables for categorical variables.
• Based on the research questions, use appropriate graphical tools and obtain descriptive statistics. Try to understand the data structure, relationships, anomalies, and unexpected behaviors.
STEPS OF EDA
• Try to identify confounding variables, interaction relations, and multicollinearity, if any.
• Handle missing observations.
• Decide on the need for transformation (of the response and/or explanatory variables).
• Decide on the hypotheses based on your research questions.
AFTER EDA
• Confirmatory Data Analysis: verify the hypothesis by statistical analysis.
• Draw conclusions and present the results using various graphical representations.
APPLICATION OF EDA HANDS-ON
TOPICS COVERED:
• LOADING THE DATASET
• DATA TRANSFORMATION
• DATA ANALYSIS
OUTCOME: YOU WILL LEARN HOW TO EXPORT ALL YOUR EMAILS AS A DATASET, HOW TO IMPORT THEM INTO A PANDAS DATAFRAME, HOW TO VISUALIZE THEM, AND THE DIFFERENT TYPES OF INSIGHTS YOU CAN GAIN.
REF: BOOK: HANDS-ON EXPLORATORY DATA ANALYSIS WITH PYTHON BY SURESH KUMAR MUKHIYA AND USMAN AHMED, CHAPTER 3
EDA WITH PERSONAL EMAIL - STEP 1
Here are the steps to follow (data generation and collection):
a) Log in to your personal Gmail account.
b) Go to the following link: https://takeout.google.com/settings/takeout
c) Deselect all the items but Gmail, as shown in the following screenshot:
EDA WITH PERSONAL EMAIL - STEP 1
d) Select the archive format, as shown in the following screenshot.
• Note that I selected Send download link by email, One-time archive, .zip, and the maximum allowed size.
• You can customize the format. Once done, hit Create archive.
• You will get an email archive that is ready for download. You can use the path to the mbox file for further analysis, which will be discussed next.
EDA WITH PERSONAL EMAIL - STEP 2
Loading the dataset
• I loaded my own personal email from Google Mail. For privacy reasons, you shouldn't share the dataset. However, I will show you different EDA operations that you can perform to analyze several aspects of your email behavior.
1. Let's load the required libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
• Note that this analysis also needs the mailbox package, which ships with the Python 3 standard library, so no separate installation is usually required.
EDA WITH PERSONAL EMAIL - STEP 2
Loading the dataset
2. When you have loaded the libraries, load the dataset:
import mailbox
mboxfile = "PATH TO DOWNLOADED MBOX FILE"
mbox = mailbox.mbox(mboxfile)
mbox
• Note that it is essential that you replace the mbox file path with your own path.
• The output of the preceding code is as follows: <mailbox.mbox at 0x7f124763f5c0>
• The output indicates that the mailbox has been successfully created.
EDA WITH PERSONAL EMAIL - STEP 2
Loading the dataset
3. Next, let's see the list of available keys:
for key in mbox[0].keys():
    print(key)
• The preceding code prints the list of keys that are present in the extracted dataset.
EDA WITH PERSONAL EMAIL - STEP 3
A. Data transformation
• Although the extracted data contains a lot of objects, we do not need all of them; we will only extract the required fields.
• Data cleansing is one of the essential steps in the data analysis phase.
• For our analysis, all we need is data for the following: subject, from, date, to, label, and thread.
B. Data cleansing
Let's create a CSV file with only the required fields, starting with the following steps:
1. Import the csv package:
import csv
2. Create a CSV file with only the required attributes:
with open('mailbox.csv', 'w') as outputfile:
    writer = csv.writer(outputfile)
    writer.writerow(['subject', 'from', 'date', 'to', 'label', 'thread'])
    for message in mbox:
        writer.writerow([
            message['subject'],
            message['from'],
            message['date'],
            message['to'],
            message['X-Gmail-Labels'],
            message['X-GM-THRID'],
        ])
• The preceding code produces a CSV file named mailbox.csv. From now on, instead of loading the mbox file, we can load this CSV file, which is smaller than the original dataset.
EDA WITH PERSONAL EMAIL - STEP 3
C. Loading the CSV file
• We will load the CSV file. Refer to the following code block:
dfs = pd.read_csv('mailbox.csv', names=['subject', 'from', 'date', 'to', 'label', 'thread'])
• The preceding code generates a pandas data frame with only the required fields stored in the CSV file.
D. Converting the date
• Next, we will convert the date. Check the datatypes of each column as shown here:
dfs.dtypes
• Note that the date field is an object, so we need to convert it into an actual DateTime value. We can do this by using the pandas to_datetime() method. See the following code:
dfs['date'] = dfs['date'].apply(lambda x: pd.to_datetime(x, errors='coerce', utc=True))
EDA WITH PERSONAL EMAIL - STEP 3
E. Removing NaN values
• Next, we are going to remove NaN values from the date field. We can do this as follows:
dfs = dfs[dfs['date'].notna()]
• It is good practice to save the preprocessed data frame into a separate CSV file in case we need it again:
dfs.to_csv('gmail.csv')
EDA WITH PERSONAL EMAIL - STEP 4
Applying descriptive statistics
• Having preprocessed the dataset, let's do some sanity checking using descriptive statistics techniques:
dfs.info()
• Let's also check the first few entries of the email dataset:
dfs.head(10)
• Note that our data frame so far contains six different columns. Take a look at the from field: it contains both the name and the email address. For our analysis, we only need the email address, so we can use a regular expression to refactor the column.
EDA WITH PERSONAL EMAIL - STEP 5
Data refactoring
1. First of all, import the regular expression package:
import re
2. Next, let's create a function that takes an entire string from any column and extracts an email address:
def extract_email_ID(string):
    email = re.findall(r'<(.+?)>', string)
    if not email:
        email = list(filter(lambda y: '@' in y, string.split()))
    return email[0] if email else np.nan
3. Next, let's apply the function to the from column:
dfs['from'] = dfs['from'].apply(lambda x: extract_email_ID(x))
4. Next, we are going to refactor the label field. The logic is simple: if an email is from your own email address, then it is a sent email; otherwise, it is a received email, that is, an inbox email:
myemail = 'itsmeskm99@gmail.com'
dfs['label'] = dfs['from'].apply(lambda x: 'sent' if x == myemail else 'inbox')
EDA WITH PERSONAL EMAIL - STEP 6
Dropping columns
1. Note that the to column only contains your own email address, so we can drop it:
dfs.drop(columns='to', inplace=True)
2. This drops the to column from the data frame. Let's display the first 10 entries now:
dfs.head(10)
• Check the preceding output: the fields are cleaned and the data is transformed into the correct format.
EDA WITH PERSONAL EMAIL - STEP 7
Refactoring timezones
1. We can refactor timezones by using the method given here:
import datetime
import pytz

def refactor_timezone(x):
    est = pytz.timezone('US/Eastern')
    return x.astimezone(est)
DATA ANALYSIS
• This is the most important part of EDA: the part where we gain insights from the data that we have.
• Let's answer the following questions one by one:
1. How many emails did I send during a given timeframe?
2. At what times of the day do I send and receive emails with Gmail?
3. What is the average number of emails per day?
4. What is the average number of emails per hour?
5. Whom do I communicate with most frequently?
6. What are the most active emailing days?
7. What am I mostly emailing about?
• In the following sections, we will answer the preceding questions.
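As a sketch of question 2 (at what times of day do I send and receive emails?), assuming `dfs['date']` has already been converted to datetimes as in Step 3, an hourly count is a one-line groupby. The frame below is synthetic, standing in for the real email data:

```python
import pandas as pd

# Synthetic stand-in for the preprocessed email frame
dfs = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-01-06 09:15", "2020-01-06 09:40",
        "2020-01-06 14:05", "2020-01-07 09:55",
    ]),
    "label": ["inbox", "sent", "inbox", "inbox"],
})

# Count emails by hour of day; plotting is then per_hour.plot(kind='bar')
per_hour = dfs.groupby(dfs["date"].dt.hour).size()
```

The same pattern with `dfs["label"]` added to the groupby separates sent from received traffic.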
AVERAGE EMAILS PER DAY AND HOUR
• The preceding graphs illustrate the average emails per hour and per day.
• In my case, most email communication happened between 2018 and 2020.
NUMBER OF EMAILS PER DAY
• Let's find the busiest day of the week in terms of emails:
counts = dfs.dayofweek.value_counts(sort=False)
counts.plot(kind='bar')
• The output of the preceding code is as follows:
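The `dayofweek` column used above is not created earlier in this excerpt; one plausible way to derive it from the converted `date` column is shown below (a sketch with a synthetic frame, assuming the datetime conversion from Step 3 has been applied):

```python
import pandas as pd

# Synthetic stand-in for the preprocessed email frame
dfs = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-01-06", "2020-01-07", "2020-01-07", "2020-01-11",
    ])
})

# Derive the weekday name from the datetime column
dfs["dayofweek"] = dfs["date"].dt.day_name()

# Same counting step as above, on weekday names
counts = dfs["dayofweek"].value_counts(sort=False)
```

`counts.plot(kind='bar')` then reproduces the busiest-day bar chart described in the slide.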
SUMMARY
• We imported data from our own Gmail accounts in mbox format.
• We loaded the dataset and performed some primitive EDA techniques, including data loading, data transformation, and data analysis.
• We also tried to answer some basic questions about email communication.
VERTICAL VS. HORIZONTAL DATA SCIENTISTS
Vertical data scientists have very deep knowledge of some narrow field. They might be:
• a computer scientist very familiar with the computational complexity of all sorting algorithms;
• a software engineer with years of experience writing Python code (including graphics libraries) applied to API development and web-crawling technology;
• a database specialist with strong data modeling, data warehousing, graph database, Hadoop, and NoSQL expertise;
• or a predictive modeler who is an expert in Bayesian networks, SAS, and SVMs.
HORIZONTAL DATA SCIENTISTS
• They are a blend of business analysts, statisticians, computer scientists, and domain experts. They combine vision with technical knowledge.
• They know about more modern, data-driven techniques applicable to unstructured, streaming, and big data.
• They can design robust, efficient, simple, replicable, and scalable code and algorithms.
Horizontal data scientists also come with the following features:
• They have some familiarity with Six Sigma concepts; in essence, speed is more important than perfection for these analytic practitioners.
• They have experience in producing success stories out of large, complicated, messy data sets, including in measuring that success.
• They have experience in identifying the real problem to be solved and the data sets (external and internal), database structures, and metrics they need, rather than being passive consumers of data sets produced or gathered by third parties lacking the skills to collect or create the right data.
• They know rules of thumb and pitfalls to avoid, more than theoretical concepts. However, they have a bit more than just basic knowledge of computational complexity, good sampling and design of experiments, robust statistics and cross-validation, modern database design, and programming languages (R, scripting languages, MapReduce concepts, SQL).
• They have advanced Excel and visualization skills.
• They can help produce useful dashboards (the ones that people really use on a daily basis to make decisions) or alternative tools to communicate insights found in data (orally, by email, or automatically, sometimes in real-time machine-to-machine mode).
• They think outside the box.
• They are innovators who create truly useful stuff.
Vertical data scientists are the by-product of our rigid university system, which trains people to become a computer scientist, a statistician, an operations researcher, or an MBA, but not all four at the same time. This is one of the reasons for offering a data science program, and why recruiters can't find data scientists: mostly, they find and recruit vertical data scientists. Companies are not yet used to identifying horizontal data scientists, the true money makers and ROI generators among analytic professionals.
DETAILED CLASSIFICATION OF ML TECHNIQUES
• Part 1: Data Pre-processing (before ML)
• Part 2: Regression
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Support Vector Regression (SVR)
• Decision Tree Regression
• Random Forest Regression
• Evaluating Regression Model Performance (KPIs): R², RMSE, k-fold cross-validation and score
CONT..
• Part 3: Classification
• Logistic Regression
• K-Nearest Neighbors (K-NN)
• Support Vector Machine (SVM)
• Kernel SVM
• Naive Bayes
• Decision Tree Classification
• Random Forest Classification
• Evaluating Classification Model Performance: confusion matrix, recall, precision, sensitivity/specificity
CONT..
• Part 4: Clustering
• K-Means Clustering
• Hierarchical Clustering
• Part 5: Association Rule Learning
• Part 6: Reinforcement Learning
• Part 7: Natural Language Processing
• Part 8: Deep Learning
• Artificial Neural Networks (ANN)
• Convolutional Neural Networks (CNN)
CONT..
• Part 9: Dimensionality Reduction
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Kernel PCA
• Part 10: Model Selection and Deployment
REFERENCES
• Book: Hands-On Exploratory Data Analysis with Python by Suresh Kumar Mukhiya and Usman Ahmed, Chapters 3 & 11.
• https://www.datasciencecentral.com/profiles/blogs/vertical-vs-horizontal-data-scientists
• https://towardsdatascience.com/vertical-vs-horizontal-ai-startups-e2bdec23aa16