Datasets using R-Studio
Usha Rani Singh
Datasets for cars
A dataset is a collection of related information that is useful for analyzing data and deriving outputs
The dataset contains information in various forms, and it is not straightforward for the analyst to extract the data and present it to the business
Preparing Dataset for cars
Preparing and analyzing the dataset is very important for any threat information, as it helps to provide accurate data
We have to consider the data that provides more value or is relevant to the problem
Categorize the data into regression, classification, clustering, and ranking
It is difficult to establish a data collection mechanism when data is scattered across various forms and departments
We have to make the data consistent
The data sample may be reduced, but at the same time it should contain the required information
Preparing Dataset for cars
We have to clean the data so that processing is faster and more accurate
Complex datasets have to be decomposed into multiple parts
Data normalization has to be performed to improve the quality of the data
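As a minimal sketch of the normalization step, base R's scale() performs z-score normalization, rescaling each numeric column to mean 0 and standard deviation 1 (the built-in mtcars data is used purely for illustration, not the cars dataset from these slides):

```r
# z-score normalization of a numeric data set
# (built-in mtcars used purely for illustration)
normalized <- as.data.frame(scale(mtcars))

# every column now has mean ~0 (within floating-point error)
# and standard deviation exactly 1
mean(normalized$mpg)
sd(normalized$mpg)
```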
R Studio for dataset
Pie Diagram for dataset
ggplot_1 for dataset
ggplot_2 for dataset
ggplot_3 for dataset
Dataset used
Thank you!
Week 10 – Analysing Data sets in RapidMiner
The data sets used for this week's analysis relate to the CSRIC best practices:
The CSRIC Best Practices Search Tool allows you to search CSRIC's collection of Best Practices using a variety of criteria including Network Type, Industry Role, Keywords, Priority Levels, and BP Number. The Communications Security, Reliability and Interoperability Council's (CSRIC) mission is to provide recommendations to the FCC to ensure, among other things, optimal security and reliability of communications systems, including telecommunications, media, and public safety. CSRIC’s members focus on a range of public safety and homeland security-related communications matters, including: (1) the reliability and security of communications systems and infrastructure, particularly mobile systems; (2) 911, Enhanced 911 (E911), and Next Generation 911 (NG911); and (3) emergency alerting.
The CSRIC's recommendations will address the prevention and remediation of detrimental cyber events, the development of best practices to improve overall communications reliability, the availability and performance of communications services and emergency alerting during natural disasters, terrorist attacks, cyber security attacks or other events that result in exceptional strain on the communications infrastructure, the rapid restoration of communications services in the event of widespread or major disruptions and the steps communications providers can take to help secure end-users and servers.
I have used RapidMiner to analyze the data set:
The statistical view of various names, types and attributes
related to the data set.
Visualization of public safety vs prioritization
Overall prioritization pie chart
Bar graph comparing various network types and internet/data
usage
customer-segmentation-data set.zip
Mall_Customers.csv
CustomerID,Gender,Age,Annual Income (k$),Spending Score
(1-100)
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
4,Female,23,16,77
5,Female,31,17,40
6,Female,22,17,76
7,Female,35,18,6
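In R the file would be loaded with read.csv(); the sketch below reconstructs the first rows shown above in memory so it runs stand-alone (AnnualIncome and SpendingScore are simplified column names, not the exact header):

```r
# In practice: customers <- read.csv("Mall_Customers.csv")
# Here the first rows shown above are rebuilt in memory
customers <- read.csv(text = "CustomerID,Gender,Age,AnnualIncome,SpendingScore
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
4,Female,23,16,77")
str(customers)
mean(customers$Age)  # 20.75
```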
Mall Customer Segment Data Analysis using RFM
Vivek Ijjagiri
Agenda
Introduction
Mall Customer Segmentation data
Mall Customer Segment analysis data using RFM
Problem Solving
Clustering
Conclusion
References
Introduction
When we want to increase sales we need to plan our marketing spend, and while formulating a new promotion, as retail marketers we have to be careful about how we segment and target customers. It would be a waste of time and money if, for example, we launched an ad campaign aimed indiscriminately at a large pool of customers. Such untargeted marketing and advertising is not likely to have a high conversion rate and may even hurt our company's value.
Retailers now use sophisticated strategies to segment their customers and target their marketing efforts to those segments. RFM analysis is one such popular customer segmentation technique that can help retailers maximize the return on their marketing investments.
Why RFM?
Improves customer segmentation for marketing and is widely used in surveys.
Simpler than, and often superior to, other methods (CHAID and logistic regression).
Focuses on transaction information, delivering better-targeted marketing to customers.
What is RFM?
R => Recency
F => Frequency
M => Monetary
How do we use RFM to target customers?
Simple: we score the customers on each RFM dimension from high to low.
The greater the score, the more likely the customer is to buy a product or take up a new offer or promotion.
This helps us identify the customers that are most likely to respond to a new offer or promotion.
Identifying the most valuable RFM segments lets us capitalize on the relationships found in the data used for this analysis.
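The scoring idea can be sketched in base R as follows (a toy example with made-up customers, not the presenter's actual code): each customer is ranked 1-5 on recency, frequency, and monetary value, and the three digits combine into an RFM score.

```r
# Toy RFM scoring sketch (illustrative data, not the mall data set)
set.seed(42)
rfm <- data.frame(
  customer  = 1:10,
  recency   = sample(1:90, 10),             # days since last purchase
  frequency = sample(1:20, 10),             # number of purchases
  monetary  = round(runif(10, 10, 500), 2)  # average spend
)

# rank each dimension into 5 equal-sized bins;
# low recency (a more recent purchase) should score HIGH, so it is reversed
score <- function(x, reverse = FALSE) {
  r <- rank(if (reverse) -x else x, ties.method = "first")
  ceiling(r / length(x) * 5)
}
rfm$R <- score(rfm$recency, reverse = TRUE)
rfm$F <- score(rfm$frequency)
rfm$M <- score(rfm$monetary)
rfm$RFM <- rfm$R * 100 + rfm$F * 10 + rfm$M

# customers with the highest RFM score are the first targets for a promotion
head(rfm[order(-rfm$RFM), ])
```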
Mall Customer Segment analysis data using RFM
Recency: Recency, how recently a customer made a purchase, is the most important predictor. Customers who have purchased a product recently are more likely to purchase again from your store/mall than those who have not purchased recently.
Frequency: The second most important factor is how frequently these customers purchase from you. The higher the frequency, the higher the chances of them purchasing the products again.
Monetary: The third factor is the amount of money these customers have spent on purchases. Customers who have spent more are more likely to purchase again than those who have spent less.
How are we going to calculate RFM?
To implement the RFM analysis, we need to further process the data set with the following steps:
Find the most recent purchase date for each ID and calculate the days from it to now (or some other reference date), to get the Recency data
Count the number of transactions of each customer, to get the Frequency data
Sum the amount of money each customer spent and divide it by the Frequency, to get the average amount per transaction, that is, the Monetary data
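These three steps can be sketched on a toy transaction table (hypothetical columns, dates, and amounts, not the presenter's data; in the slides the result of this step is the df_RFM data frame):

```r
# Toy transaction log: one row per purchase (illustrative data)
tx <- data.frame(
  CustomerID = c(1, 1, 2, 2, 2, 3),
  Date   = as.Date(c("2023-01-05", "2023-03-01", "2023-02-10",
                     "2023-02-20", "2023-03-10", "2023-01-20")),
  Amount = c(50, 70, 20, 30, 25, 200)
)
now <- as.Date("2023-03-15")  # reference date for recency

# Step 1: days since each customer's most recent purchase
recency   <- aggregate(Date ~ CustomerID, tx, function(d) as.numeric(now - max(d)))
# Step 2: number of transactions per customer
frequency <- aggregate(Amount ~ CustomerID, tx, length)
# Step 3: total spend divided by frequency = average per transaction
total     <- aggregate(Amount ~ CustomerID, tx, sum)

df_RFM <- data.frame(CustomerID = recency$CustomerID,
                     recency    = recency$Date,
                     frequency  = frequency$Amount,
                     monetary   = total$Amount / frequency$Amount)
df_RFM  # customer 1: recency 14, frequency 2, monetary 60
```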
Problem Solving
Make sure we have the following libraries to proceed with the data analysis; if any of them are not found in your R Studio, install those packages.
library(data.table)
library(dplyr)
library(ggplot2)
library(tidyr)
library(knitr)
library(rmarkdown)
Load and examine data
> Mall_Customers <- fread('data.csv')
> glimpse(Mall_Customers)
glimpse() is like a transposed version of print: columns run down the page, and data runs across. This makes it possible to see every column in a data frame. It is a little like str applied to a data frame, but it tries to show you as much data as possible. (And it always shows the underlying data, even when applied to a remote data source.)
View Data
Data Cleanup or Wrangle
> Mall_Customers<- Mall_Customers%>%
mutate(Quantity = replace(Quantity, Quantity<=0, NA),
UnitPrice = replace(UnitPrice, UnitPrice<=0, NA))
> summary(df_RFM)
Calculate RFM
> kable(head(df_RFM))
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
The objective of K-means is simple: group similar data points together and discover underlying patterns.
To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.
A cluster refers to a collection of data points aggregated
together because of certain similarities.
In other words, the K-means algorithm identifies k number of
centroids, and then allocates every data point to the nearest
cluster, while keeping the centroids as small as possible.
K Means Clustering Algorithm
1. Specify the number of clusters K.
2. Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
3. Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn't changing.
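A minimal sketch of these steps with base R's kmeans() on toy two-dimensional data (illustrative, not the presenter's exact call; nstart re-runs the random initialization several times and keeps the best result):

```r
set.seed(123)  # k-means starts from random centroids, so fix the seed

# toy data: two well-separated groups of 20 points in 2-D
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))

km <- kmeans(pts, centers = 2, nstart = 25)  # step 1: choose K = 2
km$centers        # final centroids after the iterations converge
table(km$cluster) # both groups of 20 recovered
```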
Recency
Recency – How recently did the customer purchase?
> Customer_Purchase_Recency <- df_RFM$recency
> hist(Customer_Purchase_Recency, main = 'Recency')
Frequency
Frequency – How often do they purchase?
> Customer_Purchase_Frequency <- df_RFM$frequency
> hist(Customer_Purchase_Frequency, main = 'Frequency')
Monetary
Monetary Value – How much do they spend?
> Customer_Purchase_Monitery <- df_RFM$monitery
> hist(Customer_Purchase_Monitery, main = 'Monetary', breaks = 50)
Monetary Log
Because the data is skewed, we use a log scale to normalize it
> MoniteryLog <- log(df_RFM$monitery)
> hist(MoniteryLog, main ='MoniteryLog')
See https://www.rdocumentation.org/packages/amap/versions/0.8-17/topics/hcluster
hcluster is a mix of the functions hclust and dist: hcluster(x, method = "euclidean", link = "complete") = hclust(dist(x, method = "euclidean"), method = "complete"). It uses half the memory, as it does not store the distance matrix.
For more details, see the documentation of hclust and Dist.
Clustering
> DataFrame_Clustering <- df_RFM
> DataFrame_CustomerID <- DataFrame_Clustering$CustomerID
> row.names(DataFrame_Clustering) <- DataFrame_CustomerID
> DataFrame_CustomerID <- NULL
> DataFrame_Clustering <- scale(DataFrame_Clustering)
> summary(DataFrame_Clustering)
Clustering
> d <- dist(DataFrame_Clustering)
> c <- hclust(d, method = 'ward.D2')
> plot(c)
A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering. The main use of a dendrogram is to work out the best way to allocate objects to clusters. The dendrogram below shows the hierarchical clustering of six observations shown on the scatterplot to the left. (Dendrogram is often miswritten as dendogram.)
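Allocating objects to clusters from the dendrogram is done with cutree(); a self-contained sketch on toy data (illustrative; in the slides the hclust object is c):

```r
set.seed(1)
# toy data: two well-separated groups of 10 points in 2-D
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 6), ncol = 2))

hc <- hclust(dist(toy), method = "ward.D2")
plot(hc)                      # the dendrogram itself

members <- cutree(hc, k = 2)  # cut the tree into 2 clusters
table(members)                # cluster sizes: 10 and 10
```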
Plotting with less data
Conclusion
The customer segmentation process can be performed using various clustering algorithms.
We focused on k-means clustering in R.
The algorithm is quite simple to implement. However, representing the data in the correct format and interpreting the results is the difficult part.
RFM analysis can segment customers, design offers and promotions specific to an audience, and shape products based on customer profiles and interests.
Thank you!
Any questions?
RcodeProject.R
##########################################
# section 3.3 Statistical Methods for Evaluation
##########################################
##########################################
# section 3.3.1 Hypothesis Testing
##########################################
# generate random observations from the two populations
x <- rnorm(10, mean=100, sd=5)  # normal distribution centered at 100
y <- rnorm(20, mean=105, sd=5)  # normal distribution centered at 105
# Student's t-test
t.test(x, y, var.equal=TRUE) # run the Student's t-test
# obtain t value for a two-sided test at a 0.05 significance level
qt(p=0.05/2, df=28, lower.tail= FALSE)
# Welch's t-test
t.test(x, y, var.equal=FALSE) # run the Welch's t-test
# Wilcoxon Rank-Sum Test
wilcox.test(x, y, conf.int = TRUE)
##########################################
# section 3.3.6 ANOVA
##########################################
offers <- sample(c("offer1", "offer2", "nopromo"), size=500, replace=TRUE)
# Simulated 500 observations of purchase sizes on the 3 offer options
purchasesize <- ifelse(offers=="offer1", rnorm(500, mean=80, sd=30),
                ifelse(offers=="offer2", rnorm(500, mean=85, sd=30),
                       rnorm(500, mean=40, sd=30)))
# create a data frame of offer option and purchase size
offertest <- data.frame(offer=as.factor(offers), purchase_amt=purchasesize)
# display a summary of offertest where offer="offer1"
summary(offertest[offertest$offer=="offer1",])
# display a summary of offertest where offer="offer2"
summary(offertest[offertest$offer=="offer2",])
# display a summary of offertest where offer="nopromo"
summary(offertest[offertest$offer=="nopromo",])
# fit ANOVA test
model <- aov(purchase_amt ~ offer, data=offertest)
summary(model)