Datasets using R-Studio
Usha Rani Singh
Datasets for cars
A dataset is a collection of related information that is useful for analyzing data and deriving outputs
The dataset contains information in various forms, and it is not straightforward for the analyst to extract the data and present it to the business
Preparing Dataset for cars
Preparing and analyzing the dataset is very important for any threat information, as it helps to provide accurate data
We have to consider the data that provides more value or is relevant to the problem
Categorize the data into regression, classification, clustering, and ranking
It is difficult to establish a data collection mechanism when data is scattered across various forms and departments
We have to make the data consistent
The data sample may be reduced, but at the same time it should contain the required information
Preparing Dataset for cars
We have to clean the data so that processing is faster and more accurate
Complex datasets have to be decomposed into multiple parts
Data normalization has to be performed to improve the quality of the data
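As a minimal sketch of the normalization step, base R's scale() performs z-score normalization, rescaling each numeric column to mean 0 and standard deviation 1 (the built-in mtcars data is used purely for illustration, not the cars dataset from these slides):

```r
# z-score normalization of a numeric data set
# (built-in mtcars used purely for illustration)
normalized <- as.data.frame(scale(mtcars))

# every column now has mean ~0 (within floating-point error)
# and standard deviation exactly 1
mean(normalized$mpg)
sd(normalized$mpg)
```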
R Studio for dataset
Pie Diagram for dataset
ggplot_1 for dataset
ggplot_2 for dataset
ggplot_3 for dataset
Dataset used
Thank you!
Week 10 – Analysing Data sets in RapidMiner
The data sets used for this week's analysis relate to the CSRIC best practices:
The CSRIC Best Practices Search Tool allows you to search CSRIC's collection of Best Practices using a variety of criteria including Network Type, Industry Role, Keywords, Priority Levels, and BP Number. The Communications Security, Reliability and Interoperability Council's (CSRIC) mission is to provide recommendations to the FCC to ensure, among other things, optimal security and reliability of communications systems, including telecommunications, media, and public safety. CSRIC’s members focus on a range of public safety and homeland security-related communications matters, including: (1) the reliability and security of communications systems and infrastructure, particularly mobile systems; (2) 911, Enhanced 911 (E911), and Next Generation 911 (NG911); and (3) emergency alerting.
The CSRIC's recommendations will address the prevention and remediation of detrimental cyber events, the development of best practices to improve overall communications reliability, the availability and performance of communications services and emergency alerting during natural disasters, terrorist attacks, cyber security attacks or other events that result in exceptional strain on the communications infrastructure, the rapid restoration of communications services in the event of widespread or major disruptions and the steps communications providers can take to help secure end-users and servers.
I have used RapidMiner to analyze the data set:
The statistical view of various names, types and attributes
related to the data set.
Visualization of public safety vs prioritization
Overall prioritization pie chart
Bar graph comparing various network types and internet/data
usage
customer-segmentation-data set.zip
Mall_Customers.csv
CustomerID,Gender,Age,Annual Income (k$),Spending Score
(1-100)
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
4,Female,23,16,77
5,Female,31,17,40
6,Female,22,17,76
7,Female,35,18,6
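In R the file would be loaded with read.csv(); the sketch below reconstructs the first rows shown above in memory so it runs stand-alone (AnnualIncome and SpendingScore are simplified column names, not the exact header):

```r
# In practice: customers <- read.csv("Mall_Customers.csv")
# Here the first rows shown above are rebuilt in memory
customers <- read.csv(text = "CustomerID,Gender,Age,AnnualIncome,SpendingScore
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
4,Female,23,16,77")
str(customers)
mean(customers$Age)  # 20.75
```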
Mall Customer Segment Data Analysis using RFM
Vivek Ijjagiri
Agenda
Introduction
Mall Customer Segmentation data
Mall Customer Segment analysis data using RFM
Problem Solving
Clustering
Conclusion
References
Introduction
When we want to increase sales we need to plan our marketing spend, and while formulating a new promotion, as retail marketers we have to be careful about how we segment and target customers. It would be a waste of time and money if, for example, we launched an ad campaign aimed indiscriminately at a large pool of customers. Such untargeted marketing and advertising is not likely to have a high conversion rate and may even hurt our company's value.
Retailers now use sophisticated strategies to segment their customers and target their marketing efforts to those segments. RFM analysis is one such popular customer segmentation technique that can help retailers maximize the return on their marketing investments.
Why RFM?
Improves customer segmentation for marketing and is widely used in surveys.
Simpler than, and often superior to, other methods (CHAID and logistic regression).
Focuses on transaction information, delivering better-targeted marketing to customers.
What is RFM?
R => Recency
F => Frequency
M => Monetary
How do we use RFM to target customers?
Simple: we score the customers on each RFM dimension from high to low.
The greater the score, the more likely the customer is to buy a product or take up a new offer or promotion.
This helps us identify the customers that are most likely to respond to a new offer or promotion.
Identifying the most valuable RFM segments lets us capitalize on the relationships found in the data used for this analysis.
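The scoring idea can be sketched in base R as follows (a toy example with made-up customers, not the presenter's actual code): each customer is ranked 1-5 on recency, frequency, and monetary value, and the three digits combine into an RFM score.

```r
# Toy RFM scoring sketch (illustrative data, not the mall data set)
set.seed(42)
rfm <- data.frame(
  customer  = 1:10,
  recency   = sample(1:90, 10),             # days since last purchase
  frequency = sample(1:20, 10),             # number of purchases
  monetary  = round(runif(10, 10, 500), 2)  # average spend
)

# rank each dimension into 5 equal-sized bins;
# low recency (a more recent purchase) should score HIGH, so it is reversed
score <- function(x, reverse = FALSE) {
  r <- rank(if (reverse) -x else x, ties.method = "first")
  ceiling(r / length(x) * 5)
}
rfm$R <- score(rfm$recency, reverse = TRUE)
rfm$F <- score(rfm$frequency)
rfm$M <- score(rfm$monetary)
rfm$RFM <- rfm$R * 100 + rfm$F * 10 + rfm$M

# customers with the highest RFM score are the first targets for a promotion
head(rfm[order(-rfm$RFM), ])
```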
Mall Customer Segment analysis data using RFM
Recency: Recency, how recently a customer made a purchase, is the most important predictor. Customers who have purchased a product recently are more likely to purchase again from your store/mall than those who have not purchased recently.
Frequency: The second most important factor is how frequently these customers purchase from you. The higher the frequency, the higher the chances of them purchasing the products again.
Monetary: The third factor is the amount of money these customers have spent on purchases. Customers who have spent more are more likely to purchase again than those who have spent less.
How are we going to calculate RFM?
To implement the RFM analysis, we need to further process the data set with the following steps:
Find the most recent purchase date for each ID and calculate the days from it to now (or some other reference date), to get the Recency data
Count the number of transactions of each customer, to get the Frequency data
Sum the amount of money each customer spent and divide it by the Frequency, to get the average amount per transaction, that is, the Monetary data
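These three steps can be sketched on a toy transaction table (hypothetical columns, dates, and amounts, not the presenter's data; in the slides the result of this step is the df_RFM data frame):

```r
# Toy transaction log: one row per purchase (illustrative data)
tx <- data.frame(
  CustomerID = c(1, 1, 2, 2, 2, 3),
  Date   = as.Date(c("2023-01-05", "2023-03-01", "2023-02-10",
                     "2023-02-20", "2023-03-10", "2023-01-20")),
  Amount = c(50, 70, 20, 30, 25, 200)
)
now <- as.Date("2023-03-15")  # reference date for recency

# Step 1: days since each customer's most recent purchase
recency   <- aggregate(Date ~ CustomerID, tx, function(d) as.numeric(now - max(d)))
# Step 2: number of transactions per customer
frequency <- aggregate(Amount ~ CustomerID, tx, length)
# Step 3: total spend divided by frequency = average per transaction
total     <- aggregate(Amount ~ CustomerID, tx, sum)

df_RFM <- data.frame(CustomerID = recency$CustomerID,
                     recency    = recency$Date,
                     frequency  = frequency$Amount,
                     monetary   = total$Amount / frequency$Amount)
df_RFM  # customer 1: recency 14, frequency 2, monetary 60
```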
Problem Solving
Make sure we have the following libraries to proceed with the data analysis; if any of them are not found in your R Studio, install those packages.
library(data.table)
library(dplyr)
library(ggplot2)
library(tidyr)
library(knitr)
library(rmarkdown)
Load and examine data
> Mall_Customers <- fread('data.csv')
> glimpse(Mall_Customers)
glimpse() is like a transposed version of print: columns run down the page, and data runs across. This makes it possible to see every column in a data frame. It is a little like str applied to a data frame, but it tries to show you as much data as possible. (And it always shows the underlying data, even when applied to a remote data source.)
View Data
Data Cleanup or Wrangle
> Mall_Customers<- Mall_Customers%>%
mutate(Quantity = replace(Quantity, Quantity<=0, NA),
UnitPrice = replace(UnitPrice, UnitPrice<=0, NA))
> summary(df_RFM)
Calculate RFM
> kable(head(df_RFM))
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
The objective of K-means is simple: group similar data points together and discover underlying patterns.
To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.
A cluster refers to a collection of data points aggregated
together because of certain similarities.
In other words, the K-means algorithm identifies k number of
centroids, and then allocates every data point to the nearest
cluster, while keeping the centroids as small as possible.
K Means Clustering Algorithm
1. Specify the number of clusters K.
2. Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
3. Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn't changing.
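A minimal sketch of these steps with base R's kmeans() on toy two-dimensional data (illustrative, not the presenter's exact call; nstart re-runs the random initialization several times and keeps the best result):

```r
set.seed(123)  # k-means starts from random centroids, so fix the seed

# toy data: two well-separated groups of 20 points in 2-D
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))

km <- kmeans(pts, centers = 2, nstart = 25)  # step 1: choose K = 2
km$centers        # final centroids after the iterations converge
table(km$cluster) # both groups of 20 recovered
```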
Recency
Recency – How recently did the customer purchase?
> Customer_Purchase_Recency <- df_RFM$recency
> hist(Customer_Purchase_Recency, main = 'Recency')
Frequency
Frequency – How often do they purchase?
> Customer_Purchase_Frequency <- df_RFM$frequency
> hist(Customer_Purchase_Frequency, main = 'Frequency')
Monetary
Monetary Value – How much do they spend?
> Customer_Purchase_Monitery <- df_RFM$monitery
> hist(Customer_Purchase_Monitery, main = 'Monetary', breaks = 50)
Monetary Log
Because the data is skewed, we use a log scale to normalize it
> MoniteryLog <- log(df_RFM$monitery)
> hist(MoniteryLog, main ='MoniteryLog')
See https://www.rdocumentation.org/packages/amap/versions/0.8-17/topics/hcluster
hcluster is a mix of the functions hclust and dist: hcluster(x, method = "euclidean", link = "complete") = hclust(dist(x, method = "euclidean"), method = "complete"). It uses half the memory, as it does not store the distance matrix.
For more details, see the documentation of hclust and Dist.
Clustering
> DataFrame_Clustering <- df_RFM
> DataFrame_CustomerID <- DataFrame_Clustering$CustomerID
> row.names(DataFrame_Clustering) <- DataFrame_CustomerID
> DataFrame_CustomerID <- NULL
> DataFrame_Clustering <- scale(DataFrame_Clustering)
> summary(DataFrame_Clustering)
Clustering
> d <- dist(DataFrame_Clustering)
> c <- hclust(d, method = 'ward.D2')
> plot(c)
A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering. The main use of a dendrogram is to work out the best way to allocate objects to clusters. The dendrogram below shows the hierarchical clustering of six observations shown on the scatterplot to the left. (Dendrogram is often miswritten as dendogram.)
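Allocating objects to clusters from the dendrogram is done with cutree(); a self-contained sketch on toy data (illustrative; in the slides the hclust object is c):

```r
set.seed(1)
# toy data: two well-separated groups of 10 points in 2-D
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 6), ncol = 2))

hc <- hclust(dist(toy), method = "ward.D2")
plot(hc)                      # the dendrogram itself

members <- cutree(hc, k = 2)  # cut the tree into 2 clusters
table(members)                # cluster sizes: 10 and 10
```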
Plotting with less data
Conclusion
The customer segmentation process can be performed using various clustering algorithms.
We focused on k-means clustering in R.
The algorithm is quite simple to implement. However, representing the data in the correct format and interpreting the results is the difficult part.
RFM analysis can segment customers, design offers and promotions specific to an audience, and shape products based on customer profiles and interests.
Thank you!
Any questions?
RcodeProject.R
##########################################
# section 3.3 Statistical Methods for Evaluation
##########################################
##########################################
# section 3.3.1 Hypothesis Testing
##########################################
# generate random observations from the two populations
x <- rnorm(10, mean=100, sd=5)  # normal distribution centered at 100
y <- rnorm(20, mean=105, sd=5)  # normal distribution centered at 105
# Student's t-test
t.test(x, y, var.equal=TRUE) # run the Student's t-test
# obtain t value for a two-sided test at a 0.05 significance level
qt(p=0.05/2, df=28, lower.tail= FALSE)
# Welch's t-test
t.test(x, y, var.equal=FALSE) # run the Welch's t-test
# Wilcoxon Rank-Sum Test
wilcox.test(x, y, conf.int = TRUE)
##########################################
# section 3.3.6 ANOVA
##########################################
offers <- sample(c("offer1", "offer2", "nopromo"), size=500, replace=TRUE)
# Simulated 500 observations of purchase sizes on the 3 offer options
purchasesize <- ifelse(offers=="offer1", rnorm(500, mean=80, sd=30),
                ifelse(offers=="offer2", rnorm(500, mean=85, sd=30),
                       rnorm(500, mean=40, sd=30)))
# create a data frame of offer option and purchase size
offertest <- data.frame(offer=as.factor(offers), purchase_amt=purchasesize)
# display a summary of offertest where offer="offer1"
summary(offertest[offertest$offer=="offer1",])
# display a summary of offertest where offer="offer2"
summary(offertest[offertest$offer=="offer2",])
# display a summary of offertest where offer="nopromo"
summary(offertest[offertest$offer=="nopromo",])
# fit ANOVA test
model <- aov(purchase_amt ~ offer, data=offertest)
summary(model)