The document analyzes a superstore dataset containing sales data from 2015 to 2018 to understand shopping patterns and identify profitable products and regions. Visualizations show that the Western region had the most orders and that the most common order quantity was a single item. Technology was the most profitable category, at 36% of sales. A treemap showed that phones had the highest sales while furniture ran at a loss. Sales and profit were somewhat correlated, while profit and discount were negatively correlated. Word clouds indicated that products related to Xerox, binders, chairs, and Avery were ordered most frequently.
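The correlation findings above can be reproduced on any sales table with a few lines of pandas. This is a minimal sketch on invented rows; the Sales, Profit, and Discount column names are assumptions, not taken from the actual superstore file.

```python
import pandas as pd

# Hypothetical superstore-style rows (illustrative values, not real data).
df = pd.DataFrame({
    "Sales":    [120.0, 80.0, 300.0, 45.0, 210.0, 95.0],
    "Profit":   [30.0, 12.0, 90.0, -5.0, 55.0, 10.0],
    "Discount": [0.0, 0.2, 0.0, 0.4, 0.1, 0.3],
})

# Pearson correlations: sales vs. profit, and profit vs. discount.
sales_profit = df["Sales"].corr(df["Profit"])
profit_discount = df["Profit"].corr(df["Discount"])
print(round(sales_profit, 2), round(profit_discount, 2))
```

On this toy data the same pattern appears: sales and profit correlate positively, while profit and discount correlate negatively.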
Business analytics is the practice of iterative statistical analysis of a company's data to support data-driven decision making. It has evolved from early uses of basic graphs and spreadsheets to track sales trends and predict outcomes, to modern applications that gain insights from large volumes of historical data using descriptive analytics and predict customer behavior using predictive analytics to inform real-time decisions. Common business analytics tools include SPSS for statistical analysis and Microsoft Excel for calculations, graphs, and pivot tables.
This document provides an overview of Tableau, a business intelligence software for data visualization and analytics. It outlines the 7 key steps to get insights from data quickly using Tableau: 1) connect to a data source, 2) manage the data, 3) create visualizations, 4) edit visualizations, 5) create additional visualizations, 6) build interactive dashboards, and 7) share visualizations. Tableau offers an easy and fast way to transform data into interactive visuals that help users identify patterns and trends to inform business decisions.
The document discusses the history and evolution of information systems over six periods from the 1950s to present:
1) 1950s: Transaction processing systems for electronic data processing
2) 1960s-1970s: Emergence of management information systems to provide reports for managers
3) 1970s-1980s: Development of personal computers and decision support systems for interactive analysis
4) 1980s-1990s: Creation of executive information systems and growth of the internet
5) 1990s-2000s: Applications of artificial intelligence like expert systems and knowledge management systems
6) 2000s-present: Rise of e-business, e-commerce, mobile technologies, big data, and cloud computing.
A 45-minute talk given at the LondonR March 2014 Meetup.
The presentation describes how one might go about an insights-driven data science project with the R language and packages, using an open source dataset.
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic... - Edureka!
Data Analytics for R Course: https://www.edureka.co/r-for-analytics
This Edureka Tutorial on Data Analytics for Beginners will help you learn the various parameters you need to consider while performing data analysis.
The following are the topics covered in this session:
Introduction To Data Analytics
Statistics
Data Cleaning and Manipulation
Data Visualization
Machine Learning
Roles, Responsibilities and Salary of Data Analyst
Need of R
Hands-On
Statistics for Data Science: https://youtu.be/oT87O0VQRi8
The document provides an introduction to data analytics, including defining key terms like data, information, and analytics. It outlines the learning outcomes which are the basic definition of data analytics concepts, different variable types, types of analytics, and the analytics life cycle. The analytics life cycle is described in detail and involves problem identification, hypothesis formulation, data collection, data exploration, model building, and model validation/evaluation. Different variable types like numerical, categorical, and ordinal variables are also defined.
Download at http://DavidHubbard.net/powerpoint - This Introduction to Business Intelligence gives an overview of how Business Intelligence fits into business strategy in general. It does not cover the specific technologies of Business Intelligence; it is meant to explain Business Intelligence to those not already familiar with it.
Business Intelligence made easy! This is the first part of a two-part presentation I prepared for one of our customers to help them understand what Business Intelligence is and what can it do...
Big Data Analysis in Supply Chain Management - Kushal Shah
Big data means larger, more complex data sets, especially from new data sources. These data sets are so voluminous that traditional data processing software simply cannot manage them. But these massive volumes of data can be used to address business problems that were previously impossible to tackle.
The supply chain industry needs this kind of data to survive in all situations.
Best Practices for Killer Data Visualization - Qualtrics
There’s something special about simple, powerful visualizations that tell a story. In fact, 65% of people are visual learners.
Join Qualtrics and Sasha Pasulka from Tableau as we illuminate the world of data visualization and give you clear takeaways to help you tell a better story with data. Getting executive buy-in or that seat at the table may come down to who can visualize data in a way that excites and enlightens the audience.
Data Analytics PowerPoint Presentation Slides - SlideTeam
This document discusses different sources of big data including media, cloud, web, internet of things, databases, social networks, activity-generated data, and legacy documents. It provides brief descriptions of each source, highlighting how they generate valuable insights. Media such as images, videos and social media provide consumer preference data. Cloud storage accommodates structured and unstructured data to provide real-time insights. The web and internet of things generate machine-generated data from various devices. Databases integrate traditional and modern data sources. Social networks and reviews provide user profile and influencer data. Activity logs also contribute to big data. Legacy documents remain an untapped resource.
This document provides an overview of the Power BI learning journey. It outlines the basic, intermediate, and advanced levels which include understanding Power Query, Power Pivot, DAX, Power View, and building reports in Power BI Desktop and the Power BI web/mobile apps. The three main stages are discover (with Power Query), analyze (with Power Pivot and DAX), and visualize (with Power View, Power Map, and Power BI tools). Understanding functions like CALCULATE, relationships, and measures is important for effective data modeling and dashboard creation in Power BI. Upcoming features and resources for continued learning are also mentioned.
Business analytics uses data, statistical analysis, and other quantitative techniques to help understand and optimize business performance. It is becoming a major tool used by many large corporations. There are various tools and techniques for business analytics, including online analytical processing (OLAP), data visualization, data mining, predictive analysis, and geographic information systems (GIS). Real-time business intelligence and automated decision support are also increasingly important for analytics.
This document discusses online analytical processing (OLAP) and related concepts. It defines data mining, data warehousing, OLTP, and OLAP. It explains that a data warehouse integrates data from multiple sources and stores historical data for analysis. OLAP allows users to easily extract and view data from different perspectives. The document also discusses OLAP cube operations like slicing, dicing, drilling, and pivoting. It describes different OLAP architectures like MOLAP, ROLAP, and HOLAP and data warehouse schemas and architecture.
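The cube operations described above (slicing, roll-up, drill-down) can be sketched with a pandas pivot table. This is an illustrative toy fact table, not any specific warehouse schema; the region, category, year, and sales columns are assumptions.

```python
import pandas as pd

# Toy fact table: each row is one sale (dimension columns plus a measure).
facts = pd.DataFrame({
    "region":   ["West", "West", "East", "East", "West", "East"],
    "category": ["Tech", "Office", "Tech", "Office", "Tech", "Tech"],
    "year":     [2017, 2017, 2017, 2018, 2018, 2018],
    "sales":    [100, 50, 80, 40, 120, 90],
})

# Build a small "cube": sales aggregated over region x category.
cube = facts.pivot_table(index="region", columns="category",
                         values="sales", aggfunc="sum", fill_value=0)

# Slice: fix one dimension value (category == "Tech").
tech_slice = cube["Tech"]

# Roll-up: aggregate away the category dimension, keeping region totals.
rollup = facts.groupby("region")["sales"].sum()

# Drill-down: add the year dimension for finer granularity.
drilldown = facts.groupby(["region", "year"])["sales"].sum()
print(rollup.to_dict())
```

A real MOLAP/ROLAP engine precomputes and stores these aggregates; the pandas version only mimics the query semantics in memory.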
This is the first presentation of a two-part webinar on Blue Ocean Strategy.
The presentation introduces red ocean and blue ocean companies and explains how blue ocean strategy is a simultaneous pursuit of low cost and value.
It provides a quick introduction, with contemporary examples, to the strategy canvas, the six paths framework, the four actions framework, the buyer utility map, the three tiers of noncustomers, and PMS maps.
It also applies these frameworks in descriptive case studies of companies such as NetJets, Indochino.com, Zynga, and Khan Academy.
The presentation aims to explain the value of blue ocean strategy thinking to a general audience and does not intend any distortion of the facts and frameworks of the original authors, Chan Kim and Renée Mauborgne.
Data visualizations make huge amounts of data more accessible and understandable. Data visualization, or "data viz," is becoming increasingly important as the amount of data generated grows and big data tools help create meaning from all of that data.
This SlideShare presentation takes you through more details around data visualization and includes examples of some great data visualization pieces.
The document discusses business intelligence and data warehousing. It describes the evolution of business intelligence from manual data retrieval and report preparation to modern integrated systems that provide analytics, dashboards, reporting, and key performance indicators. The Performa BI Suite is presented as a user-friendly business intelligence software that offers advanced visualization tools, multidimensional analysis, and integrated analytics, dashboards, and reporting in a single platform. Testimonials from users praise Performa for its ease of use, stability, and ability to meet reporting and analysis needs.
Credit Card Fraud Detection Using Unsupervised Machine Learning Algorithms - Hariteja Bodepudi
This document summarizes a research paper that uses unsupervised machine learning algorithms to detect credit card fraud. It describes how credit card fraud has increased with the rise of online shopping and payments. Unsupervised algorithms are well-suited for this task since labeled fraud data can be difficult to obtain. The paper tests Isolation Forest, Local Outlier Factor, and One Class SVM on a credit card transaction dataset to find anomalies (fraudulent transactions). Isolation Forest achieved the highest accuracy at 99.74%, slightly outperforming Local Outlier Factor, while One Class SVM had much lower accuracy. The paper concludes unsupervised algorithms are effective for anomaly detection tasks like credit card fraud detection.
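As a hedged sketch of the unsupervised approach the paper describes, scikit-learn's IsolationForest can flag outlying transactions. The data here is synthetic (two made-up features with five injected outliers), not the credit card dataset the paper actually used, and the contamination rate is an assumption.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic "transactions": 200 normal points plus 5 obvious outliers.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [9.0, -8.0], [-8.0, 9.0],
                     [10.0, 0.0], [0.0, -10.0]])
X = np.vstack([normal, outliers])

# contamination is the assumed fraction of anomalies in the data.
model = IsolationForest(n_estimators=100, contamination=0.025, random_state=0)
labels = model.fit_predict(X)  # -1 = anomaly, 1 = normal

n_flagged = int((labels == -1).sum())
print(n_flagged)
```

Because the method needs no fraud labels, it fits the paper's premise that labeled fraud data is hard to obtain; the injected extreme points end up flagged as anomalies.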
The document provides an overview of data, information, knowledge, and data mining. It defines data as facts/observations/measurements, information as processed data that is useful (e.g. for decision making), and knowledge as patterns in data/information with a high degree of certainty. Data mining is described as the process of extracting useful but non-obvious information from large databases through an interactive and iterative process. Common business applications and technologies involved in data mining are also discussed.
Visualisation & Storytelling in Data Science & Analytics - Felipe Rego
The document provides an overview of data visualization and storytelling in data science and analytics. It discusses key concepts like what data visualization is, compelling reasons to visualize data like Anscombe's Quartet, visualization in the context of analytics workflows, components of effective storytelling, considerations for presentation, guidelines for data storytelling, and examples of interesting data visualizations. Throughout the document, the author emphasizes best practices like keeping visualizations clear, addressing the intended audience, and avoiding bias.
Data wrangling involves transforming raw data into a usable format through processes like merging data sources, identifying and removing gaps/errors, and structuring data. The main steps of data wrangling are discovery, structuring, cleaning, enriching, validating, and publishing. Data wrangling is important because it ensures data is reliable before analysis, improving insights and reducing risks from faulty data. It typically requires significant time and resources but yields major benefits like improved data usability, integration, and analytics. Common tools for data wrangling include Excel, OpenRefine, Tabula, Google DataPrep, and Data Wrangler.
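The structuring, cleaning, enriching, and validating steps listed above look roughly like this in pandas. This is a minimal sketch; the column names and cleaning rules are invented for illustration.

```python
import numpy as np
import pandas as pd

# Raw, messy input: mixed-case names, a duplicate row, missing values.
raw = pd.DataFrame({
    "name":  ["Alice", "alice", "Bob", "Bob", None],
    "price": ["10.5", "10.5", None, "7.0", "3.2"],
})

# Structuring: enforce types (price arrives as strings).
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")

# Cleaning: normalize case, drop exact duplicates and rows missing a name.
raw["name"] = raw["name"].str.title()
clean = raw.drop_duplicates().dropna(subset=["name"])

# Enriching: derive a new column from existing data.
clean = clean.assign(price_band=np.where(clean["price"] >= 10, "high", "low"))

# Validating: assert the invariants the analysis depends on.
assert clean["name"].notna().all()
print(len(clean))
```

Dedicated tools like OpenRefine or Data Wrangler automate much of this, but the sequence of steps is the same.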
OLAP (online analytical processing) allows users to easily extract and view data from different perspectives. The term was coined by Edgar F. Codd in 1993, and OLAP uses multidimensional data structures called cubes to store and analyze data. OLAP utilizes a multidimensional (MOLAP), relational (ROLAP), or hybrid (HOLAP) approach to store cube data in databases and provide interactive analysis of data.
The client provided KPMG with 3 datasets for analysis: customer demographic data, customer address data, and transactions data from the past 3 months. The transactions data contains 20,000 rows with 26 columns, including customer, product, and transaction information. Some columns have missing values. There are no duplicate rows. Dates in one column are invalid as they all occur on the same day. The new customer list contains 1,000 rows of customer profile data. The data quality assessment found issues with missing values, invalid dates, and potential for further cleaning and analysis.
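A data-quality pass like the one described can be sketched in pandas. The table below is a toy stand-in with invented column names, not KPMG's actual schema; it reproduces the same three issue types (missing values, duplicates, an all-identical date column).

```python
import pandas as pd

# Toy transactions table with the kinds of issues the assessment found.
tx = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 4],
    "amount":      [100.0, None, 250.0, 80.0, 80.0],
    "date":        ["2017-01-01"] * 5,  # suspicious: every date identical
})

missing_per_column = tx.isna().sum()       # missing-value counts per column
n_duplicates = int(tx.duplicated().sum())  # fully duplicated rows
all_same_date = tx["date"].nunique() == 1  # flags the invalid-date issue

print(missing_per_column["amount"], n_duplicates, all_same_date)
```

Running checks like these before any modeling is exactly the "data quality assessment" stage the summary refers to.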
D365 Finance & Operations - Data & Analytics (see newer release of this docum...) - Gina Pabalan
This very comprehensive white paper provides a detailed and clear overview of Microsoft's D365 Finance & Operations solutions to support Data & Analytics.
There is a newer version of this available - search SlideShare for the new version of this deck.
Business Intelligence Presentation 1 (15th March '16) - Muhammad Fahad
Business intelligence (BI) involves methods, processes, technologies, and tools to convert data into useful information that helps organizations make better plans and decisions. It has evolved from executive information systems and decision support systems in the 1980s to include data warehousing, dashboards, analytics, and big data capabilities today. BI provides benefits like improved management and operations, better adjustments to trends, and the ability to predict the future. It has applications across private and public sector organizations. The BI process involves requirements analysis, data modeling, ETL, analytics, and presentation. Key components are the data warehouse, OLAP, data mining, and visualization tools like reports, dashboards, and scorecards. The global BI market is expected to grow significantly
D365 F&O - Data and Analytics White Paper - Gina Pabalan
This very comprehensive white paper provides a detailed and clear overview of Microsoft's D365 Finance & Operations solutions to support Data & Analytics.
Customer Clustering for Retailer Marketing - Jonathan Sedar
This was a 90-minute talk given to the Dublin R user group in November 2013. It describes how one might go about a data analysis project using the R language and packages, using an open source dataset.
This document provides an overview of online analytical processing (OLAP). It defines OLAP as a process for analyzing multidimensional data to help decision makers. OLAP uses data warehouses to store historical data in a structured format. It allows for analytical queries and operations like aggregation, roll-up, drill-down and slicing and dicing of data. SQL extensions and OLAP functions further aid analysis. OLAP systems can be MOLAP, ROLAP or HOLAP based on their architecture and data storage methods. Commercial OLAP systems include IBM, Oracle and Microsoft products.
Analyst View of Data Virtualization: Conversations with Boulder Business Inte... - Denodo
In this presentation, executives from Denodo preview the new Denodo Platform 6.0 release that delivers Dynamic Query Optimizer, cloud offering on Amazon Web Services, and self-service data discovery and search. Over 30 analysts, led by Claudia Imhoff, provide input on strategic direction and benefits of Denodo 6.0 to the data virtualization and the broader data integration market.
This presentation is part of the Fast Data Strategy Conference, and you can watch the video here goo.gl/DR6r3m.
The random forest model generated 182 decision trees from the training data to classify whether users will continue their session or not, with an out-of-bag error rate of 34.17%. Important features were identified using the Gini index. The random forest model was able to successfully build a rule-based classification model with over 70% accuracy on the test data to identify if a user will continue or leave a session based on their behavior metrics.
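A scikit-learn sketch of the approach described above: a random forest with out-of-bag scoring and Gini-based feature importances. The session-behavior features and labels below are synthetic assumptions; only the 182-tree count is taken from the summary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
# Synthetic behavior metrics: 500 sessions, 4 features;
# label 1 = user continues the session, 0 = user leaves.
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# oob_score=True reports accuracy on the out-of-bag samples,
# the same error estimate the summary quotes.
forest = RandomForestClassifier(n_estimators=182, oob_score=True,
                                random_state=0)
forest.fit(X, y)

oob_error = 1.0 - forest.oob_score_
# Gini-based importances, analogous to the Gini index ranking above.
importances = forest.feature_importances_
print(round(oob_error, 3))
```

Here the first two features carry the signal by construction, so they dominate the importance ranking, mirroring how the original model identified its important features.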
This document provides an overview of Hyperion and Essbase. It discusses how raw data is transformed into information through data warehousing processes like extracting, transforming, and loading data. It then explains what an OLTP system is and how Essbase provides multi-dimensional analysis capabilities. Key features of Essbase like dimensions, facts, aggregation, and its architecture are summarized. Finally, the document outlines the typical lifecycle of building and maintaining an Essbase database application.
The document discusses OLAP cubes and data warehousing. It defines OLAP as online analytical processing used to analyze aggregated data in data warehouses. Key concepts covered include star schemas, dimensions and facts, cube operations like roll-up and drill-down, and different OLAP architectures like MOLAP and ROLAP that use multidimensional or relational storage respectively.
This is a 200-level run-through of the Microsoft Azure big data analytics cloud platform, based on the Cortana Intelligence Suite offerings.
Denodo 6.0: Self Service Search, Discovery & Governance using a Universal Se... - Denodo
Presentation slides taken from Fast Data Strategy Roadshow San Francisco Bay Area.
For more Denodo 6.0 demos, please follow this link: https://goo.gl/XkxJjX
Watch full webinar here: https://buff.ly/2mHGaLA
Having started out as the most agile and real-time enterprise data fabric, data virtualization is proving to go beyond its initial promise and is becoming one of the most important enterprise big data fabrics.
Attend this session to learn:
• What data virtualization really is
• How it differs from other enterprise data integration technologies
• Why data virtualization is finding enterprise-wide deployment inside some of the largest organizations
Real-time data visualization using business intelligence techniques and mak... - MD Owes Quruny Shubho
Real-time data visualization using business intelligence techniques to make faster decisions on sales data.
Business intelligence is a way of gaining business advantage from data. That data can be user information, stock information, sales reports, or any source related to the business. From large amounts of data, business intelligence mines information and converts it into knowledge, which feeds the decision support system. BI is an effective way to make data-driven decisions; it visualizes data, giving a visual view that can be easily understood.
June 10, 2010 BDPA Charlotte Program Meeting Presentation.
Presenter:
Markus Beamer, BDPA Charlotte President Elect
Topic:
Intelligent Data Strategies - Intro to Data Marts and Data Warehouses
This document provides an overview of different database technologies for managing large amounts of data, including row-based databases, columnar databases, and NoSQL databases. It discusses how traditional row-based databases struggle with analytics on large, dynamic datasets due to performance issues. Columnar databases help address this by storing data by column rather than row, reducing the amount of data retrieved for queries. NoSQL databases provide non-relational alternatives. The document aims to help readers understand which technology is best suited to their specific data challenges and needs.
The document provides details about Kevin Bengtson's SQL portfolio, including several database projects and T-SQL queries projects with examples. It also outlines SQL server administrative tasks performed and an SSIS/SSRS project involving creating a MiniAdventureWorks database. The final section describes a BlockFlix database designed for a video rental store.
Database Development, Replication, Security, Maintenance Report (nyin27)
The document discusses various database administration tasks including:
1. Creating stored procedures, functions, views and indexes
2. Configuring security using roles, permissions and encryption
3. Implementing database maintenance including backups, jobs, partitioning and monitoring
4. Setting up reports and notifications
Data Warehousing and Business Intelligence is one of the hottest skills today, and is the cornerstone for reporting, data science, and analytics. This course teaches the fundamentals with examples plus a project to fully illustrate the concepts.
This project utilizes the Amazon Rekognition API, a deep-learning-based service for analyzing images and videos that returns results with confidence levels.
The result is then emailed to the subscribed email address.
The whole process has been made possible using multiple AWS services.
Drug Review Analysis Using Elasticsearch and Kibana (Monika Mishra)
Web-based reviews can be viewed as an orthogonal source of information for consumers, physicians, and drug manufacturers to assess the performance of a drug. This project studies the drug review using Elasticsearch and Kibana.
An Empirical Study on Customer Consumption, Loyalty and Retention on a B2C E-... (Monika Mishra)
Research on a small e-commerce clothing company called Natalie’s. We studied the factors on which the yearly amount spent by consumers depends. We also developed a clustering model for customer segmentation, and various regression models to predict the yearly amount spent.
Re-admit Historical using SAS Visual Analytics (Monika Mishra)
- Hospital readmissions are costly and result in $15-20 billion in expenses annually in the US. Preventing avoidable readmissions can improve patient quality of life and reduce healthcare costs.
- The study analyzed a dataset of over 142,000 hospital visits across 10 states from 2011-2012. It found that Florida had the highest number of visits and charges. The heart department had the highest operation count.
- Reducing preventable readmissions requires improving care coordination, patient education, and post-discharge support to ensure patients understand their treatment plan and who to contact if issues arise. The CMS Hospital Readmission Reduction Program financially penalizes hospitals with excess readmissions for certain conditions like heart failure to incentivize lower readmission rates.
Diabetic Encounter Analysis using SAS Studio (Monika Mishra)
This document analyzes diabetic patient encounter data from 130 US hospitals from 1999-2008. Various data visualizations and statistical tests were performed on the dataset. A bar chart shows Caucasians had the most diabetic encounters, followed by African Americans. A box plot reveals the average number of diagnoses was around 7.6. An analysis of age groups found those from 70-80 had the highest inpatient encounters. Internal medicine saw the most patients. Females took diabetic medications slightly more than males. Caucasians accounted for the most inpatient and outpatient visits.
LA Energy and Water Efficiency Statistics using Tableau (Monika Mishra)
This document provides an overview and analysis of an open dataset from the City of Los Angeles on existing building energy and water efficiency. The dataset includes energy and water usage benchmark data for over 6,000 buildings in LA. Various visualizations and statistics are presented analyzing trends in water usage, energy usage, building construction over time, compliance rates, and more to understand patterns and opportunities for improved efficiency. A dashboard combines several visualizations for easy comparison of key metrics.
Predicting Amazon Rating Using Spark ML and Azure ML (Monika Mishra)
The document describes using Spark ML and Azure ML to predict ratings on Amazon products. It uses various recommendation models like Matchbox Recommender, Collaborative Filtering, and Decision Forest/Boosted Decision Tree regression. Text analytics with Logistic Regression is also used to predict sentiment from reviews. Based on RMSE, some Azure ML models performed better than Spark ML models for recommendation and rating prediction. The document discusses the datasets, algorithms, results and challenges faced in modeling.
Big data analysis of the Amazon Product review using Hadoop and Hive on the Oracle Big Data Cloud platform. The visualization tools used are Tableau, Power BI and Microsoft Power Map
Learn SQL from basic queries to advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
State of Artificial Intelligence Report 2023 (kuntobimo2016)
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W... (Social Samosa)
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Natural Language Processing (NLP), RAG and its applications.pptx (fkyes25)
In the realm of Natural Language Processing (NLP), knowledge-intensive tasks such as question answering, fact verification, and open-domain dialogue generation require the integration of vast and up-to-date information. Traditional neural models, though powerful, struggle with encoding all necessary knowledge within their parameters, leading to limitations in generalization and scalability. The paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" introduces RAG (Retrieval-Augmented Generation), a novel framework that synergizes retrieval mechanisms with generative models, enhancing performance by dynamically incorporating external knowledge during inference.
The Building Blocks of QuestDB, a Time Series Database (javier ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup and is sponsored by Zilliz, maintainers of Milvus.
CIS-5270 BUSINESS INTELLIGENCE
Table of Contents
1. Introduction and Goal .................... 3
2. Data Set ................................. 4-5
   1. Data Set URL .......................... 4
   2. About the dataset ..................... 4
   3. Dataset details ....................... 4
   4. Column details ........................ 4-5
3. Data Cleaning ............................ 6-11
   1. Renaming column ....................... 6-7
   2. Removing unwanted column .............. 8-9
   3. Duplicating and splitting column ...... 10-11
4. Analysis & Visualizations ................ 12-23
   1. Bar Chart ............................. 12-13
   2. Histogram ............................. 14-15
   3. Pie Chart ............................. 16-17
   4. Tree Map .............................. 18-19
   5. Correlation Matrix .................... 20-21
   6. Word Cloud ............................ 22-23
5. Statistical Summary & Functions .......... 24-30
   1. Statistical Summary ................... 24-25
   2. User Defined Functions ................ 26-30
6. Code Summary ............................. 31-35
INTRODUCTION AND GOAL
1. Introduction:
The superstore industry comprises companies that operate large spaces to store and supply large amounts of goods. It consists of extensive stores that sell a typical product line of grocery items and merchandise, such as food, pharmaceuticals, apparel, games and toys, hobby items, furniture and appliances. Analyzing this industry is of great importance, as it gives insights into the sales and profits of various products. Our analysis is based on a superstore dataset for the US, where products were ordered between 2015 and 2018.
2. Goal: To find out various superstore statistics, such as:
- The region that accounts for the greatest number of orders
- The frequency distribution of quantity ordered
- Percentage sales by category
- The most profitable category and sub-category
- The category and sub-category that incurred losses
- The product type that was ordered most often
- Yearly sales for various states
With this analysis, the superstore can identify various aspects of the shopping pattern and take measures if required.
DATA SET
1. Data Set URL:
https://data.world/stanke/sample-superstore-2018
2. About the dataset:
The dataset provides information about the sales and profit of a US superstore from 2015 to 2018.
3. Dataset details:
Size 2.4 MB
Number of columns 21
Number of rows 9994
Original file format XLS
4. Column details:
The dataset contains the following columns-
Column Name Column Detail
Row ID Unique row ID
Order ID Unique Order ID
Order Date Ordered Date of the Order
Ship Date Shipping Date of the Order
Ship Mode Shipping mode of the order
Customer ID Unique ID of Customers
Customer Name Customer’s name
Segment Product Segment
Country US
City City of product ordered
State State of product ordered
Postal Code Postal code for the order
Region Region of product ordered
Product ID Unique Product id
Category Product category
Sub-Category Product sub-category
Product Name Name of the product
Sales Sales contribution of the order
Quantity Quantity ordered
Discount Discount provided on order
Profit Profit for the order
DATA CLEANING
1. Renaming Column
Goal: The column name “CT” was not meaningful. The aim is to rename the column to “City”.
Before
After
Code Used
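The rename itself (as listed in the Code Summary) is a single base-R line:

```r
# rename the column "CT" to "City"
colnames(superstore)[colnames(superstore) == "CT"] <- "City"
```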
2. Removing unwanted Column
Goal: The column named “Country” needs to be removed, as it contains only one value,
“United States”.
Before
After
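Code Used (from the Code Summary): the column is dropped with subset():

```r
# drop the Country column, which holds a single constant value
superstore <- subset(superstore, select = -c(Country))
```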
3. Duplicating the column and Splitting it into 3 columns
Goal: To duplicate the column “Order.Date” to “order” and then split “order” into month,
day and year
Before
After
After duplicating / After splitting the order column
(In the Before screenshot, there is no column after Profit.)
Code Used
# duplicate Order.Date into a new column "order"
superstore$order <- superstore$Order.Date
library(tidyr)
# split "order" into month, day and year on "/"
superstore <- separate(superstore, order, c("month","day","year"), sep="/")
Full Screenshot
ANALYSIS & VISUALIZATIONS
1. What is the total number of orders by region?
Plot Type - Bar Chart
Function Used – barplot, table
Analysis
The above bar chart displays the total number of orders by region. The Western region
has the maximum order count (greater than 3,000), followed by the Eastern region with a
count close to 3,000, and then the Central region with around 2,300. The fewest orders
were placed in the Southern region (around 1,500).
Code Used
> countsR <- table(superstore$Region)
> barplot(countsR, main="Total Orders by Region",
+ xlab="Region", col="lightblue")
Full Screenshot
2. What is the frequency distribution of quantity ordered?
Plot Type - Histogram
Function Used – hist
Analysis
The above histogram shows the frequency distribution of the quantity ordered. The most
frequently ordered quantity is 1, with a frequency greater than 3,000, followed by a
quantity of 2 with a frequency close to 2,500. Generally speaking, the frequency
decreases as the quantity ordered increases. A quantity of 14 has the lowest frequency.
Code Used
> hist(superstore$Quantity,
+      main="Frequency Distribution of Quantity Ordered",
+      xlab="Quantity Ordered", ylab="Frequency", col="lightpink")
Full Screenshot
3. What is the percentage sales by category?
Plot Type – Pie Chart
Function Used – pie, group_by, summarize, round, paste
Analysis
The above pie chart shows the percentage sales by category. There are three categories:
Technology, Furniture and Office Supplies. The “Technology” category contributed the
most to sales, at 36%. It is followed by “Furniture” at 32%, and “Office Supplies”
contributed the least, at 31% (the rounded percentages sum to 99%).
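Code Used (from the Code Summary): the sales are grouped by category with dplyr, and each slice is labeled with its rounded percentage:

```r
> install.packages("dplyr")
> library("dplyr")
> library(magrittr)
> # total sales per category
> gd <- superstore %>% group_by(Category) %>% summarize(Sales=sum(Sales))
> # build labels such as "Technology 36 %"
> pct <- round(gd$Sales/sum(gd$Sales)*100)
> lbls <- paste(gd$Category, pct)
> lbls <- paste(lbls, "%", sep=" ")
> colors = c('lightskyblue','plum2','peachpuff')
> pie(gd$Sales, labels = lbls, main="Percentage Sales By Category", col=colors)
```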
4. Which sub-category incurred losses? Which is the most profitable sub-category?
How are the overall sales across the various categories and sub-categories?
Plot Type – Tree Map
Function Used – list, treemap
Analysis
The above Tree Map provides information about the sales and profit of the various
product categories and sub-categories. Cell size is determined by sales, and the color
gradient describes profit. It can be concluded from the map that the sub-category
“Phones” under “Technology” has the highest sales, the “Furniture” category incurred
losses, and the most profitable sub-category is “Copiers”.
Code Used
> install.packages("treemap")
> library(treemap)
> treemap(data, index = c("Category","Sub.Category"), vSize = "Sales",
+   vColor = "Profit", type = "value", palette = "RdYlGn",
+   range = c(-20000, 60000), mapping = c(-20000, 10000, 60000),
+   title = "Sales Treemap For categories", fontsize.labels = c(15, 10),
+   align.labels = list(c("centre","centre"), c("left","top")))
Full Screenshot
5. What is the correlation between Sales, Quantity, Discount and Profit?
Plot Type – Correlation Matrix
Function Used – corrplot, cor
Analysis
This is a correlation matrix chart that shows the correlations among Sales, Quantity,
Discount and Profit. The color gradient from red to blue describes the strength and
direction of each correlation, red being negative and blue being positive. It can be
seen that “Sales” and “Profit” are somewhat positively correlated, “Profit” and
“Quantity” are very weakly correlated, and “Profit” and “Discount” are negatively
correlated.
Code Used
> install.packages("corrplot")
> mydata <- superstore[, c(18,19,20,21)]
> View(mydata)
> library(corrplot)
> mydata.cor = cor(mydata)
> mydata.cor
> corrplot(mydata.cor)
Full Screenshot
6. Which product types have been ordered the most times?
Plot Type – Word Cloud
Function Used – wordcloud
Analysis
Word clouds (also known as text clouds or tag clouds) work in a simple way: the more a
specific word appears in a source of textual data (such as a speech, blog post, or
database), the bigger and bolder it appears in the word cloud. In our case we want to
know what kinds of products have been ordered frequently. Looking at the above word
cloud, it is clear that products related to “Xerox” have been ordered the most. Products
related to binders, chairs and Avery have also been ordered many times.
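The word cloud code does not appear in the Code Summary. A minimal sketch of how it could be produced, assuming the tm and wordcloud packages and the Product.Name column (both assumptions, not shown in the original), is:

```r
# assumed approach: build a term-frequency table from product names,
# then plot a word cloud where frequent words appear bigger and bolder
library(tm)
library(wordcloud)
library(RColorBrewer)

corpus <- Corpus(VectorSource(superstore$Product.Name))  # assumed column
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

wordcloud(names(freq), freq, max.words = 100,
          colors = brewer.pal(8, "Dark2"))
```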
STATISTICAL SUMMARY & FUNCTIONS
1. Statistical Summary
Question - Provide a statistical summary of the Sales.
Answer – Given below is the statistical summary of the Sales:
Statistics | Value | Meaning
Min. (Minimum) | 0.444 | The lowest value of the sales present in the table.
1st Qu. (First Quartile) | 17.280 | The first quartile (Q1) is the middle number between the smallest value and the median; it splits off the lowest 25% of the data from the highest 75%.
Median | 54.490 | The middle number in the sequence when the values are ordered by rank.
Mean | 229.858 | The average of the Sales: the sum of all Sales values divided by the total number of values.
3rd Qu. (Third Quartile) | 209.940 | The third quartile (Q3) is the middle number between the median and the highest value; it splits off the highest 25% of the data from the lowest 75%.
Max. (Maximum) | 22638.480 | The highest value of the sales present in the table.
Code Used for Execution
> setwd("~/Desktop/BI")
> superstore<-read.csv("superstore.csv")
> View(superstore)
> summary(superstore$Sales)
Result
Full Screenshot
2. User Defined Function
Question – What is the total sales for each year for a particular user-provided state?
Answer – As a solution, we created a user-defined function that takes a state name as an
input parameter and displays the total sales by year for that state by plotting a line
graph.
The state name provided by the user is validated against the states present in the
superstore table. If it is not present, an error message is shown; if it is present, a
line chart is plotted to display the result.
Full Screenshot
Function Code
# Function returns total sales by year for the entered state
statesales<-function(inputstate)
{
# importing libraries
library(tidyr)
library(dplyr)
library(ggplot2)
print(paste("The State provided by the user is: ", inputstate))
# retrieving distinct state name from the table
state_name<-distinct(superstore, State)
# checking if the state provided is correct or not
isvalid<- any(state_name == inputstate)
# if the state name provided is valid, a graph will be plotted
if (isvalid==TRUE)
{
selected<-select(superstore, State, Sales, year)
filtered<-filter(selected,State==inputstate)
aggregated<-aggregate(filtered$Sales,by=list(filtered$year),sum)
print(aggregated)
# plotting line chart
ggplot(data=aggregated, aes(x=Group.1, y=x, group=1)) +
  geom_line(color="red") + geom_point(color="blue") +
  xlab("Year") + ylab("Total Sales") +
  ggtitle("Total Sales by year")
}
else
{ print('Enter correct state name') }
}
Execution Script
> setwd("~/Desktop/BI")
> source("sales.R")
> statesales("LA")
[1] "The State provided by the user is: LA"
[1] "Enter correct state name"
> statesales("California")
[1] "The State provided by the user is: California"
Group.1 x
1 15 91303.53
2 16 88443.84
3 17 131551.91
4 18 146388.34
CODE SUMMARY
1. Data Cleaning Codes
a. Renaming Column
colnames(superstore)[colnames(superstore)=="CT"] <- "City"
b. Removing unwanted Column
superstore = subset(superstore, select = -c(Country) )
c. Duplicating the column and splitting into 3 columns
superstore$order<-superstore$Order.Date
library(tidyr)
superstore<-separate(superstore,order,c("month","day","year"),sep="/")
2. Visualization Codes
a. Bar Chart
> countsR <- table(superstore$Region)
> barplot(countsR, main="Total Orders by Region",
+ xlab="Region", col="lightblue")
b. Histogram
> hist(superstore$Quantity,
+      main="Frequency Distribution of Quantity Ordered",
+      xlab="Quantity Ordered", ylab="Frequency", col="lightpink")
c. Pie Chart
> install.packages("dplyr")
> library("dplyr")
> library(magrittr)
> gd <- superstore %>% group_by(Category) %>% summarize(Sales=sum(Sales))
> pct<-round(gd$Sales/sum(gd$Sales)*100)
> lbls<-paste(gd$Category,pct)
> lbls<-paste(lbls, "%", sep= " ")
> colors = c('lightskyblue','plum2','peachpuff')
> pie(gd$Sales, labels = lbls,main="Percentage Sales By Category",col=colors)
4. User Defined Function Code
# Function returns total sales by year for the entered state
statesales<-function(inputstate)
{
# importing libraries
library(tidyr)
library(dplyr)
library(ggplot2)
print(paste("The State provided by the user is: ", inputstate))
# retrieving distinct state name from the table
state_name<-distinct(superstore, State)
# checking if the state provided is correct or not
isvalid<- any(state_name == inputstate)
# if the state name provided is valid, a graph will be plotted
if (isvalid==TRUE)
{
selected<-select(superstore, State, Sales, year)
filtered<-filter(selected,State==inputstate)
aggregated<-aggregate(filtered$Sales,by=list(filtered$year),sum)
print(aggregated)
# plotting line chart
ggplot(data=aggregated, aes(x=Group.1, y=x, group=1)) +
  geom_line(color="red") + geom_point(color="blue") +
  xlab("Year") + ylab("Total Sales") +
  ggtitle("Total Sales by year")
}
else
{ print('Enter correct state name') }
}