CIS-5270 BUSINESS INTELLIGENCE
Table of Contents
S. No. Topic
1 Introduction and Goal
2 Data Set
   1. Data Set URL
   2. About the dataset
   3. Dataset details
   4. Column details
3 Data Cleaning
   1. Renaming column
   2. Removing unwanted column
   3. Duplicating and splitting column
4 Analysis & Visualizations
   1. Bar Chart
   2. Histogram
   3. Pie Chart
   4. Tree Map
   5. Correlation Matrix
   6. Word Cloud
5 Statistical Summary & Functions
   1. Statistical Summary
   2. User Defined Functions
6 Code Summary
INTRODUCTION AND GOAL
1. Introduction:
The superstore industry consists of companies that operate large stores which stock and
supply large volumes of goods. These stores sell a typical product line of grocery items
and general merchandise, such as food, pharmaceuticals, apparel, games and toys, hobby
items, furniture and appliances. Analyzing this industry is important because it gives
insight into the sales and profits of various products. Our analysis is based on a
superstore dataset for the United States in which the products were ordered between 2015
and 2018.
2. Goal: To find out various superstore statistics, such as:
Region that accounts for the greatest number of orders
Frequency distribution of quantity ordered
Percentage of sales by category
Most profitable category and sub-category
Category and sub-category that incurred losses
Product types that were ordered most often
Yearly sales for a given state
With this analysis, the superstore can identify patterns in shopping behavior and take
corrective measures where required.
DATA SET
1. Data Set URL:
https://data.world/stanke/sample-superstore-2018
2. About the dataset:
The dataset provides information about the sales and profit of a US superstore from 2015
to 2018.
3. Dataset details:
Size 2.4 MB
Number of columns 21
Number of rows 9994
Original file format XLS
4. Column details:
The dataset contains the following columns:
Column Name      Column Detail
Row ID           Unique row ID
Order ID         Unique order ID
Order Date       Date the order was placed
Ship Date        Date the order was shipped
Ship Mode        Shipping mode of the order
Customer ID      Unique ID of the customer
Customer Name    Customer’s name
Segment          Product segment
Country          US
City             City where the product was ordered
State            State where the product was ordered
Postal Code      Postal code for the order
Region           Region where the product was ordered
Product ID       Unique product ID
Category         Product category
Sub-Category     Product sub-category
Product Name     Name of the product
Sales            Sales contribution of the order
Quantity         Quantity ordered
Discount         Discount provided on the order
Profit           Profit for the order
DATA CLEANING
1. Renaming Column
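Goal: The column name “CT” was not descriptive. The aim is to rename the column to “City”.
[Screenshots: column names before and after renaming]
Code Used
colnames(superstore)[colnames(superstore)=="CT"] <- "City"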
2. Removing unwanted Column
Goal: The column named “Country” needs to be removed because it contains only one value,
“United States”.
[Screenshots: columns before and after removal]
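Before dropping a column on the grounds that it holds a single value, that assumption can be checked directly. The following is a minimal base-R sketch on a small made-up data frame; `toy` is a hypothetical stand-in for `superstore`, and its values are illustrative only.

```r
# Toy stand-in for the superstore table (values are illustrative only)
toy <- data.frame(
  Country = c("United States", "United States", "United States"),
  City    = c("Los Angeles", "Seattle", "Houston"),
  Sales   = c(10, 20, 30)
)

# Number of distinct values in the candidate column
n_distinct_country <- length(unique(toy$Country))

# Drop the column only if it carries no information (a single value)
if (n_distinct_country == 1) {
  toy <- subset(toy, select = -c(Country))
}

print(colnames(toy))
```

Guarding the removal this way keeps the column in place if a later version of the data contains more than one country.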
3. Duplicating the column and splitting it into 3 columns
Goal: To duplicate the column “Order.Date” as “order” and then split “order” into month,
day and year.
[Screenshots: before (no column after Profit), after duplicating, and after splitting the
order column]
Code Used
superstore$order<-superstore$Order.Date
library(tidyr)
superstore<-separate(superstore,order,c("month","day","year"),sep="/")
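Splitting on "/" leaves month, day and year as text. As an alternative sketch in base R, the dates can be parsed first and the parts extracted as numbers. The month/day/two-digit-year format is an assumption, and the sample dates below are made up.

```r
# Illustrative order dates in month/day/two-digit-year form (assumed format)
order_date <- c("11/8/16", "6/12/15", "10/11/18")

# Parse as Date objects; %y maps two-digit years 00-68 to 2000-2068
parsed <- as.Date(order_date, format = "%m/%d/%y")

# Extract the components as integers
parts <- data.frame(
  month = as.integer(format(parsed, "%m")),
  day   = as.integer(format(parsed, "%d")),
  year  = as.integer(format(parsed, "%Y"))
)
print(parts)
```

Numeric parts sort correctly (month 10 after month 6) and four-digit years read unambiguously on a chart axis.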
ANALYSIS & VISUALIZATIONS
1. What is the total number of orders by region?
Plot Type - Bar Chart
Function Used – barplot, table
Analysis
The bar chart displays the total number of orders by region. The Western region has the
highest order count (greater than 3000), followed by the Eastern region with a count close
to 3000 and the Central region with a count of around 2300. The Southern region placed the
fewest orders (around 1500).
Code Used
> countsR <- table(superstore$Region)
> barplot(countsR, main="Total Orders by Region",
+ xlab="Region", col="lightblue")
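`table()` counts rows. If each row is one product line within an order (an assumption about this data, in which case Order ID values repeat), row counts and distinct order counts differ. A small base-R sketch with made-up values shows the two side by side:

```r
# Toy data: three product lines in the West span two orders,
# three product lines in the East belong to a single order
toy <- data.frame(
  Order.ID = c("A-1", "A-1", "A-2", "B-1", "B-1", "B-1"),
  Region   = c("West", "West", "West", "East", "East", "East")
)

# Rows per region (what table() reports)
rows_per_region <- table(toy$Region)

# Distinct orders per region
orders_per_region <- tapply(toy$Order.ID, toy$Region,
                            function(ids) length(unique(ids)))

print(rows_per_region)
print(orders_per_region)
```

Either vector can be passed to `barplot()`; which one is appropriate depends on whether "orders" means order lines or distinct orders.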
2. What is the frequency distribution of quantity ordered?
Plot Type - Histogram
Function Used – hist
Analysis
The histogram shows the frequency distribution of the quantity ordered. The most frequent
quantity is 1, with a frequency greater than 3000, followed by 2, with a frequency close to
2500. In general, the frequency decreases as the quantity ordered increases. A quantity of
14 has the lowest frequency.
Code Used
> hist(superstore$Quantity, main="Frequency Distribution of Quantity Ordered",
+ xlab="Quantity Ordered", ylab="Frequency", col="lightpink")
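For whole-number data such as quantities, the bar heights depend on where the bin edges fall. `hist()` uses right-closed bins and places the lowest value in the first bin, so when the edges sit on whole numbers the two smallest values can share one bar. The sketch below, on made-up values, centers each bin on an integer and checks the result against exact counts from `table()`.

```r
# Toy quantities (illustrative only)
quantity <- c(1, 1, 2, 2, 2, 3, 3, 4)

# Exact count per distinct quantity
exact_counts <- table(quantity)

# Bin edges at 0.5, 1.5, 2.5, ... so every integer gets its own bin
breaks <- seq(min(quantity) - 0.5, max(quantity) + 0.5, by = 1)
h <- hist(quantity, breaks = breaks, plot = FALSE)

print(exact_counts)
print(h$counts)
```

With edges chosen this way the histogram bars and the `table()` counts agree, which makes per-quantity statements easier to verify.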
3. What is the percentage sales by category?
Plot Type – Pie Chart
Function Used – pie, group_by, summarize, round, paste
Analysis
The pie chart shows the percentage of sales by category. There are three categories:
Technology, Furniture and Office Supplies. “Technology” contributed the most to sales
(36%), followed by “Furniture” (32%). “Office Supplies” contributed the least (31%). The
shares add up to 99% rather than 100% because each one is rounded to a whole number.
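The percentage calculation behind a chart like this can be sketched in base R. The data frame below is made up, and rounding to one decimal place is an illustrative choice that reduces the rounding gap.

```r
# Toy sales by category (values are illustrative only)
toy <- data.frame(
  Category = c("Technology", "Furniture", "Office Supplies", "Technology"),
  Sales    = c(300, 250, 250, 200)
)

# Total sales per category
sales_by_category <- tapply(toy$Sales, toy$Category, sum)

# Share of total sales, in percent, rounded to one decimal place
pct <- round(100 * prop.table(sales_by_category), 1)
print(pct)
```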
4. Which sub-category incurred losses? Which is the most profitable sub-category?
How do overall sales compare across categories and sub-categories?
Plot Type – Tree Map
Function Used – list, treemap
Analysis
The Tree Map provides information about the sales and profit of the various product
categories and sub-categories. The cell size is determined by sales, and the color
gradient describes the profit. The map shows that the sub-category “Phones” under
“Technology” has the highest sales. Losses were incurred within the “Furniture” category.
The most profitable sub-category is “Copiers”.
Code Used
> install.packages("treemap")
> library(treemap)
> treemap(superstore, index = c("Category","Sub.Category"), vSize = "Sales",
+ vColor = "Profit", type = "value", palette = "RdYlGn",
+ range = c(-20000,60000), mapping = c(-20000,10000,60000),
+ title = "Sales Treemap For categories", fontsize.labels = c(15,10),
+ align.labels = list(c("centre","centre"), c("left","top")))
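Readings taken from a tree map can be cross-checked numerically. The following base-R sketch totals sales and profit per category and sub-category and picks out the loss-making and most profitable groups. The data frame and its group names (`CatA`, `A1`, and so on) are hypothetical.

```r
# Toy data with made-up category and sub-category names
toy <- data.frame(
  Category     = c("CatA", "CatA", "CatB", "CatB"),
  Sub.Category = c("A1", "A2", "B1", "B1"),
  Sales        = c(100, 200, 150, 50),
  Profit       = c(-10, 20, 30, 15)
)

# Total sales and profit for each category/sub-category pair
totals <- aggregate(cbind(Sales, Profit) ~ Category + Sub.Category,
                    data = toy, FUN = sum)

# Sub-categories with a negative total profit
losses <- totals[totals$Profit < 0, ]

# Sub-category with the highest total profit
best <- totals[which.max(totals$Profit), ]

print(totals)
```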
5. What is the correlation between Sales, Quantity, Discount and Profit?
Plot Type – Correlation Matrix
Function Used – corrplot, cor
Analysis
This correlation matrix chart shows the correlation between each pair of variables. The
color gradient from red to blue describes the direction and strength of the correlation
among Sales, Quantity, Discount and Profit, with red indicating a negative correlation and
blue a positive one. “Sales” and “Profit” are somewhat correlated. “Profit” and
“Quantity” are very weakly correlated. “Profit” and “Discount” are negatively correlated.
Code Used
> install.packages("corrplot")
> mydata <- superstore[, c("Sales","Quantity","Discount","Profit")]
> View(mydata)
> library(corrplot)
> mydata.cor = cor(mydata)
> mydata.cor
> corrplot(mydata.cor)
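To make the sign convention concrete, here is a small base-R sketch on made-up numbers: one column rises exactly in step with `Sales` and another falls exactly in step, so the coefficients come out at the two extremes.

```r
# Toy data (illustrative only)
toy <- data.frame(
  Sales    = c(10, 20, 30, 40),
  Profit   = c(1, 2, 3, 4),          # rises with Sales
  Discount = c(0.4, 0.3, 0.2, 0.1)   # falls as Sales rises
)

# Pearson correlation matrix; columns are selected by name
toy_cor <- cor(toy[, c("Sales", "Profit", "Discount")])
print(round(toy_cor, 2))
```

Values near 0 indicate little linear relationship, which is how a "very weak" cell in the matrix should be read.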
6. Which product types have been ordered most often?
Plot Type – Word Cloud
Function Used – wordcloud
Analysis
Word clouds (also known as text clouds or tag clouds) work in a simple way: the more often
a specific word appears in a source of textual data (such as a speech, blog post, or
database), the bigger and bolder it appears in the word cloud. In our case, we want to
know which kinds of products have been ordered frequently. The word cloud shows that
products related to “Xerox” have been ordered the most. Products related to binders,
chairs and Avery have also been ordered many times.
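The word counts that a word cloud scales its text by can be sketched in base R. The product names below are made up for illustration, and splitting on anything that is not a letter is one simple tokenizing choice among several.

```r
# Made-up product names (illustrative only)
product_names <- c("Xerox 1967", "Xerox 225",
                   "Avery Binder Labels", "Economy Binder")

# Lower-case the names and split on runs of non-letter characters
words <- unlist(strsplit(tolower(product_names), "[^a-z]+"))
words <- words[nchar(words) > 0]   # drop any empty tokens

# Frequency of each word, most frequent first
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```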
STATISTICAL SUMMARY & FUNCTIONS
1. Statistical Summary
Question - Provide a statistical summary of the Sales.
Answer – Given below is the statistical summary of the Sales:
Statistics                 Value       Meaning
Min. (Minimum)             0.444       The lowest sales value in the table.
1st Qu. (First Quartile)   17.280      The first quartile (Q1) is the middle number between the smallest number and the median of the data set. It splits off the lowest 25% of the data from the highest 75%.
Median                     54.490      The middle number of the sales values when they are ordered by rank.
Mean                       229.858     The average of the sales values: the sum of all sales values divided by the number of sales records.
3rd Qu. (Third Quartile)   209.940     The third quartile (Q3) is the middle number between the median and the highest value of the data set. It splits off the highest 25% of the data from the lowest 75%.
Max. (Maximum)             22638.480   The highest sales value in the table.
Code Used for Execution
> setwd("~/Desktop/BI")
> superstore<-read.csv("superstore.csv")
> View(superstore)
> summary(superstore$Sales)
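A mean well above the median, as in the summary of Sales, is typical of data with a few very large values. A tiny made-up vector shows the same pattern with `summary()`:

```r
# Toy values with one large outlier (illustrative only)
sales <- c(1, 2, 3, 4, 100)

# Six-number summary: Min, 1st Qu., Median, Mean, 3rd Qu., Max
s <- summary(sales)
print(s)

# The single large value pulls the mean far above the median
mean_exceeds_median <- s[["Mean"]] > s[["Median"]]
```

Here the median is 3 while the mean is 22, because the single value of 100 dominates the sum.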
2. User Defined Function
Question – What are the total sales for each year for a state provided by the user?
Answer – To answer this question, we created a user defined function that takes a state
name as its input parameter and displays the total sales by year for that state by
plotting a line graph.
The state name provided by the user is validated to check whether it is present in the
superstore table. If it is not present, an error message is shown. If it is present, the
line chart is plotted to display the result.
Function Code
# Function returns total sales by year for the entered state
statesales<-function(inputstate)
{
  # importing libraries
  library(tidyr)
  library(dplyr)
  library(ggplot2)
  print(paste("The State provided by the user is: ", inputstate))
  # retrieving distinct state name from the table
  state_name<-distinct(superstore, State)
  # checking if the state provided is correct or not
  isvalid<- any(state_name == inputstate)
  # if the state name provided is valid, a graph will be plotted
  if (isvalid)
  {
    selected<-select(superstore, State, Sales, year)
    filtered<-filter(selected,State==inputstate)
    aggregated<-aggregate(filtered$Sales,by=list(filtered$year),sum)
    print(aggregated)
    # plotting line chart; each "+" ends a line so that the expression continues
    ggplot(data=aggregated, aes(x=Group.1, y=x, group=1)) + geom_line(color="red") +
      geom_point(color="blue") + xlab("Year") + ylab("Total Sales") +
      ggtitle("Total Sales by year")
  }
  else
  { print('Enter correct state name') }
}
Execution Script
> setwd("~/Desktop/BI")
> source("sales.R")
> statesales("LA")
[1] "The State provided by the user is: LA"
[1] "Enter correct state name"
> statesales("California")
[1] "The State provided by the user is: California"
Group.1 x
1 15 91303.53
2 16 88443.84
3 17 131551.91
4 18 146388.34
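The validate-then-aggregate logic can also be sketched without the plotting step, which makes it easy to check on a small table. The helper name `state_sales_by_year` and the toy data are hypothetical. Unlike `statesales`, this version takes the data frame as an argument and returns the totals instead of plotting them.

```r
# Toy stand-in for the superstore table (values are illustrative only)
toy <- data.frame(
  State = c("California", "California", "California", "Texas"),
  year  = c("15", "15", "16", "15"),
  Sales = c(100, 50, 75, 20)
)

# Hypothetical helper: yearly sales totals for one state,
# or NULL (with a message) when the state is not in the table
state_sales_by_year <- function(df, state) {
  if (!(state %in% df$State)) {
    message("Enter correct state name")
    return(NULL)
  }
  rows <- df[df$State == state, ]
  aggregate(Sales ~ year, data = rows, FUN = sum)
}

result  <- state_sales_by_year(toy, "California")
unknown <- state_sales_by_year(toy, "LA")
print(result)
```

Returning the aggregated table keeps the calculation separate from the chart, so the same result can be printed, tested or plotted.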
CODE SUMMARY
1. Data Cleaning Codes
a. Renaming Column
colnames(superstore)[colnames(superstore)=="CT"] <- "City"
b. Removing unwanted Column
superstore = subset(superstore, select = -c(Country) )
c. Duplicating the column and splitting into 3 columns
superstore$order<-superstore$Order.Date
library(tidyr)
superstore<-separate(superstore,order,c("month","day","year"),sep="/")
2. Visualization Codes
a. Bar Chart
> countsR <- table(superstore$Region)
> barplot(countsR, main="Total Orders by Region",
+ xlab="Region", col="lightblue")
b. Histogram
> hist(superstore$Quantity, main="Frequency Distribution of Quantity Ordered",
+ xlab="Quantity Ordered", ylab="Frequency", col="lightpink")
c. Pie Chart
> install.packages("dplyr")
> library("dplyr")
> library(magrittr)
> gd <- superstore %>% group_by(Category) %>% summarize(Sales=sum(Sales))
> pct<-round(gd$Sales/sum(gd$Sales)*100)
> lbls<-paste(gd$Category,pct)
> lbls<-paste(lbls, "%", sep= " ")
> colors = c('lightskyblue','plum2','peachpuff')
> pie(gd$Sales, labels = lbls,main="Percentage Sales By Category",col=colors)
4. User Defined Function Code
# Function returns total sales by year for the entered state
statesales<-function(inputstate)
{
  # importing libraries
  library(tidyr)
  library(dplyr)
  library(ggplot2)
  print(paste("The State provided by the user is: ", inputstate))
  # retrieving distinct state name from the table
  state_name<-distinct(superstore, State)
  # checking if the state provided is correct or not
  isvalid<- any(state_name == inputstate)
  # if the state name provided is valid, a graph will be plotted
  if (isvalid)
  {
    selected<-select(superstore, State, Sales, year)
    filtered<-filter(selected,State==inputstate)
    aggregated<-aggregate(filtered$Sales,by=list(filtered$year),sum)
    print(aggregated)
    # plotting line chart; each "+" ends a line so that the expression continues
    ggplot(data=aggregated, aes(x=Group.1, y=x, group=1)) + geom_line(color="red") +
      geom_point(color="blue") + xlab("Year") + ylab("Total Sales") +
      ggtitle("Total Sales by year")
  }
  else
  { print('Enter correct state name') }
}