Marketing Analysis for The Bee Corp
Qingyang(Kevin) Liu
Email: tug14939@temple.edu
June 22, 2017
1 Introduction of the dataset
The original file Quant Round.xlsx contains three sheets. Base R cannot read the xlsx format without additional packages, so I converted Quant Round.xlsx to Quant Round.csv and kept only the first sheet: the csv format is far more widely supported, and the first sheet of Quant Round.xlsx contains all the information we need.
The import with the read.csv command in R is shown below:
> df1 <- read.csv(file =
+ "/home/kevin/Desktop/The Bee Corp/Quant Round.csv",
+ header = TRUE)
> dim(df1)
[1] 9994 22
The Quant Round.csv file has been imported into R as the data frame df1, which contains 9994 rows and 22 variables.
The important variables are summarized below.
Row.ID: The primary key for this dataset. This variable is unique for each row.
Order.ID: The order identifier. This variable does not have to be unique: one order may span multiple rows (one order may contain several products). There are 5009 distinct orders in df1.
Order.Date: The date when the order was created or submitted. Order.Date was stored in numeric format. I converted it to yyyy-mm-dd format, assuming the origin date is "1900-01-01".
Ship.Date: The date when the order was shipped. Also stored in numeric format and converted to yyyy-mm-dd format under the same assumed origin date, "1900-01-01".
Ship.Mode: There are four ship modes: Same Day, First Class, Standard Class and Second Class.
Customer.ID: The customer identifier. Each customer has one unique ID.
Segment: There are three segments in this dataset: Customer, Corporate and Home Office.
(Corporate␣ has been corrected to Corporate)
Country: All orders were shipped within the United States.
City: There are 531 different cities in this dataset.
State: The dataset covers the 48 contiguous U.S. states and the District of Columbia.
(CAL␣ has been corrected to California. IND␣ has been corrected to Indiana)
Region: There are five regions in this dataset: Central, East, North, South and West. The original dataset contains a few mistakes. For example, 37 Florida records are categorized as North region.
Product.ID: The product identifier. Each product has one unique ID.
Category: All products belong to one of three categories: Furniture, Office Supplies and Technology.
Sub.Category: The relationship between Sub.Category and Category is shown in Table 1.1. Each Sub.Category belongs to exactly one Category.
Table 1.1: Sub.Category (in rows) and Category (in columns)
Furniture Office Supplies Technology
Accessories 0 0 775
Appliances 0 466 0
Art 0 796 0
Binders 0 1523 0
Bookcases 228 0 0
Chairs 617 0 0
Copiers 0 0 68
Envelopes 0 254 0
Fasteners 0 217 0
Furnishings 957 0 0
Labels 0 364 0
Machines 0 0 115
Paper 0 1370 0
Phones 0 0 889
Storage 0 846 0
Supplies 0 190 0
Tables 319 0 0
SalesTotal: SalesTotal = Item.Price × Quantity, where Item.Price is the price after discount.
Profit: A positive number indicates a profit. A negative number indicates a deficit.
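The date conversion and the label corrections described in the list above could be sketched as follows. This is a minimal illustration, not the original cleaning script: the serial numbers and the raw labels are made-up stand-ins for the real columns, and the origin follows the report's assumption of "1900-01-01". If the numbers are Excel serial dates from a Windows workbook, the example in R's as.Date documentation uses origin "1899-12-30" instead, so the origin is worth checking against a known date.

```r
# Made-up serial values standing in for the numeric Order.Date column
serial <- c(42000, 42001, 42370)

# Treat each number as a day count from the origin assumed in the report
order_date <- as.Date(serial, origin = "1900-01-01")
order_date_chr <- format(order_date, "%Y-%m-%d")  # yyyy-mm-dd strings

# Made-up raw labels with trailing spaces, as in "Corporate " and "CAL "
segment_raw <- c("Corporate ", "Corporate", "Home Office")
state_raw <- c("CAL ", "IND ", "Texas")

# Strip surrounding whitespace, then expand the abbreviations named in the report
segment_clean <- trimws(segment_raw)
state_clean <- trimws(state_raw)
fixes <- c(CAL = "California", IND = "Indiana")
hit <- state_clean %in% names(fixes)
state_clean[hit] <- unname(fixes[state_clean[hit]])
```

Trimming before recoding means that a label such as "CAL" with or without the trailing space is handled by the same lookup.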
2 Sales/Profit by Region
Figure 2.1: Maps of Sales and Profit in State Level
[Two state-level maps. Panel 1: Total Sales in State Level, color scale from 0 to 450000. Panel 2: Profit in State Level, color scale from −20000 to 80000.]
Table 2.1: Inconsistent definition of regions (part)
> head(table(df1$State,df1$Region),10)
Central East North South West
Alabama 0 0 1 60 0
Arizona 0 0 0 0 224
Arkansas 0 0 1 59 0
California 0 0 0 0 2001
Colorado 0 0 0 0 182
Connecticut 0 82 0 0 0
Delaware 0 96 0 0 0
District of Columbia 0 10 0 0 0
Florida 0 0 37 346 0
Georgia 0 0 6 178 0
Table 2.2: Top 4 States by Sales
> head(df2[order(-df2$sales.ratio),],4)
state sales profit sales.ratio
4 california 457687.6 76381.39 0.19923710
31 new york 310876.3 74038.55 0.13532829
42 texas 170188.0 -25729.36 0.07408497
46 washington 138641.3 33402.65 0.06035226
Table 2.3: Top 4 States by Profit
> head(df2[order(-df2$profit),],4)
state sales profit sales.ratio
4 california 457687.63 76381.39 0.19923710
31 new york 310876.27 74038.55 0.13532829
46 washington 138641.27 33402.65 0.06035226
21 michigan 76269.61 24463.19 0.03320111
Table 2.4: Discount in Texas
> table(df1[df1$State == "Texas","Discount"])
0.2 0.3 0.32 0.4 0.6 0.8
570 94 27 13 81 200
The rule for assigning regions is doubtful and inconsistent in this dataset. According to Table 2.1, 37 Florida records and 1 Alabama record are assigned to the North region, although most records from both states are assigned to South. More than 100 records are assigned to the wrong region. In real work, we would need to discuss the definition of each region with a supervisor. For this report, quantitative marketing analysis based on regions is skipped.
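One way to locate such inconsistencies is to count how many regions each state appears in. The sketch below uses a made-up data frame in place of df1, and it assumes that the majority region of a state is the intended one, which is my working assumption rather than a rule stated in the dataset.

```r
# Made-up records; column names follow df1 in the report
toy <- data.frame(
  State  = c("Florida", "Florida", "Florida", "Alabama", "Alabama", "Alabama", "Texas"),
  Region = c("South", "South", "North", "South", "South", "North", "Central"),
  stringsAsFactors = FALSE
)

# Cross-tabulate State by Region, as in Table 2.1
tab <- table(toy$State, toy$Region)

# A state is inconsistent if its records fall into more than one region
n_regions <- rowSums(tab > 0)
inconsistent <- names(n_regions)[n_regions > 1]

# Records outside the majority region of their state (assumed to be mislabeled)
minority <- sum(tab) - sum(apply(tab, 1, max))
```

Applied to df1, the same steps would give the list of affected states and a count of suspect records to bring to the discussion about region definitions.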
According to Table 2.2 and Table 2.3, California is the company's largest market and New York State its second largest, whether ranked by sales or by profit. The sales and profit figures for Texas point in opposite directions: by sales, Texas is the company's third largest market, yet the company lost $25,729.36 there. Table 2.4 shows that the company discounts heavily in Texas: every record in the Texas market carries a discount of at least 20%, and 200 records carry an 80% discount.
Furthermore, Table 2.5 and Table 2.6 show that every sale in the deficit markets carries a discount of at least 20%. Discounting is an important reason for the deficit in those markets. We need to discuss the reason for the heavy discounting with the business manager: it could be a market penetration strategy, or those products may simply be too difficult to sell.
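The state-level figures behind this argument can be reproduced with a few base R calls. The sketch below is only an illustration on made-up transactions; the column names follow df1, but the values and the object names by_state, loss_states and min_discount are my own.

```r
# Made-up transactions; column names follow df1 in the report
toy <- data.frame(
  State      = c("California", "California", "Texas", "Texas"),
  SalesTotal = c(100, 200, 150, 50),
  Profit     = c(20, 30, -40, 5),
  Discount   = c(0, 0.2, 0.8, 0.2),
  stringsAsFactors = FALSE
)

# Total sales and profit per state
by_state <- aggregate(cbind(SalesTotal, Profit) ~ State, data = toy, FUN = sum)

# Share of company-wide sales, and profit earned per dollar of sales
by_state$sales.ratio <- by_state$SalesTotal / sum(by_state$SalesTotal)
by_state$Profit.Sales.Ratio <- by_state$Profit / by_state$SalesTotal

# States that lose money overall, and the smallest discount seen in each state
loss_states <- as.character(by_state$State[by_state$Profit < 0])
min_discount <- tapply(toy$Discount, toy$State, min)
min_discount[loss_states]
```

As a check on the reported numbers, the Texas row of Table 2.5 gives −25729.356 / 170188.05 ≈ −0.151, which matches its Profit.Sales.Ratio.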
Table 2.5: Deficit Markets at the State Level
> df3[df3$Profit < 0,]
State Profit SalesTotal Profit.Sales.Ratio
40 Oregon -1190.470 17431.15 -0.06829558
41 Florida -3399.302 89473.71 -0.03799219
42 Arizona -3427.925 35282.00 -0.09715789
43 Tennessee -5341.694 30661.87 -0.17421289
44 Colorado -6527.858 32108.12 -0.20330864
45 North Carolina -7490.912 55603.16 -0.13472097
46 Illinois -12607.887 80166.10 -0.15727205
47 Pennsylvania -15559.960 116511.91 -0.13354823
48 Ohio -16971.377 78258.14 -0.21686405
49 Texas -25729.356 170188.05 -0.15118192
Table 2.6: Discount in Deficit Markets
> tab1 <- table(df1[,c("State","Discount")])
> tab1[as.character(df3[df3$Profit < 0,"State"]),]
Discount
State 0 0.1 0.15 0.2 0.3 0.32 0.4 0.45 0.5 0.6 0.7 0.8
Oregon 0 0 0 100 0 0 0 0 5 0 19 0
Florida 0 0 0 299 0 0 0 11 6 0 67 0
Arizona 0 0 0 174 0 0 0 0 9 0 41 0
Tennessee 0 0 0 144 0 0 8 0 2 0 29 0
Colorado 0 0 0 138 0 0 0 0 4 0 40 0
North Carolina 0 0 0 201 0 0 8 0 4 0 36 0
Illinois 0 0 0 264 53 0 0 0 18 57 0 100
Pennsylvania 0 0 0 354 36 0 82 0 10 0 105 0
Ohio 0 0 0 290 23 0 67 0 8 0 81 0
Texas 0 0 0 570 94 27 13 0 0 81 0 200
Conclusion:
1. California and New York are the two most successful markets by either profit or sales.
2. The company is losing money in Texas, Ohio and eight other states, and heavy discounting is an important reason.
3. The regions are ill-defined, so no conclusion is drawn from them.
3 Profit by Category/Subcategory/Specific Product
According to Table 3.1, Technology and Office Supplies account for 50.79% and 42.77% of the company's total profit. Furniture products contribute only 6.44% of the total profit.
Table 3.1: Profit by Category
> library(plyr)
> df4 <- ddply(df1[,c("Category","Profit")],.(Category),colwise(sum))
> df4 <- arrange(df4,-df4$Profit)
> df4$percent <- round(df4$Profit/sum(df4$Profit)*100,2)
> df4
Category Profit percent
1 Technology 145454.95 50.79
2 Office Supplies 122490.80 42.77
3 Furniture 18451.27 6.44
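The percentages can be re-derived directly from the three totals in Table 3.1. The short check below uses only base R and the figures printed above; the object names are my own.

```r
# Category totals copied from Table 3.1
profit <- c(Technology = 145454.95, "Office Supplies" = 122490.80, Furniture = 18451.27)

# Each category's share of total profit, in percent, rounded to two decimals
percent <- round(profit / sum(profit) * 100, 2)
percent
```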
Figure 3.1: Profit by Category/Subcategory
[Bar chart titled "Profit by Category/Subcategory". Value scale (Profit/Deficit): −20000 to 60000. Bars are grouped by category: Furniture (Tables, Bookcases, Furnishings, Chairs); Office Supplies (Supplies, Fasteners, Labels, Art, Envelopes, Appliances, Storage, Binders, Paper); Technology (Machines, Accessories, Phones, Copiers). Legend: Profit, Deficit.]
Table 3.2: Most Profitable Products
Category | Sub.Category | Product.Name | Total.Quantity | Total_Profit | Average.Term.Price | Max.Item.Price | Min.Item.Price
Technology | Copiers | Canon imageCLASS 2200 Advanced Copier | 20 | 25199.93 | 1259.996 | 3499.99 | 2099.994
Office Supplies | Binders | Fellowes PB500 Electric Punch Plastic Comb Binding Machine with Manual Bind | 31 | 7753.039 | 250.098 | 1270.99 | 254.198
Technology | Copiers | Hewlett Packard LaserJet 3310 Copier | 38 | 6983.884 | 183.7864 | 599.99 | 359.994
Technology | Copiers | Canon PC1060 Personal Laser Copier | 19 | 4570.935 | 240.5755 | 10559.99 | 559.992
Technology | Machines | HP Designjet T520 Inkjet Large Format Printer - 24" Color | 12 | 4094.977 | 341.2481 | 1749.99 | 874.995
Technology | Machines | Ativa V4110MDD Micro-Cut Shredder | 11 | 3772.946 | 342.9951 | 699.99 | 699.99
Looking at Figure 3.1, we find that all 4 sub-categories in Technology are profitable. Copiers, Phones and Accessories each bring in more than $40,000 for the company. All sub-categories in Office Supplies except Supplies are profitable. Among the Furniture sub-categories, Chairs and Furnishings are profitable, while Bookcases and Tables are responsible for a deficit.
From Table 3.2, the most profitable product is the Canon imageCLASS 2200 Advanced Copier, which belongs to the Copiers sub-category of Technology. However, the Item.Price of the Canon PC1060 Personal Laser Copier is doubtful. The Max.Item.Price for that product is $10,559.99 while the Min.Item.Price is $559.992, a difference too large for a copier. I suspect the difference was caused by a typo. I discuss these large differences between maximum and minimum item price in Section 5.
Table 3.3: Details of Canon PC1060 Personal Laser Copier’s transaction
Product_Name Item_Price Quantity Discount
Canon PC1060 Personal Laser Copier 559.992 2 0.2
Canon PC1060 Personal Laser Copier 10559.992 5 0.2
Canon PC1060 Personal Laser Copier 559.992 5 0.2
Canon PC1060 Personal Laser Copier 699.99 7 0
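The four transactions in Table 3.3 can be checked against each other. The sketch below assumes that the undiscounted transaction ($699.99 at Discount 0) reflects the list price, which is my assumption and not something the dataset states, and flags any row whose Item_Price differs from list price × (1 − Discount).

```r
# Transactions copied from Table 3.3
tx <- data.frame(
  Item_Price = c(559.992, 10559.992, 559.992, 699.99),
  Quantity   = c(2, 5, 5, 7),
  Discount   = c(0.2, 0.2, 0.2, 0)
)

# Assumption: the row with no discount shows the list price
list_price <- tx$Item_Price[tx$Discount == 0][1]

# Price each row should show if the discount were applied to the list price
expected <- list_price * (1 - tx$Discount)

# Flag rows that deviate by more than one cent, and measure the deviation
tx$suspect <- abs(tx$Item_Price - expected) > 0.01
excess <- tx$Item_Price - expected
```

Under that assumption only the second row is flagged, and its price is too high by 10000, which is consistent with a data entry error.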
Conclusion:
1. Sub-categories such as Copiers, Phones and Accessories in the Technology category are highly profitable.
2. The performance of Furniture products is generally poor. Those products either make little profit or lose money for the company.
3. The most profitable product is the Canon imageCLASS 2200 Advanced Copier.
4. Some Item.Price values are doubtful (in Table 3.3, the same copier was sold at $10,559.992 and at $559.992).
4 Cluster Analysis (DEMO)
Cluster analysis is a powerful tool for marketing analysis, and it is especially handy when there are many continuous variables. Although this dataset has few continuous variables, we can still use the method to obtain some interesting findings.
We create a new dataset by aggregating on State. The first 6 rows of the new dataset are shown in Table 4.1. The cluster analysis is based on SalesTotal, Profit, Quantity and Avg.iterm.price.
Table 4.1: Dataset for clustering analysis
> df7 <- ddply(df1[,c("State","SalesTotal","Profit","Quantity")],
+ .(State),colwise(sum))
> df7$Avg.iterm.price <- df7$SalesTotal/df7$Quantity
> rownames(df7) <- as.character(df7$State)
> df7 <- df7[,2:5]
> head(df7)
SalesTotal Profit Quantity Avg.iterm.price
Alabama 19510.64 5786.825 256 76.21344
Arizona 35282.00 -3427.925 862 40.93040
Arkansas 11678.13 4008.687 240 48.65887
California 457687.63 76381.387 7667 59.69579
Colorado 32108.12 -6527.858 693 46.33206
Connecticut 13384.36 3511.492 281 47.63116
After standardizing each variable with the scale function, we calculate the Euclidean distance between each pair of states. We then use the "average" linkage method for hierarchical clustering. The initial result of the clustering is shown in Figure 4.1.
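The three steps (standardize, compute distances, cluster) can be sketched on the six rows printed in Table 4.1. This is a small illustration rather than the full analysis: it uses only those six states, and k = 2 is chosen here simply because the subset is tiny.

```r
# First six rows of df7, copied from Table 4.1
df7_head <- data.frame(
  SalesTotal      = c(19510.64, 35282.00, 11678.13, 457687.63, 32108.12, 13384.36),
  Profit          = c(5786.825, -3427.925, 4008.687, 76381.387, -6527.858, 3511.492),
  Quantity        = c(256, 862, 240, 7667, 693, 281),
  Avg.iterm.price = c(76.21344, 40.93040, 48.65887, 59.69579, 46.33206, 47.63116),
  row.names = c("Alabama", "Arizona", "Arkansas", "California", "Colorado", "Connecticut")
)

z <- scale(df7_head)                  # center and scale each column
d <- dist(z, method = "euclidean")    # pairwise distances between states (rows)
fit <- hclust(d, method = "average")  # average-linkage hierarchical clustering
clusters <- cutree(fit, k = 2)        # cut the tree into two groups
```

Calling plot(fit) draws the dendrogram for this subset. Standardizing first matters because SalesTotal is several orders of magnitude larger than Avg.iterm.price and would otherwise dominate the distances.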
Figure 4.1: Initial Results - Cluster Analysis
[Dendrogram titled "Average Linkage Clustering", produced by hclust (*, "average") on the distance matrix d. Vertical axis: Height, from 0 to 6. Leaves: the 48 contiguous states and the District of Columbia.]
There are many criteria for determining the number of clusters. In my experience, the NbClust::NbClust function is very helpful.
Figure 4.2: Determine the number of clusters
[Bar chart titled "Number of Clusters Chosen by 26 Criteria". Horizontal axis: Number of Clusters (0, 2, 3, 5, 9, 10). Vertical axis: Number of Criteria, from 0 to 8.]
NbClust::NbClust uses 26 different criteria to determine the number of clusters. Based on its result in Figure 4.2, I set the number of clusters to 3.
Figure 4.3: Final Results - Cluster Analysis
[Dendrogram titled "Average Linkage Clustering", subtitle "3 Cluster Solution", produced by hclust (*, "average") on the distance matrix d. Vertical axis: Height, from 0 to 6. Leaves: the 48 contiguous states and the District of Columbia.]
The final result is shown in Figure 4.3. New York and California form cluster 2, Wyoming forms cluster 3, and the remaining states form cluster 1.
Description of Clusters
> aggregate(df7, by = list(clusters), median)
Group.1 SalesTotal Profit Quantity Avg.iterm.price
1 1 20944.270 2116.598 268.5 57.87003
2 2 384281.951 75209.968 5945.5 66.64670
3 3 1603.136 100.196 4.0 400.78400
The average price of items sold in Wyoming is as high as about $400, which makes Wyoming an outlier compared with other states. New York State and California are grouped together because of their outstanding performance in profit. The other states are grouped together because the algorithm "thinks" they are similar to one another. However, I have to point out that this section is only a demo to illustrate my ability in data mining and machine learning. Much more work is needed before drawing serious conclusions.
5 Doubtful Item.Price
Table 5.1: Doubtful Item.Price
Category | Sub_Category | Product_Name | Total_Profit | Max_Item_Price | Min_Item_Price | Range
Furniture | Furnishings | Deflect-o DuraMat Antistatic Studded Beveled Mat for Medium Pile Carpeting | 244.3888 | 10105.34 | 42.136 | 10063.2
Technology | Accessories | Logitech P710e Mobile Speakerphone | 1645.361 | 10257.49 | 205.992 | 10051.5
Furniture | Chairs | DMI Arturo Collection Mission-style Design Wood Chair | 486.1556 | 10105.69 | 105.686 | 10000
Technology | Copiers | Canon PC1060 Personal Laser Copier | 4570.935 | 10559.99 | 559.992 | 10000
Technology | Phones | BlackBerry Q10 | 548.0565 | 10100.79 | 100.792 | 10000
Technology | Phones | RCA ViSYS 25825 Wireless digital phone | 90.993 | 10103.99 | 103.992 | 10000
Office Supplies | Binders | Ibico EPK-21 Electric Binding System | 3345.282 | 1889.99 | 377.998 | 1511.992
Technology | Machines | Cubify CubeX 3D Printer Double Head Print | -8879.97 | 2399.992 | 899.997 | 1499.995
Technology | Copiers | Canon imageCLASS 2200 Advanced Copier | 25199.93 | 3499.99 | 2099.994 | 1399.996
Office Supplies | Binders | GBC DocuBind P400 Electric Binding System | -1878.17 | 1360.99 | 272.198 | 1088.792
Technology | Machines | Lexmark MX611dhe Monochrome Laser Printer | -4589.97 | 1529.991 | 509.997 | 1019.994
Office Supplies | Binders | Fellowes PB500 Electric Punch Plastic Comb Binding Machine with Manual Bind | 7753.039 | 1270.99 | 254.198 | 1016.792
Office Supplies | Binders | Fellowes PB200 Plastic Comb Binding Machine | 693.5592 | 1050.997 | 50.997 | 1000
Office Supplies | Envelopes | Tyvek Top-Opening Peel & Seel Envelopes, Plain White | 225.0504 | 1021.744 | 21.744 | 1000
As mentioned at the end of Section 3, the difference between the maximum and minimum item price is too large for some products. Table 5.1 lists all products that have a doubtful Item.Price. The Range variable equals the difference between Max_Item_Price and Min_Item_Price. It is implausible that the BlackBerry Q10 was sold at $10,100.79 in one transaction and at $100.79 in another.
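A table like Table 5.1 can be built by computing the price range per product and keeping the products whose range is large. The sketch below runs on made-up rows; the cutoff of 1000 is an assumption on my part (it happens to be the smallest Range shown in Table 5.1), and the object names are illustrative.

```r
# Made-up transactions: one product with a suspicious price, one without
toy <- data.frame(
  Product.Name = c("BlackBerry Q10", "BlackBerry Q10", "Stapler", "Stapler"),
  Item.Price   = c(100.792, 10100.792, 5.0, 4.0),
  stringsAsFactors = FALSE
)

# Maximum and minimum item price per product
max_price <- tapply(toy$Item.Price, toy$Product.Name, max)
min_price <- tapply(toy$Item.Price, toy$Product.Name, min)

# Range = Max_Item_Price - Min_Item_Price, as defined in the report
price_range <- data.frame(
  Product_Name   = names(max_price),
  Max_Item_Price = as.numeric(max_price),
  Min_Item_Price = as.numeric(min_price),
  stringsAsFactors = FALSE
)
price_range$Range <- price_range$Max_Item_Price - price_range$Min_Item_Price

# Assumed cutoff: flag products whose range is at least 1000
doubtful <- price_range[price_range$Range >= 1000, ]
```

Sorting doubtful by Range in decreasing order would give the same layout as Table 5.1.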