Walmart sales forecast

Published in: Data & Analytics

  1. (Praxis Business School) Data Mining Assignment. A report on Sales forecasting for Walmart. Submitted to Prof. Suman K Mazumdar in partial fulfillment of the requirements of the subject (iSAS) on 26th September, 2015, by Anurag Mukherjee.
  2. Sales forecasting for Walmart
  3. Table of Contents
       1. Cover Page
       2. Title Page
       3. Executive Summary
       4. Background
       5. Business Problem
       6. Data Overview
       7. Exploratory Analysis
       8. Examining the final features dataset
       9. Merging of train and features for the final data set creation
       10. Model Building
  4. Executive Summary :
     Walmart is the world's largest company by revenue, according to the Fortune Global 500 list in 2014, as well as the biggest private employer in the world, with 2.2 million employees. Walmart is a family-owned business, as the company is controlled by the Walton family. Sam Walton's heirs own over 50 percent of Walmart through their holding company, Walton Enterprises, and through their individual holdings. It is also one of the world's most valuable companies by market value, and is the largest grocery retailer in the U.S. In 2009, it generated 51 percent of its US$258 billion (equivalent to $284 billion in 2015) sales in the U.S. from its grocery business.
     We are provided with datasets containing sales per store, per department, on a weekly basis. We aim to forecast sales for Walmart to help the company take better data-driven decisions for inventory planning and channel optimization.
     Background:
     Wal-Mart Stores, Inc. is an American multinational retail corporation that operates a chain of discount department stores and warehouse stores. Headquartered in Bentonville, Arkansas, United States, the company was founded by Sam Walton in 1962 and incorporated on October 31, 1969. It has over 11,000 stores in 28 countries, under a total of 65 banners. The company operates under the Walmart name in the United States and Canada. It operates as Walmart de México y Centroamérica in Mexico, as Asda in the United Kingdom, as Seiyu in Japan, and as Best Price in India. It has wholly owned operations in Argentina, Brazil, and Canada. It also owns and operates the Sam's Club retail warehouses.
     Business Problem:
     We are given historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and the aim is to project the sales for each department in each store. To add to the challenge, selected holiday markdown events are included in the dataset. These markdowns are known to affect sales.
     Data Overview :
  5. train.csv
     This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:
       • Store - the store number
       • Dept - the department number
       • Date - the week
       • Weekly_Sales - sales for the given department in the given store
       • IsHoliday - whether the week is a special holiday week
     features.csv
     This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:
       • Store - the store number
       • Date - the week
       • Temperature - average temperature in the region
       • Fuel_Price - cost of fuel in the region
       • MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
       • CPI - the consumer price index
       • Unemployment - the unemployment rate
       • IsHoliday - whether the week is a special holiday week
  6. Exploratory Analysis :
     1. train.csv
     1.1 Importing the raw dataset :
         proc import out=walmart_train
             datafile='/folders/myshortcuts/myfolder/train_walmart.csv'
             dbms=csv replace;
             getnames=yes;
         run;
     1.2 Checking the contents of train.csv :
         proc contents data=walmart_train;
         run;
     Alphabetic List of Variables and Attributes
         #  Variable      Type  Len  Format     Informat
         3  Date          Num   8    DDMMYY10.  DDMMYY10.
         2  Dept          Num   8    BEST12.    BEST32.
         6  IsHoliday     Char  5    $5.        $5.
         4  Month_Year    Num   8    DATETIME.  ANYDTDTM40.
         1  Store         Num   8    BEST12.    BEST32.
         5  Weekly_Sales  Num   8    BEST12.    BEST32.
     1.3 Checking the basic statistical measures
  7.     proc means data=walmart_train;
             var Weekly_Sales;
         run;
     Analysis Variable : Weekly_Sales
         N       Mean      Std Dev   Minimum   Maximum
         421570  15981.26  22711.18  -4988.94  693099.36
     Negative sales indicate returns.
     1.4 Plot of Weekly_Sales vs Date : [figure not included in the transcript]
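The remark that negative sales indicate returns can be checked against the data before modelling. The following is a minimal sketch, assuming the `walmart_train` dataset imported above; the column aliases `n_negative` and `total_negative` are illustrative names and do not appear in the original report.

```sas
/* Sketch: how many department-week rows have negative sales,   */
/* and how much they sum to. Aliases are hypothetical names.    */
proc sql;
  select count(*)          as n_negative,
         sum(Weekly_Sales) as total_negative format=comma18.2
  from walmart_train
  where Weekly_Sales < 0;
quit;
```

If the count is small relative to the 421570 rows reported by proc means, leaving these rows in place is unlikely to change the aggregate weekly figures much.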
  8. 1.5 Plotting Sales Year Wise :
         proc sql;
             create table walmart_train_data as
             select Date, sum(Weekly_Sales) as Sales
             from walmart_train
             group by Date;
         quit;
     2010 Sales Report :
         data Sales_2010;
             set walmart_train_data(keep=Sales Date
                 where=(Date between '05Feb2010'd and '31Dec2010'd));
         run;
         * plotting 2010 Sales by Date;
         ods graphics / reset imagemap;
         proc sgplot data=WORK.SALES_2010;
             vbar Date / response=Sales stat=Mean name='Bar';
             yaxis grid;
         run;
         ods graphics / reset;
         proc print data=Sales_2010;
         run;
  9.     Obs  Date        Sales
         1    05/02/2010  49750740.50
         2    12/02/2010  48336677.63
         3    19/02/2010  48276993.78
         4    26/02/2010  43968571.13
         5    05/03/2010  46871470.30
         6    12/03/2010  45925396.51
         7    19/03/2010  44988974.64
         8    26/03/2010  44133961.05
     (First 8 sales figures for 2010, for convenience)
     [Bar chart: Sales (Mean) by Date, weekly from 05/02/2010 to 31/12/2010; y-axis from 0 to 80000000]
  10. 2011 Sales Report :
         data Sales_2011;
             set walmart_train_data(keep=Sales Date
                 where=(Date between '07Jan2011'd and '30Dec2011'd));
         run;
         * plotting 2011 Sales by Date;
         ods graphics / reset imagemap;
         proc sgplot data=WORK.SALES_2011;
             vbar Date / response=Sales stat=Mean name='Bar';
             yaxis grid;
         run;
         ods graphics / reset;
         proc print data=Sales_2011;
         run;
     [Bar chart: Sales (Mean) by Date, weekly from 07/01/2011 to 30/12/2011; y-axis from 0 to 80000000]
  11. Sales in tabular form - 2011 :
         Obs  Date        Sales
         1    07/01/2011  42775787.77
         2    14/01/2011  40673678.04
         3    21/01/2011  40654648.03
         4    28/01/2011  39599852.99
         5    04/02/2011  46153111.12
         6    11/02/2011  47336192.79
         7    18/02/2011  48716164.12
     (First 7 sales figures for 2011, for convenience)
  12. 2012 Sales Report :
         data Sales_2012;
             set walmart_train_data(keep=Sales Date
                 where=(Date between '06Jan2012'd and '26Oct2012'd));
         run;
         * plotting 2012 Sales by Date;
         ods graphics / reset imagemap;
         proc sgplot data=WORK.SALES_2012;
             vbar Date / response=Sales stat=Mean name='Bar';
             yaxis grid;
         run;
         ods graphics / reset;
         proc print data=Sales_2012;
         run;
     [Bar chart: Sales (Mean) by Date, weekly from 06/01/2012 to 26/10/2012; y-axis from 0 to 50000000]
  13. Sales in tabular form - 2012 :
         Obs  Date        Sales
         1    06/01/2012  44955421.95
         2    13/01/2012  42023078.48
         3    20/01/2012  42080996.56
         4    27/01/2012  39834974.67
         5    03/02/2012  46085608.09
         6    10/02/2012  50009407.92
         7    17/02/2012  50197056.96
         8    24/02/2012  45771506.57
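The three year-wise data steps differ only in their date ranges, so the same comparison can be sketched in a single pass by deriving a year column. This is an illustrative alternative rather than the report's method; it assumes `walmart_train_data` (one row per Date with the summed Sales) exists as created earlier, and the names `sales_by_year` and `Year` are hypothetical.

```sas
/* Sketch: tag each weekly total with its calendar year. */
data sales_by_year;
  set walmart_train_data;
  Year = year(Date);   /* Date is a numeric SAS date */
run;

/* Weekly-total summary per year: count of weeks, mean, min, max. */
proc means data=sales_by_year n mean min max;
  class Year;
  var Sales;
run;
```

The per-year maximum makes the December spike easy to spot next to the mean weekly total for the same year.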
  14. 1.6 Outlier Treatment for train.csv :
     The data, being a time series, show some seasonality. There is a sales spike during the month of December. This can be explained further by the markdowns: MarkDown 1, 2, 4 and 5 do not seem to be as effective as MarkDown 3.
  15. [Figure only; no text on this slide]
  16. As the spike in sales would affect the entire model, the excess sales have been distributed across all the records.
         data wal;
             set walmart_train_data;
             where Sales > 50000000;
             sales_diff = Sales - 46243899.58;
         run;
         proc sql;
             create table mapper as
             select sum(sales_diff) from wal;
         quit;
         * total excess sales from weeks having > 50000000 = 181638262.18;
         data walmart_final;
             set walmart_train;
             * note: walmart_train is at store-department level (maximum Weekly_Sales is 693099.36), so this condition matches no rows as written;
             if Weekly_Sales > 50000000 then Weekly_Sales = 46243899.58;
             Weekly_Sales_new = Weekly_Sales + (181638262.18/421570);
         run;
         proc univariate data=walmart_final;
             var Weekly_Sales;
         run;
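The size of the per-record adjustment follows from the two constants in the data step above: the total excess (181638262.18) is divided by the number of records (421570, which matches the N reported by proc means for `walmart_train`). As a worked step:

```latex
\[
\text{Weekly\_Sales\_new}
  = \text{Weekly\_Sales} + \frac{181638262.18}{421570}
  \approx \text{Weekly\_Sales} + 430.86
\]
```

So every store-department-week record is shifted upward by roughly 430.86, regardless of the size of its original sales figure.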
  17. 2. features.csv
     2.1 Importing raw data set :
         proc import out=walmart_features
             datafile='/folders/myshortcuts/myfolder/features.csv'
             dbms=csv replace;
             getnames=yes;
             guessingrows=200;
         run;
     2.2 Checking the contents of features.csv :
     Alphabetic List of Variables and Attributes
         #   Variable      Type  Len  Format     Informat
         4   CPI           Char  11   $11.       $11.
         2   Date          Num   8    YYMMDD10.  YYMMDD10.
         6   Fuel_Price    Num   8    BEST12.    BEST32.
         13  IsHoliday     Char  5    $5.        $5.
         7   MarkDown1     Char  8    $8.        $8.
         8   MarkDown2     Char  8    $8.        $8.
         9   MarkDown3     Char  8    $8.        $8.
         10  MarkDown4     Char  8    $8.        $8.
         11  MarkDown5     Char  8    $8.        $8.
         1   Store         Num   8    BEST12.    BEST32.
         5   Temperature   Num   8    BEST12.    BEST32.
         12  Unemployment  Char  5    $5.        $5.
         14  VAR14         Char  1    $1.        $1.
         3   Weekly_Sales  Char  8    $8.        $8.
  18. 2.3 Checking the basic statistical measures of features.csv :
         proc means data=walmart_features;
         run;
     2.4 Outlier Treatment :
         data walmart_f;
             set walmart_features;
             format Date DDMMYY10.;
             * the MarkDown and Weekly_Sales columns are character, so missing markers are replaced with the string "0";
             if MarkDown1="NA" or MarkDown1="#N/A" then MarkDown1="0";
             if MarkDown2="NA" or MarkDown2="#N/A" then MarkDown2="0";
             if MarkDown3="NA" or MarkDown3="#N/A" then MarkDown3="0";
             if MarkDown4="NA" or MarkDown4="#N/A" then MarkDown4="0";
             if MarkDown5="NA" or MarkDown5="#N/A" then MarkDown5="0";
             if IsHoliday="TRUE" then IsHoliday_Yes=1;
             else IsHoliday_Yes=0;
             if Weekly_Sales="#N/A" then Weekly_Sales="0";
         run;
  19.     data walmart_features_1(keep=Store Date Weekly_Sales_n Fuel_Price IsHoliday_Yes
                 MarkDown1_n MarkDown2_n MarkDown3_n MarkDown4_n MarkDown5_n
                 Temperature Unemployment CPI);
             set walmart_f;
             * multiplying by 1 converts the character values to numeric;
             MarkDown1_n = MarkDown1*1;
             MarkDown2_n = MarkDown2*1;
             MarkDown3_n = MarkDown3*1;
             MarkDown4_n = MarkDown4*1;
             MarkDown5_n = MarkDown5*1;
             Weekly_Sales_n = Weekly_Sales*1;
         run;
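The clean-up above relies on implicit character-to-numeric conversion (`*1`) after replacing the missing markers by hand. A more explicit variant can be sketched with the INPUT function, whose `??` modifier suppresses conversion errors and returns a missing value for strings such as "NA". This is an alternative sketch, not the report's code: the dataset name `walmart_features_chk` and the variables `CPI_n` and `Unemployment_n` are hypothetical, and replacing missing markdowns with 0 simply mirrors the choice made in the report.

```sas
/* Sketch: explicit char-to-numeric conversion for features.csv. */
data walmart_features_chk;
  set walmart_features;

  array md_chr {5} $ MarkDown1-MarkDown5;
  array md_num {5} MarkDown1_n MarkDown2_n MarkDown3_n
                   MarkDown4_n MarkDown5_n;

  do i = 1 to 5;
    /* "NA" / "#N/A" fail to parse, become missing, then 0 */
    md_num{i} = coalesce(input(md_chr{i}, ?? best32.), 0);
  end;

  /* CPI and Unemployment are also read as character; these   */
  /* numeric copies keep unparseable values as missing.       */
  CPI_n          = input(CPI, ?? best32.);
  Unemployment_n = input(Unemployment, ?? best32.);

  /* 1 when the week is flagged as a holiday week, else 0 */
  IsHoliday_Yes = (upcase(IsHoliday) = "TRUE");

  drop i;
run;
```

Keeping numeric copies of CPI and Unemployment would make them usable as regressors later; in the report's own datasets both remain character variables.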
  20. Examining the final features dataset :
         proc contents data=walmart_features_1;
         run;
     Alphabetic List of Variables and Attributes
         #   Variable        Type  Len  Format     Informat
         3   CPI             Char  11   $11.       $11.
         2   Date            Num   8    DDMMYY10.  YYMMDD10.
         5   Fuel_Price      Num   8    BEST12.    BEST32.
         7   IsHoliday_Yes   Num   8
         8   MarkDown1_n     Num   8
         9   MarkDown2_n     Num   8
         10  MarkDown3_n     Num   8
         11  MarkDown4_n     Num   8
         12  MarkDown5_n     Num   8
         1   Store           Num   8    BEST12.    BEST32.
         4   Temperature     Num   8    BEST12.    BEST32.
         6   Unemployment    Char  5    $5.        $5.
         13  Weekly_Sales_n  Num   8
  21. Merging of train and features for the final data set creation :
         proc sql;
             create table walmart_final_1 as
             select a.*, b.CPI, b.Temperature, b.Fuel_Price,
                    b.MarkDown1_n, b.MarkDown2_n, b.MarkDown3_n,
                    b.MarkDown4_n, b.MarkDown5_n,
                    b.Unemployment, b.IsHoliday_Yes
             from walmart_final as a
             left join walmart_features_1 as b
                 on a.Date=b.Date and a.Store=b.Store;
         quit;
         data walmart_final_2 (drop=IsHoliday Month_Year Unemployment Weekly_Sales);
             set walmart_final_1;
         run;
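Because the merge is a left join on Date and Store, two things are worth checking afterwards: that the row count is unchanged (no duplicate Store/Date keys in the features table) and that every training row found a match. A minimal sketch follows, assuming the dataset names used above and assuming that Fuel_Price is never missing in `walmart_features_1`, so a missing value after the join signals an unmatched row; the aliases are illustrative.

```sas
/* Sketch: sanity checks on the left join. */
proc sql;
  /* rows before and after the merge should be equal */
  select count(*) as n_train  from walmart_final;
  select count(*) as n_merged from walmart_final_1;

  /* training rows with no matching Store/Date in features */
  select count(*) as n_unmatched
  from walmart_final_1
  where Fuel_Price is missing;
quit;
```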
  22.     proc contents data=walmart_final_2;
         run;
     Alphabetic List of Variables and Attributes
         #   Variable          Type  Len  Format     Informat
         5   CPI               Char  11   $11.       $11.
         3   Date              Num   8    DDMMYY10.  DDMMYY10.
         2   Dept              Num   8    BEST12.    BEST32.
         7   Fuel_Price        Num   8    BEST12.    BEST32.
         13  IsHoliday_Yes     Num   8
         8   MarkDown1_n       Num   8
         9   MarkDown2_n       Num   8
         10  MarkDown3_n       Num   8
         11  MarkDown4_n       Num   8
         12  MarkDown5_n       Num   8
         1   Store             Num   8    BEST12.    BEST32.
         6   Temperature       Num   8    BEST12.    BEST32.
         4   Weekly_Sales_new  Num   8
  23. Printing the final dataset after merge :
         proc print data=walmart_final_2(obs=10);
         run;
     (MarkDown1_n to MarkDown5_n are abbreviated as MD1_n to MD5_n in the column headers.)
         Obs  Store  Dept  Date        Weekly_Sales_new  CPI          Temperature  Fuel_Price  MD1_n  MD2_n  MD3_n  MD4_n  MD5_n  IsHoliday_Yes
         1    1      45    05/02/2010  468.30            211.0963582  42.31        2.572       0      0      0      0      0      0
         2    1      5     05/02/2010  32660.24          211.0963582  42.31        2.572       0      0      0      0      0      0
         3    1      9     05/02/2010  17361.85          211.0963582  42.31        2.572       0      0      0      0      0      0
         4    1      29    05/02/2010  7455.81           211.0963582  42.31        2.572       0      0      0      0      0      0
         5    1      92    05/02/2010  140315.80         211.0963582  42.31        2.572       0      0      0      0      0      0
         6    1      42    05/02/2010  8797.57           211.0963582  42.31        2.572       0      0      0      0      0      0
         7    1      80    05/02/2010  16125.03          211.0963582  42.31        2.572       0      0      0      0      0      0
         8    1      19    05/02/2010  2377.91           211.0963582  42.31        2.572       0      0      0      0      0      0
         9    1      32    05/02/2010  12306.70          211.0963582  42.31        2.572       0      0      0      0      0      0
         10   1      40    05/02/2010  67211.49          211.0963582  42.31        2.572       0      0      0      0      0      0
  24. Model Building :
         proc reg data=walmart_final_2;
             model Weekly_Sales_new = Fuel_Price MarkDown3_n Temperature;
         run;
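The model statement asks proc reg for an ordinary least squares fit of the adjusted weekly sales on three regressors. Written out, with the coefficient symbols and the error term as notation assumed here (the report does not name them), the fitted model has the form:

```latex
\[
\text{Weekly\_Sales\_new}_i
  = \beta_0
  + \beta_1\,\text{Fuel\_Price}_i
  + \beta_2\,\text{MarkDown3\_n}_i
  + \beta_3\,\text{Temperature}_i
  + \varepsilon_i
\]
```

Here $i$ indexes one store-department-week record of `walmart_final_2`, and proc reg estimates $\beta_0,\dots,\beta_3$ by minimizing the sum of squared residuals.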
