Datascience - bigmart data analysis

Bigmart Sale Prediction
IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA020, 25-06-2019

Problem Statement

Data Exploration

Commands
Output
Insights
• Item_Visibility contains values of 0.000, which are meaningless for an item on display
• Item_Identifier is a string with specific code
• Outlet_size contains NaN values
head

Commands
Output
Insights
» 12 features : Numeric – 5 , Categorical - 7
» Total no of entries: 8523
» Memory: ~ 800KB
» Outlet_Size has null values (seen in the head output on the previous slide), although every field is expected to be non-null
Info

Commands
Output
Insights » Outlet_Establishment_Year ranges from 1985 to 2009
» Item_Visibility has a minimum value of 0.00
» Item_Weight has a count below 8523, so it has missing values
Describe

Commands
Output
Insights » No. of duplicates: 6964 (repeated Item_Identifier values).
Possible reason: the same product can exist in multiple stores
Duplicates

Commands
Output
Insights » Item_Weight has 1463 missing values.
» Outlet_Size has 2410 missing values
Missing Values
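The commands behind the duplicate and missing-value counts are not preserved in this transcript. A minimal sketch of how such counts could be obtained with pandas is shown here; the toy frame and its values are made up for illustration and are not the Bigmart data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (values are made up for illustration).
train = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDA15", "NCD19"],
    "Item_Weight": [9.3, 5.92, np.nan, 8.93],
    "Outlet_Identifier": ["OUT049", "OUT018", "OUT010", "OUT013"],
    "Outlet_Size": ["Medium", "Medium", np.nan, "High"],
})

# Rows whose Item_Identifier already appeared earlier in the frame.
n_duplicate_items = int(train["Item_Identifier"].duplicated().sum())

# Number of missing values in each column.
missing_per_column = train.isnull().sum()

print(n_duplicate_items)
print(missing_per_column)
```

Counting duplicates on Item_Identifier alone is an assumption here; it matches the stated reason that one product can appear in several stores.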

Univariate Analysis

Commands
Output
Insights
» 16 different item types.
» The item types could be grouped into fewer than 16 categories
Item_Type

Commands
Output
Insights
» Regular is written in multiple ways: Regular, reg
» Low Fat is written as Low Fat, low fat & LF
» Replace the 5 labels with 2: Regular & Low Fat
Item_Fat_Content

Commands
Output
Insights
» Medium & Small size outlets are the most numerous
» High size outlets are the fewest
Outlet_Size

Commands
Output
Insights
» Bigmart has more outlets in Tier 2 & Tier 3 cities than in Tier 1 cities
Outlet_Location_Type

Commands
Output
Insights
» Supermarket Type1 is the most common outlet type.
The other 3 types occur about equally often
Outlet_Type

Commands
Output
Insights
» Item_Visibility has the lowest correlation with the target variable
» Item_MRP has a strong positive correlation with the target variable
Heatmap
Numerical variables
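The heatmap itself is not preserved in this transcript. As a rough sketch, the correlations behind such a heatmap can be computed with DataFrame.corr(); the numbers below are invented toy values, chosen only so that the example runs.

```python
import pandas as pd

# Toy numeric columns (made-up values, not the Bigmart data).
df = pd.DataFrame({
    "Item_MRP": [50.0, 100.0, 150.0, 200.0, 250.0],
    "Item_Visibility": [0.05, 0.02, 0.07, 0.03, 0.06],
    "Item_Outlet_Sales": [700.0, 1500.0, 2100.0, 3200.0, 3600.0],
})

# Pairwise Pearson correlations between the numeric columns.
corr = df.corr()

# Correlation of each feature with the target, excluding the target itself.
target_corr = corr["Item_Outlet_Sales"].drop("Item_Outlet_Sales")
print(target_corr.sort_values(ascending=False))

# If seaborn is installed, seaborn.heatmap(corr, annot=True) draws the matrix.
```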

Individual feature vs Target

Commands
Output
Insights
» Item_Weight has a low correlation with the target Item_Outlet_Sales
Item_Weight vs Item_Outlet_Sales

Commands
Output
Insights
» Highly visible items have lower sales
(Possible reason: daily groceries sell in high volume and do not need high
visibility, while expensive cosmetics may be placed in visible positions
but usually sell less.)
» Many products lie on the x-axis, i.e. their recorded visibility is zero
» The distribution is skewed towards low-visibility items
Item_Visibility vs Item_Outlet_Sales

Commands
Output
Insights
» No visible relation between year of establishment and outlet sales.
» Sales are lower only for 1998 (possible reason: fewer stores opened in
that year; no data is provided on the number of stores opened each year)
Outlet_Establishment_Year vs Item_Outlet_Sales

Commands
Output
Insights
» Low Fat product sales > Regular fat sales.
Item_Fat_Content vs Item_Outlet_Sales

Commands
Output
Insights
» Out of 10 stores:
2 – Grocery Store, 6 – Supermarket Type1,
1 – Supermarket Type2, 1 – Supermarket Type3
Outlet_Type vs Outlet_Identifier

Commands
Output
Insights
» The Medium-size Supermarket Type3 outlet has higher sales than the others
Outlet_Type vs Item_Outlet_Sales

Commands
Output
Insights
» Grocery stores “OUT010” & “OUT019” have the lowest sales, followed by
“OUT018”. This is expected, based on the previous 2 slides
Outlet_Identifier vs Item_Outlet_Sales

Commands
Output
Insights
» Medium size outlets have higher sales than High and Small size outlets
Outlet_Size vs Item_Outlet_Sales

Commands
Output
Insights
» Sales of Tier2 > Sales of Tier 3 > Sales of Tier1
Outlet_Location_Type vs Item_Outlet_Sales

» Sales of Tier2 > Sales of Tier 3 > Sales of Tier1
» Item_Visibility does not have a high positive correlation with sales.
» Item_Visibility has items with the value zero
» Item_Type does not influence Item_Outlet_Sales much.
» Item_Weight and Outlet_Size contain NaN values.
» Item_Fat_Content has the value “Low Fat” written in several different ways.
» Outlet_Establishment_Year values vary from 1985 to 2009. Using this value directly does not make sense.
» Tier 2 & Tier 3 cities have better sales than Tier 1 cities
» Many data cleaning steps are needed, so it is better to combine the train and test datasets and clean them once.
Insights - Summary

Data Cleaning

Code
» Combine the train and test datasets
Reason: the data contains many missing values and categorical values, so cleaning both datasets together avoids duplicate effort
Combine train & test dataset
Avoid re-work of cleaning the test dataset
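The original code for this step is not preserved in the transcript. One common way to combine the two datasets is sketched here, assuming a marker column named source (the later slides mention removing source again); the rows are made-up toy values.

```python
import pandas as pd

# Toy train and test frames (made-up values); test has no target column.
train = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01"],
    "Item_Outlet_Sales": [3735.14, 443.42],
})
test = pd.DataFrame({"Item_Identifier": ["FDW58"]})

# Tag each row with its origin so the frames can be separated again later.
train["source"] = "train"
test["source"] = "test"

# Stack the rows; columns missing from test (the target) become NaN.
data = pd.concat([train, test], ignore_index=True)
print(data)
```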

Commands
Output
Insights » Missing values in Item_weight are replaced
Missing Values
Replace NaN in Item_Weight

Commands
Output
Insights » Replaced missing values with the mode. There are only a few outlet sizes, so it makes sense
to fill the missing entries with the most common outlet size
Missing Values
Replace Outlet_Size
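The two missing-value steps above can be sketched as follows. The slides state that Outlet_Size is filled with the mode; they do not say which statistic was used for Item_Weight, so the mean used here is an assumption. The toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (made-up values).
data = pd.DataFrame({
    "Item_Weight": [10.0, np.nan, 14.0],
    "Outlet_Size": ["Medium", np.nan, "Medium"],
})

# Assumption: fill Item_Weight with the column mean (method not named in the slides).
data["Item_Weight"] = data["Item_Weight"].fillna(data["Item_Weight"].mean())

# Fill Outlet_Size with the most frequent category (the mode).
data["Outlet_Size"] = data["Outlet_Size"].fillna(data["Outlet_Size"].mode()[0])
print(data)
```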

Commands
Output
Insights » Item_Visibility can’t be zero – replace with mean
Item_Visibility
Replace zero values – zero makes no sense for this field
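A minimal sketch of the zero replacement, assuming the mean is taken over the non-zero values only (the slides say "mean" without specifying which rows it is computed on). The toy values are made up.

```python
import pandas as pd

# Toy visibility column with two zero entries (made-up values).
data = pd.DataFrame({"Item_Visibility": [0.0, 0.02, 0.04, 0.0]})

# Mark the zero entries and compute the mean of the remaining rows.
is_zero = data["Item_Visibility"] == 0
nonzero_mean = data.loc[~is_zero, "Item_Visibility"].mean()

# Overwrite the zeros with that mean.
data.loc[is_zero, "Item_Visibility"] = nonzero_mean
print(data)
```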

Feature Engineering

Commands
Output
Insights
» Item_Type has 16 categories, too many to be useful; transform them into 3
broad categories.
Item_Type
Transform 16 item types to 3
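The slides do not show how the 16 types were mapped to 3. One possible approach, sketched here as an assumption, uses the first two characters of Item_Identifier (FD, DR, NC). The column name Item_Type_Combined and the category labels are hypothetical.

```python
import pandas as pd

# Toy identifiers (made-up rows following the FD/DR/NC prefix pattern).
data = pd.DataFrame({"Item_Identifier": ["FDA15", "DRC01", "NCD19"]})

# Assumed mapping from identifier prefix to a broad category.
prefix_to_category = {"FD": "Food", "DR": "Drinks", "NC": "Non-Consumable"}

# Take the first two characters of the code and look up the category.
data["Item_Type_Combined"] = data["Item_Identifier"].str[:2].map(prefix_to_category)
print(data)
```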

Commands
Output
Insights » Item_Type includes non-consumable items that are still labelled with a fat content;
these need to be segregated as non-edible
Item_Type – Transform Non-consumables as Non-edible
Transform 2 fat-content types to 3
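A minimal sketch of this step, assuming non-consumable items can be recognised by an NC prefix in Item_Identifier (an assumption; the slides do not show the selection rule). The label Non-Edible and the toy rows are illustrative.

```python
import pandas as pd

# Toy rows (made-up values): the NC item carries a fat label that makes no sense.
data = pd.DataFrame({
    "Item_Identifier": ["FDA15", "NCD19", "DRC01"],
    "Item_Fat_Content": ["Low Fat", "Low Fat", "Regular"],
})

# Assumed rule: identifiers starting with "NC" are non-consumable items.
is_non_consumable = data["Item_Identifier"].str.startswith("NC")

# Give those rows a third fat-content category.
data.loc[is_non_consumable, "Item_Fat_Content"] = "Non-Edible"
print(data)
```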

Commands
Output
Insights
» Item_Fat_Content has 5 labels for what are really 2 categories; map the
variants (LF, low fat, reg) to Low Fat & Regular.
Item_Fat_Content
Fix the Spelling mistakes – 5 categories to 2
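The label clean-up can be sketched with Series.replace, using the five spellings listed in the univariate analysis. The series below is a toy example with one row per spelling.

```python
import pandas as pd

# One row per spelling found in the data.
fat = pd.Series(["Low Fat", "LF", "low fat", "Regular", "reg"], name="Item_Fat_Content")

# Map the variant spellings onto the two canonical labels.
fat_clean = fat.replace({"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"})

counts = fat_clean.value_counts()
print(counts)
```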

Commands
Output
Insights
» Comparing stores by raw year of establishment makes little sense. Transforming
it into the number of years in existence relates more directly to outlet sales.
» The latest year of establishment is 2009 and the sales data is for 2013, so
subtract each year from 2013 to get the number of years of operation.
Outlet_Establishment_Years - Years of Operation of a store
Change the year to no of years in existence
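The transformation is a single subtraction. In this sketch the new column name Outlet_Years is hypothetical, and the three years are sample values within the 1985 to 2009 range.

```python
import pandas as pd

# Sample establishment years within the range reported in the data.
data = pd.DataFrame({"Outlet_Establishment_Year": [1985, 1999, 2009]})

# Reference year used on the slide.
REFERENCE_YEAR = 2013

# Years of operation = reference year minus year of establishment.
data["Outlet_Years"] = REFERENCE_YEAR - data["Outlet_Establishment_Year"]
print(data)
```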

Commands
Output
Insights
» The scikit-learn models used here accept only numerical inputs, so convert all categorical
fields into numbers.
» Plain integer codes imply an ordering (one category being "greater" than another). Create dummy
variables to avoid this (data transformed from integer codes as in table 1 to dummies as in table 2)
Categorical Variables transformation
Transform categorical variables into numericals
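Dummy variables can be created with pandas.get_dummies. This is a sketch on a toy frame with made-up values; which columns were encoded in the original notebook is not shown in the transcript.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (made-up values).
data = pd.DataFrame({
    "Outlet_Size": ["Small", "Medium", "High", "Medium"],
    "Item_MRP": [48.3, 249.8, 141.6, 182.1],
})

# One indicator column per category; the numeric column is left unchanged.
encoded = pd.get_dummies(data, columns=["Outlet_Size"])
print(encoded)
```

Each row has exactly one indicator set, so no category is treated as larger than another.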

Commands
Output
Insights
» Fields Item_Type & Outlet_Establishment_Year are dropped because they have
already been transformed into other variables in the previous slides
(Item_Type is also of object type).
Drop
Fields with object data type
» Fields Item_Type & Outlet_Establishment_Year are dropped

Model Building

Code
» Segregate the data into train and test dataset for model prediction
Separate train & test dataset
Segregate the combined data into train & test
» Remove Item_Outlet_sales from test data and the source.
» Remove source from train data.
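The separation described above can be sketched as follows, assuming the combined frame carries the source marker column. The toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy combined frame (made-up values); the test row has no target value.
data = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 107.9],
    "Item_Outlet_Sales": [3735.1, 443.4, np.nan],
    "source": ["train", "train", "test"],
})

# Train keeps the target but drops the marker column.
train = data.loc[data["source"] == "train"].drop(columns=["source"])

# Test drops both the marker and the (empty) target column.
test = data.loc[data["source"] == "test"].drop(columns=["source", "Item_Outlet_Sales"])

print(train)
print(test)
```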

Commands
Output
Insights » Accuracy of Linear regression model : 56.35
Linear Regression

Commands
Output
Insights » Accuracy of Decision Tree model : 61.45
Decision Tree

Commands
Output
Insights » Accuracy of RandomForest model : 60.81
RandomForest
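The modelling code is not preserved in this transcript, and the slides do not say how "accuracy" was computed. The sketch below fits the three model types with scikit-learn on synthetic data and reports R² on held-out rows as a percentage; using R², the hyperparameters, and the synthetic data are all assumptions, so the printed numbers are unrelated to the scores on the slides.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (not the Bigmart data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 3.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + rng.normal(0, 0.1, size=300)

# Hold out a quarter of the rows for evaluation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_val, y_val)  # R^2 on the held-out rows
    print(f"{name}: {100 * scores[name]:.2f}")
```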

Recommendations

Insights
» Removed Outlet_Type: accuracy came down from 56.35 to 34.42
Outlet sales are strongly affected by the type of outlet
» Removed Item_MRP: accuracy came down from 56.35 to 24.07
Outlet sales are strongly affected by the MRP of the product
Further analysis & Recommendations
Key Factors
» Outlet_Type and Item_MRP are the key factors affecting outlet sales.
» Of the three models tried, the Decision Tree is the most accurate (61.45)
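The feature-removal comparison above can be sketched as a simple ablation: refit the model without one column and see how far the score drops. The data here is synthetic, with a target that depends on Item_MRP only, and the score is in-sample R²; both are assumptions made so the example is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features (not the Bigmart data).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Item_MRP": rng.uniform(30, 270, n),
    "Item_Weight": rng.uniform(5, 20, n),
})

# Synthetic target: driven by Item_MRP only, plus noise.
y = 15.0 * X["Item_MRP"] + rng.normal(0, 300, n)

def r2_without(dropped):
    """In-sample R^2 of a linear model fitted without one column."""
    cols = [c for c in X.columns if c != dropped]
    return LinearRegression().fit(X[cols], y).score(X[cols], y)

# Score with all features, then the drop caused by removing each one.
baseline = LinearRegression().fit(X, y).score(X, y)
drop_in_score = {col: baseline - r2_without(col) for col in X.columns}
print(baseline, drop_in_score)
```

A large drop for a column suggests the model relies on it, which is the reasoning applied to Outlet_Type and Item_MRP on the slide.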
