Project’s Primary Goals
1. To analyse past sales data and generate insights into which mobile phone features drive sales.
2. To use these insights to plan inventory efficiently for the next 6 months.
Data Description
1. The dataset consists of sales and product-related features.
2. The dataset describes the top 5 most popular mobile brands.
3. The dataset consists of 418 rows (instances) and 16 columns (features).
Strategies Deployed for Modelling
1. Check for missing values in the dataset and treat them with suitable methods.
2. Look for outliers and take suitable steps to treat them.
3. Check for multicollinearity amongst variables and take suitable steps to treat highly correlated variables.
4. Build a Linear Regression Model to predict the sales of mobile phones.
5. Report on the metrics of the model.
6. Identify the significant variables, then rebuild and report on the model using these variables only.
7. Based on the final model outcomes, determine the features driving mobile phone sales.
8. List recommendations to help with inventory planning for the next 6 months.
Author: Anthony Mok
Date: 16 Nov 2023
Email: xxiaohao@yahoo.com
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6 Months
1. Using Insight-informed Data to Plan Inventory in Next 6 Months
An Application of Linear Regression Modelling
2. Agenda
1. Modern-day Data Analytics
2. Linear Regression Analysis
3. Code-less, Less-code Analytical Tools
4. Project’s Primary Goals
5. Description of Dataset
6. Modelling Strategies
7. Findings, Conclusions & Recommendations
3. Modern-day Data Analytics
Both traditional and modern-day data analytics extract insights from information, but they differ significantly in their methods and capabilities. The significant differences are:
Feature | Traditional Data Analytics | Modern-Day Data Analytics
Data type | Mostly structured | Diverse (structured, semi-structured, unstructured)
Technology | On-premises | Cloud-based
Processing | Batch-oriented | Real-time or near-real-time
Analysis | Descriptive, diagnostic | Predictive, prescriptive
Accessibility | Limited to data analysts | Aims for data democratisation
4. Linear Regression Analysis
Linear regression is a statistical method that models the relationship between a dependent variable and one or more independent variables as a straight line (or, with several independent variables, as a linear combination of them).
It is used to understand trends, make predictions, and test hypotheses.
This analysis is suitable when the data exhibit a linear relationship and assumptions such as normally distributed residuals and constant variance hold.
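As a minimal sketch of the idea, assuming a single independent variable and made-up data points (the names `fit_line`, `x` and `y` are illustrative and not part of the project), the least-squares line can be computed directly:

```python
def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel in the ratio).
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data only, not taken from the project dataset.
x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

slope, intercept = fit_line(x, y)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

With several independent variables the same least-squares principle applies, but the coefficients are solved jointly rather than with this one-variable formula.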
5. Code-less & Less-code Data Analytics
Code-less Applications
Offer drag-and-drop interfaces, pre-built connectors, and automated workflows, making data analysis accessible to everyone, even without technical expertise
Less-code Applications
Require some coding knowledge; less-code platforms provide pre-written code snippets, wizards, and visual tools to streamline complex tasks
6. KNIME - Less-code Data Analytics
• KNIME is a less-code data analytics platform
• Build visual workflows with pre-built nodes for data preparation, analysis, and visualisation
• No coding is required, but Python integration allows customisation
• Unique selling points: open-source, free, and powerful. It handles diverse data, builds predictive models, and deploys insights
• This project was carried out using KNIME
7. Project’s Primary Goals
To analyse past sales data and generate insights into which mobile phone features drive sales
To use these insights to plan inventory efficiently for the next 6 months
8. Data Description
• The dataset consists of sales and product-related features
• The dataset describes the top 5 most popular mobile brands
• The dataset consists of 418 rows and 16 columns
Data Dictionary
• A sample data dictionary* is given below:
* More details are found in the project report, which are not released at the request of the Social Enterprise
9. Strategies for Modelling
• Check for missing values in the dataset and treat them with suitable methods
• Look for outliers and take suitable steps to treat them
• Check for multicollinearity amongst variables and take suitable steps to treat highly correlated variables
• Build a Linear Regression Model to predict the sales of mobile phones
• Report on the metrics of the model
• Identify the significant variables, then rebuild and report on the model using these variables only
• Based on the final model outcomes, determine the features driving mobile phone sales
• List recommendations to help with inventory planning for the next 6 months
10. Check for Missing Values in Dataset By:
• A KNIME workflow was created
• The ‘CSV Reader’ and ‘Data Explorer’ nodes were dragged and dropped onto the KNIME workspace to ingest and explore the variables and data in the dataset
• Using the ‘Interactive Viewer’ in the ‘Data Explorer’ node, 16 numeric variables were discovered
• The properties of the variables were expanded to explore their missing values; the only missing values found were in Rows 7, 18 and 397 of the ‘display size’ variable
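The project performed this check with KNIME nodes. For readers who prefer code, the same check can be sketched in plain Python. This is an illustration under assumptions: the function name `find_missing` and the sample values are made up, and only the column names echo the slides.

```python
def find_missing(rows):
    """Map each column name to the 1-based row numbers where its value is missing."""
    missing = {}
    for row_number, row in enumerate(rows, start=1):
        for column, value in row.items():
            # Treat both None and empty strings as missing.
            if value is None or value == "":
                missing.setdefault(column, []).append(row_number)
    return missing


# Illustrative rows only; the values are not from the project dataset.
rows = [
    {"display_size": 6.5, "ratings": 4.3, "sales": 120.0},
    {"display_size": None, "ratings": 4.4, "sales": 95.5},
    {"display_size": 6.1, "ratings": 4.2, "sales": 60.0},
]

missing = find_missing(rows)
print(missing)
```

When the data comes from a CSV file, `csv.DictReader` yields rows in this shape, with empty cells appearing as empty strings.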
11. Treat Missing Values in Dataset
• Since ‘display size’ is treated as a categorical variable, its mode was used to replace its missing values
• To do this, the data in the ‘display size’ column was converted from numbers to strings using the ‘Number-To-String’ node
• The ‘Missing Value’ node was used to replace the three missing values with the ‘Most Frequent Value’ of the column, that is, its mode of 6.5
• Finally, the ‘String-To-Number’ node was deployed to return this column to its original data format for modelling purposes
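The mode replacement described above can be sketched in Python as follows. The helper name `fill_with_mode` and the sample column are assumptions made for illustration; only the mode of 6.5 mirrors the slide.

```python
from statistics import mode


def fill_with_mode(values):
    """Replace None entries with the most frequent non-missing value."""
    most_frequent = mode([v for v in values if v is not None])
    return [most_frequent if v is None else v for v in values]


# Illustrative column only; None marks a missing display size.
display_size = [6.5, 6.1, None, 6.5, 6.4, None, 6.5]

filled = fill_with_mode(display_size)
print(filled)
```

In this sketch the values stay numeric throughout, so no conversion to strings and back is needed.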
12. Observe and Treat Outliers
A histogram of ‘ratings’ was constructed to study its distribution:
The histogram of ‘ratings’ shows a slight left skew. The median of ‘ratings’, the middle value when the ratings are ordered from smallest to largest, is 4.3, while the mean, the average of all ratings, is 4.339. The difference between the mean and the median is only 0.039, which places them tightly together. When the median is this close to the mean, the distribution is approximately symmetric despite the slight skew. About 50% of the ‘ratings’ lie in the Interquartile Range, between 4.3 and 4.4, while about 25% of the ‘ratings’ are higher than Quartile 3, between 4.4 and 4.5. About 25% of the ‘ratings’ are lower than Quartile 1, between 4.2 and 4.3
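The summary statistics discussed above (mean, median and quartiles) can be reproduced with Python's `statistics` module. The sample below is made up for illustration, and comparing the mean with the median is only a rough hint of skewness, not a formal test.

```python
from statistics import mean, median, quantiles

# Illustrative ratings only, not the project data.
ratings = [3.0, 4.2, 4.3, 4.3, 4.3, 4.4, 4.4, 4.5, 4.6]

avg = mean(ratings)
mid = median(ratings)
# quantiles(..., n=4) returns the three cut points Q1, Q2 (the median) and Q3.
q1, q2, q3 = quantiles(ratings, n=4)

# Rough heuristic: a mean below the median suggests a longer left tail.
if avg < mid:
    skew_hint = "left"
elif avg > mid:
    skew_hint = "right"
else:
    skew_hint = "none"

print(f"mean={avg:.3f} median={mid} Q1={q1:.2f} Q3={q3:.2f} skew hint={skew_hint}")
```

Different tools interpolate quartiles differently, so the cut points from this sketch need not match those reported by KNIME.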
13. Observe and Treat Outliers
A box plot of ‘ratings’ was constructed to study its outliers:
The box plot revealed a total of 6 outliers. One value (in Row 49 of the dataset) is above the upper whisker and five values (in Rows 158, 259, 286, 320 and 408 of the dataset) are below the lower whisker. Relating these six outliers to real-life circumstances, the decision was not to treat them, since it is realistic to observe ratings of 4.6 (for ‘ratings’ in Row 49) and 3.0 (for ‘ratings’ in Row 320) on a 5-point customer rating scale. These rows were therefore kept in the analysis
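A box plot's whiskers are commonly drawn at 1.5 times the interquartile range beyond the quartiles. Assuming that rule (the exact rule and quartile method used by the KNIME box plot are not stated in the slides), outliers could be flagged as in this sketch, where `whisker_outliers` and the sample ratings are illustrative:

```python
from statistics import quantiles


def whisker_outliers(values, k=1.5):
    """Return (row_number, value) pairs lying outside Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [
        (row, value)
        for row, value in enumerate(values, start=1)
        if value < lower or value > upper
    ]


# Illustrative ratings only, not the project data.
ratings = [4.3, 4.4, 3.0, 4.3, 4.6, 4.2, 4.4, 4.3, 4.5]

outliers = whisker_outliers(ratings)
print(outliers)
```

Flagging a value only marks it for review; as in the project, a flagged rating can be kept when it is plausible in real life.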
14. Check for Multicollinearity Amongst Variables
The ‘Linear Correlation’ node was used to observe the correlation coefficients between all the numerical variables. After sorting the ‘Correlation Value’ in descending order in the ‘View’ function of the ‘Linear Correlation’ node, the correlation value between the variables ‘num_of_ratings’ and ‘sales’ was found to be 0.9418 (94.18%). This suggests that these two variables are highly correlated. Multicollinearity, that is, high correlation among variables, reduces the precision of the estimated coefficients, since they shift wildly with slight changes in other independent variables. In such situations, the p-values cannot reliably identify the independent variables that are statistically significant. To strengthen the statistical power of the regression model, the multicollinearity of these variables needs to be removed. Typically, variables whose correlation values exceed 0.70 are deemed highly correlated and need to be treated
15. Treating for Multicollinearity Amongst Variables
The following steps were taken to achieve this outcome:
• Observe the correlation values and identify the highly correlated quantitative (numerical) variables, that is, those with a correlation value >0.7
• Shift this variable to the ‘Exclude’ box of the ‘Configure’ function of the ‘Linear Correlation’ node
• Using the remaining variables, re-execute the ‘Linear Correlation’ node
• Observe the correlation values of the remaining variables after re-executing the node
• Identify the next highly correlated variables
• Repeat this process until all the variables have a correlation value of <0.7
• This process was not repeated, as no other highly correlated quantitative (numerical) variables were found after treating the multicollinearity of ‘num_of_ratings’ and ‘sales’
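The exclude-and-recheck loop described on this slide can be sketched in Python. Everything named here is an assumption for illustration: the columns `feature_a`, `feature_b` and `feature_c` are hypothetical, and dropping the second variable of the most correlated pair is an arbitrary choice, since the slides do not say which variable of a pair was excluded.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def exclude_correlated(columns, threshold=0.7):
    """Drop one variable at a time until no pair has |r| above the threshold."""
    kept = dict(columns)
    excluded = []
    while len(kept) > 1:
        # Find the pair with the strongest absolute correlation.
        pair, strength = max(
            ((p, abs(pearson(kept[p[0]], kept[p[1]]))) for p in combinations(kept, 2)),
            key=lambda item: item[1],
        )
        if strength <= threshold:
            break
        # Assumption: exclude the second variable of the pair.
        excluded.append(pair[1])
        del kept[pair[1]]
    return list(kept), excluded


# Hypothetical columns with made-up values.
columns = {
    "feature_a": [10, 20, 30, 40, 50],
    "feature_b": [12, 19, 33, 41, 48],
    "feature_c": [4.3, 4.4, 4.2, 4.4, 4.3],
}

kept, excluded = exclude_correlated(columns)
print(kept, excluded)
```

Here one pass removes `feature_b`, after which no remaining pair exceeds the threshold and the loop stops, mirroring the single pass reported in the project.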
16. Build the Linear Regression Model By:
1. The ‘Partitioning’ node was configured to split the dataset into training and testing sets in the ratio of 7:3
2. A ‘Linear Regression Learner’ was created and configured with ‘sales’ as the ‘Target’
3. Two sets of ‘Regression Predictor’ and ‘Numeric Scorer’ nodes were created: one to ingest the training dataset and the other to process the data from the testing dataset
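A 7:3 split like the one configured in the ‘Partitioning’ node can be sketched in Python. The sampling mode used in the project is not stated, so random sampling with a fixed seed is an assumption here, as are the names `partition` and `row_ids`.

```python
import random


def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-ins for the 418 rows of the dataset.
row_ids = list(range(1, 419))

train, test = partition(row_ids)
print(len(train), len(test))
```

Every row lands in exactly one of the two sets, so the model is scored on data it did not see during training.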
17. Evaluate the Linear Regression Model
After the training and testing datasets from the ‘Partitioning’ node were fed into the learner and predictors, the numeric scorers produced the following metrics:
[Figures: Training Dataset Numeric Scorer; Testing Dataset Numeric Scorer]
The model performed well on both the training and testing datasets. The R-squared is around 0.882 on the training dataset and 0.928 on the testing dataset. Both values are high; the higher the R-squared, the better the model fits the data and the more closely the predictions approximate the real data points. This is a clear indication of a good model, one that is able to explain up to 88% of the variance in the sales of mobile phones. The Mean Absolute Error indicates that the model predicts sales of mobile phones with a mean error of 9.4 units of SGD on the testing dataset
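For reference, the two metrics discussed above can be computed by hand. This is a minimal sketch with made-up actual and predicted values; the function names are illustrative and the numbers are not the project's results.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative values only.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 39]

r2 = r_squared(actual, predicted)
mae = mean_absolute_error(actual, predicted)
print(f"R-squared = {r2:.3f}, MAE = {mae:.1f}")
```

R-squared is unitless, while the Mean Absolute Error is expressed in the same units as the target variable.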
18. Identify Significant Variables
The p-value measures the statistical significance of each variable's relationship with the target. There are 11 variables whose p-values are more than 0.05, starting with ‘battery_capacity’ at 0.799. Typically, a p-value less than or equal to 0.05 is considered statistically significant, meaning the observed relationship is unlikely to be a result of chance
19. Rebuild the Model with Significant Variables Only By:
• Shifting the variable with the highest p-value above 0.05 to the ‘Exclude’ box of the ‘Configure’ function of the ‘Linear Regression Learner’
• Re-executing the node with the remaining variables
• Observing the changes in the p-values through the ‘Coefficients and Statistics’ function of the node
• Identifying the next variable with the highest p-value
• Continuing to iterate the process until all p-values of the remaining variables are ≤ 0.05
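The elimination loop above can be sketched in Python. To keep the sketch free of dependencies, model fitting is replaced by a stub: `fit_pvalues` returns fixed illustrative p-values, whereas a real refit would recompute every p-value after each exclusion. Only the 0.799 for ‘battery_capacity’ comes from the slides; the other numbers and `hypothetical_var` are made up.

```python
def backward_eliminate(variables, fit_pvalues, alpha=0.05):
    """Repeatedly drop the variable with the highest p-value until all are <= alpha."""
    remaining = list(variables)
    while remaining:
        pvalues = fit_pvalues(remaining)
        worst = max(remaining, key=lambda name: pvalues[name])
        if pvalues[worst] <= alpha:
            break
        remaining.remove(worst)
    return remaining


# Illustrative p-values. Only battery_capacity's 0.799 is taken from the slides.
ILLUSTRATIVE_P = {
    "battery_capacity": 0.799,
    "discount_percent": 0.001,
    "display_size": 0.03,
    "hypothetical_var": 0.20,
}


def fit_pvalues(variables):
    """Stand-in for refitting the regression; a real fit would update the p-values."""
    return {name: ILLUSTRATIVE_P[name] for name in variables}


significant = backward_eliminate(list(ILLUSTRATIVE_P), fit_pvalues)
print(significant)
```

Removing one variable per iteration matters because p-values can change once a correlated variable leaves the model.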
These are the six variables with p-value ≤ 0.05 that were retained to rebuild the model, since they are statistically significant:
20. Evaluate the Rebuilt Linear Regression Model
After the model was rebuilt, the scorers for the training and testing datasets showed the following information:
[Figures: Training Dataset Numeric Scorer; Testing Dataset Numeric Scorer]
This model continues to perform well on both the training and testing datasets. The R-squared is around 0.875 on the training dataset and 0.924 on the testing dataset. These are 0.007 and 0.004 lower than in the original model. Nevertheless, both values are high; the higher the R-squared, the better the model fits the data and the more closely the predictions approximate the real data points. This is a clear indication that the rebuilt model is a good one, able to explain up to 88% of the variance in the sales of mobile phones. The Mean Absolute Error indicates that the model predicts sales of mobile phones with a mean error of 9 units of rupees on the testing dataset
21. Findings & Conclusions*
Key Features Driving Mobile Phone Sales
• ‘discount_percent’ appears to be the only variable with a comparatively high positive coefficient on ‘sales’. An increase of one unit in ‘discount_percent’ will increase ‘sales’ by 0.46 units of SGD
• Similarly, ‘display_size’ has the most negative impact on ‘sales’. An increase of one unit in the ‘display_size’ variable would decrease ‘sales’ by around 1 unit of SGD
• In ranking order, ‘num_of_ratings’, ‘model’, ‘processor’ and ‘num_rear_camera’ have similar negative effects on ‘sales’. A unit increase in these would reduce ‘sales’ by 0.38 units of SGD
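To illustrate how such coefficients are read, the sketch below multiplies each coefficient by a change in its variable and sums the results. The two coefficients are the ones quoted above (with ‘display_size’ rounded to -1.0); the function name and the example changes are assumptions, and the arithmetic holds the other variables fixed.

```python
# Coefficients quoted in the findings; display_size is approximate ("around 1").
COEFFICIENTS = {"discount_percent": 0.46, "display_size": -1.0}


def predicted_sales_change(changes):
    """Sum coefficient * change over the variables being varied."""
    return sum(COEFFICIENTS[name] * delta for name, delta in changes.items())


# Hypothetical scenario: discount up by 5 units, display size up by 0.5 units.
change = predicted_sales_change({"discount_percent": 5, "display_size": 0.5})
print(f"Predicted change in sales: {change:.2f} units")
```

This linear reading applies only within the range of values seen in the data and does not by itself establish that the variable causes the change.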
* More details are found in the project report, which are
not released at the request of the Social Enterprise
22. Recommendations*
Recommendations On Inventory Planning
1. Consider offering higher discounts
2. Stock mobile phones with smaller display sizes and fewer rear cameras
3. Narrow the range of models stocked
4. Keep phones whose processors are encoded at a lower value
* More details are found in the project report, which are
not released at the request of the Social Enterprise