2. Contents
• Business Objective – Problem Statement
• Solution Methodology
• Data Preparation and Consolidation
• Key Challenges
• Result
• Conclusion
3. The Google Play Store team is about to launch a new feature in which certain promising apps are boosted in visibility. The boost will
manifest in multiple ways, including higher priority in recommendation
sections ("Similar apps", "You might also like", "New and updated
games") and greater visibility in search results. This
feature will help bring more attention to newer apps that have potential.
Build a model that predicts an app's rating from the other information
provided about the app.
Business Objective – Problem Statement
4. Solution Methodology
• Data Cleaning
• Data Visualization
• Outlier Treatment with IQR
• Removing Skewness in the Data
• Model Building
• Loss Calculation
Technology Enablers: NumPy, Pandas, Plotly, Seaborn, Cufflinks, Sklearn
Realized Objective: a prediction model with minimum error
5. Data Preparation and Consolidation
• The data frame has 10,841 rows and 13 distinct columns; the dependent variable is "Rating." While visualizing
the "Rating" values, we detected invalid ratings in the data, since apps are rated between
0 and 5 stars.
• Removing variables with no meaningful correlation with the target: Last Updated, Current Ver,
Android Ver.
• Transforming the data into numerical form: string columns are parsed and converted to
numeric types using several methods.
• Cleaning the data by imputing all null values with the median, and dropping rows that fall
outside the interquartile range to remove the outliers detected earlier.
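The preparation steps above can be sketched in Pandas. This is a minimal sketch, not the authors' actual code: the column names (`Rating`, `Installs`, `Price`, `Last Updated`, `Current Ver`, `Android Ver`) follow the public Kaggle Play Store dataset and are assumed to match the data used here.

```python
import pandas as pd

def clean_playstore_df(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the cleaning steps: drop invalid ratings, drop weakly
    correlated columns, convert strings to numbers, impute with the median."""
    df = df.copy()
    # Apps are rated between 0 and 5 stars; keep nulls for imputation below.
    df = df[df["Rating"].isna() | df["Rating"].between(0, 5)]
    # Drop variables with no meaningful correlation with the target.
    df = df.drop(columns=["Last Updated", "Current Ver", "Android Ver"],
                 errors="ignore")
    # Convert string columns such as Installs ("1,000+") and Price ("$0.99")
    # to numeric form.
    df["Installs"] = (df["Installs"].astype(str)
                        .str.replace(r"[+,]", "", regex=True)
                        .astype(float))
    df["Price"] = (df["Price"].astype(str)
                     .str.replace("$", "", regex=False)
                     .astype(float))
    # Impute missing ratings with the median to limit bias.
    df["Rating"] = df["Rating"].fillna(df["Rating"].median())
    return df
```

Outlier removal via the interquartile range would then be applied to the numeric columns of the cleaned frame.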
6. Data Preparation and Consolidation (Results)
[Figure: rating distribution after removal of outliers through the interquartile range]
8. Key Challenges
• Imputing null values in the dependent variable with the median to reduce bias in the final prediction
model (imputing with the mean increased deviation in the data)
• Removing unwanted characters and converting each column to its proper datatype
• Plotting the data with box, bar, and cumulative-distribution plots, for individual attributes and for
combinations of attributes, to identify outliers and the distribution of the data within a given range
• Treating outliers with the interquartile range, and reducing skewness with a log transformation to
remove non-linearity
• Limited data available for training the model, so only minimal patterns could be identified for prediction
• Comparing the results of all the algorithms to determine the best model, which is ambiguous when one
model has good accuracy but larger error and vice versa
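The outlier and skewness treatments named above can be sketched as two small helpers. This is a hedged sketch under the usual conventions (1.5 × IQR fences, `log1p` for non-negative right-skewed values), not the authors' exact code:

```python
import numpy as np
import pandas as pd

def iqr_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values inside [Q1 - k*IQR, Q3 + k*IQR];
    rows where the mask is False are dropped as outliers."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.between(q1 - k * iqr, q3 + k * iqr)

def reduce_skew(s: pd.Series) -> pd.Series:
    """log1p transform to reduce right skew (values assumed non-negative)."""
    return np.log1p(s)
```

For example, `df[iqr_mask(df["Installs"])]` keeps only rows whose install counts fall inside the interquartile fences, and `reduce_skew` can then be applied to the surviving heavy-tailed columns.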
10. Conclusion
Results
• OLS Regression gives the best R² score, 0.98, on the
testing data, but it may not be reliable for prediction, as
statistical techniques can be ambiguous on new data; its
error rate is a minimal 12%.
• Polynomial regression gives an R² score of
0.24, but its RMSE is large due to the non-linearity of the data.
• The bagged Decision Tree is unfit for predicting the
rating, with low accuracy and high error.
• Further hyperparameter tuning might improve SVR
accuracy, as its R² score is 0.08 and its error rate is 0.9,
but this requires more computing power.
• Overall, OLS Regression and the Support Vector Regressor
proved to be the better candidates for predicting the dependent
variable, but training on much more data is essential for learning
patterns, which would decrease the error rate.
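The model comparison described above can be sketched with Sklearn. This is an illustrative sketch only: it runs on synthetic stand-in data (the real features are the cleaned numeric Play Store columns), so the R² and RMSE values it produces will not match the scores reported in the slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic stand-in for the cleaned app features and ratings.
X = rng.normal(size=(500, 4))
y = X @ np.array([0.5, -0.2, 0.1, 0.3]) + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit each candidate model and record (R^2, RMSE) on the held-out split.
results = {}
for name, model in [("OLS", LinearRegression()), ("SVR", SVR())]:
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = (r2_score(y_te, pred),
                     mean_squared_error(y_te, pred) ** 0.5)
```

The same loop extends to polynomial regression (via `PolynomialFeatures`) and a bagged Decision Tree; comparing the (R², RMSE) pairs side by side is what drives the model choice described above.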