1. lOMoAR cPSD|24598226
Downloaded by Rakesh Swain (srakeshswain005@gmail.com)
Car Price prediction final pdf
Computer science (AJ Institute of Engineering and Technology)
Studocu is not sponsored or endorsed by any college or university
A Dinesh (19BD1A05C1)
A Rahul (19BD1A05C5)
E Sri Kumar (19BD1A05CJ)
G Pranav (19BD1A05CK)
Under the guidance of
Ms. NASREEN SULTANA
Assistant Professor
Department of CSE
A Mini Project Report on
CAR PRICE PREDICTION USING LINEAR REGRESSION
Submitted to
Jawaharlal Nehru Technological University, Hyderabad
in partial fulfillment of requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND ENGINEERING
By
Department of Computer Science and Engineering
KESHAV MEMORIAL INSTITUTE OF TECHNOLOGY
Approved by AICTE, Affiliated to JNTUH
3-5-1206, Narayanaguda, Hyderabad – 500029
2022-2023
KESHAV MEMORIAL INSTITUTE OF TECHNOLOGY
(Accredited by NBA & NAAC, Approved By A.I.C.T.E., Reg by Govt of Telangana
State & Affiliated to JNTU, Hyderabad)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
CERTIFICATE
This is to certify that the project entitled CAR PRICE PREDICTION USING LINEAR
REGRESSION being submitted by
A. Dinesh (19BD1A05C1)
A. Rahul (19BD1A05C5)
E. Sri Kumar (19BD1A05CJ)
G. Pranav (19BD1A05CK)
In partial fulfilment for the award of Bachelor of Technology in Computer Science and Engineering
affiliated to the Jawaharlal Nehru Technological University, Hyderabad during the year 2022-23.
Internal Guide Head of the Department
(Ms. Nasreen Sultana) (Dr. S. Padmaja)
Submitted for Viva Voce Examination held on
External Examiner
Unit of Keshav Memorial Educational Society
#: 3-5-1026 Narayanaguda Hyderabad 500029.
040-3261407 www.kmit.in e-mail: principal@kmit.in
Vision of KMIT
Producing quality graduates trained in the latest technologies and related tools and striving to make India a
world leader in software and hardware products and services. To achieve academic excellence by imparting
in depth knowledge to the students, facilitating research activities and catering to the fast growing and ever-
changing industrial demands and societal needs.
Mission of KMIT
To provide a learning environment that inculcates problem solving skills, professional, ethical
responsibilities, lifelong learning through multi modal platforms and prepare students to become
successful professionals.
To establish industry-institute interaction to make students ready for the industry.
To provide exposure to students on latest hardware and software tools.
To promote research based projects/activities in the emerging areas of technology convergence.
To encourage and enable students to not merely seek jobs from the industry but also to create new
enterprises.
To induce a spirit of nationalism which will enable the students to understand India's
challenges and encourage them to develop effective solutions.
To support the faculty to accelerate their learning curve to deliver excellent service to students.
Vision & Mission of CSE
Vision of the CSE
To be among the region's premier teaching and research Computer Science and Engineering departments
producing globally competent and socially responsible graduates in the most conducive academic
environment.
Mission of the CSE
To provide faculty with state of the art facilities for continuous professional development and
research, both in foundational aspects and of relevance to emerging computing trends.
To impart skills that transform students to develop technical solutions for societal needs and
inculcate entrepreneurial talents.
To inculcate an ability in students to pursue the advancement of knowledge in various specializations
of Computer Science and Engineering and make them industry-ready.
To engage in collaborative research with academia and industry and generate adequate resources for
research activities for seamless transfer of knowledge resulting in sponsored projects and
consultancy.
To cultivate responsibility through sharing of knowledge and innovative computing solutions that
benefit the society-at-large.
To collaborate with academia, industry and community to set high standards in academic excellence
and in fulfilling societal responsibilities.
PROGRAM OUTCOMES (POs)
1. Engineering Knowledge: Apply the knowledge of mathematics, science, engineering fundamentals and
an engineering specialization to the solution of complex engineering problems.
2. Problem analysis: Identify, formulate, review research literature, and analyse complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and
engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and design system
components or processes that meet the specified needs with appropriate consideration for the public health
and safety, and the cultural, societal, and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and research methods
including design of experiments, analysis and interpretation of data, and synthesis of the information to
provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering
and IT tools including prediction and modelling to complex engineering activities with an understanding of
the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal, health,
safety, legal and cultural issues and the consequent responsibilities relevant to professional engineering
practice.
7. Environment and sustainability: Understand the impact of the professional engineering solutions in
societal and environmental contexts and demonstrate the knowledge of, and need for sustainable
development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the
engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader in diverse
teams and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as being able to comprehend and write effective reports and
design documentation, make effective presentations, and give and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the engineering and
management principles and apply these to one's own work, as a member and leader in a team, to manage
projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
PROGRAM SPECIFIC OUTCOMES (PSOs)
PSO1: An ability to analyse the common business functions to design and develop appropriate Information
Technology solutions for social upliftment.
PSO2: Shall have expertise in evolving technologies like Python, Machine Learning, Deep Learning, Internet of
Things (IoT), Data Science, Full stack development, Social Networks, Cyber Security, Big Data, Mobile Apps, CRM,
ERP, etc.
PROGRAM EDUCATIONAL OBJECTIVES (PEOs)
PEO1: Graduates will have successful careers in computer related engineering fields or will be able to
successfully pursue advanced higher education degrees.
PEO2: Graduates will try and provide solutions to challenging problems in their profession by applying
computer engineering principles.
PEO3: Graduates will engage in life-long learning and professional development by rapidly adapting
to changing work environments.
PEO4: Graduates will communicate effectively, work collaboratively and exhibit high levels of
professionalism and ethical responsibility.
PROJECT OUTCOMES
P1: To provide a friendly environment to the user
P2: To predict dependent variable for given user input data(features).
P3: To give the accurate price for used cars.
P4: Developing web applications using flask-framework.
LOW - 1
MEDIUM - 2
HIGH - 3
PROJECT OUTCOMES MAPPING PROGRAM OUTCOMES
PO PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
P1
3 3 2 2
P2
2 2 2 2 2 1
P3
2 2 3 2 2 2
P4
1 2 3 2 2 1
PROJECT OUTCOMES MAPPING PROGRAM SPECIFIC OUTCOMES
PSO PSO1 PSO2
P1
1
P2
3
DECLARATION
We hereby declare that the project report entitled “CAR PRICE PREDICTION USING
LINEAR REGRESSION (M.L)” is done in the partial fulfillment for the award of the Degree in
Bachelor of Technology in Computer Science and Engineering affiliated to Jawaharlal Nehru
Technological University, Hyderabad. This project has not been submitted anywhere else.
ABBI DINESH (19BD1A05C1)
ASAD RAHUL (19BD1A05C5)
ERRAGALA SRI KUMAR (19BD1A05CJ)
GYARA PRANAV KUMAR (19BD1A05CK)
ACKNOWLEDGMENT
We take this opportunity to thank all the people who have rendered their full support
to our project work.
We render our thanks to Dr. Maheshwar Dutta, B.E., M Tech., Ph.D., Principal who
encouraged us to do the Project.
We are grateful to Mr. Neil Gogte, Director for facilitating all the amenities required
for carrying out this project.
We express our sincere gratitude to Mr. S. Nitin, Director and Dr. D. Jaya Prakash,
Dean Academics for providing an excellent environment in the college.
We are also thankful to Dr. S. Padmaja, Head of the Department for providing us
with both time and amenities to make this project a success within the given schedule.
We are also thankful to our guide Ms. Nasreen Sultana, for her valuable guidance
and encouragement given to us throughout the project work.
We would like to thank the entire CSE Department faculty, who helped us directly
and indirectly in the completion of the project. We sincerely thank our friends and family
for their constant motivation during the project work.
ABBI DINESH (19BD1A05C1)
ASAD RAHUL (19BD1A05C5)
ERRAGALA SRI KUMAR (19BD1A05CJ)
GYARA PRANAV KUMAR (19BD1A05CK)
CONTENT
DESCRIPTION PAGE NO.
ABSTRACT i
LIST OF FIGURES ii
LIST OF TABLES iii
CHAPTERS
1. INTRODUCTION 1-14
1.1. Machine Learning 1
1.2. What is Machine Learning 1
1.3. Types of Machine Learning 3
1.4. Linear Regression 8
1.5. Objective & Problem Statement 13
1.6. Purpose of Project 13
1.7. Architecture Diagram 14
1.8. Project Goal 14
2. SOFTWARE REQUIREMENTS SPECIFICATIONS 15-16
2.1. Requirements Specification Document 16
2.2. Functional Requirements 17
2.3. Non-Functional Requirements 17
2.4. Software Requirements 18
2.5. Hardware Requirements 18
2.6. Requirement Analysis 19
2.7. Test Construction and verification 20
2.8. Test Execution and Bug Reporting 20
2.9. Final Testing and Implementation 20
2.10. Post Implementation 20
2.11. Technologies used 21
CAR PRICE PREDICTION
3. LITERATURE SURVEY 24-27
3.1. Proposed Model 25
3.2. Paper Work 26
3.3. Related Work 27
4. SYSTEM DESIGN 28-33
4.1. Introduction to UML 29
4.2. UML Diagrams 29
4.2.1. Use Case diagram 29
4.2.2. Sequence diagram 31
4.2.3. Class diagram 33
4.2.4. System Design 34
4.2.5. State Chart Diagram 36
5. IMPLEMENTATION 38-59
5.1. Pseudo code 39
5.2. Data Cleaning using Google Colab 40
5.3. Code Snippets 52
6. TESTING 60-72
6.1. Introduction to Testing 61
6.2. Test Cases 63
7. SCREENSHOTS 73-75
7.1. Layout of Testing Platform 74
7.2. Log & Reference 74
7.3. UI of Web Application 75
8.FURTHER ENHANCEMENTS 76
9.CONCLUSION 78
10.REFERENCES 80
ABSTRACT
In this fast-moving generation, the present study proposes the newer concept of
predicting the prices of certain items. With an idea and motivation to help everyone we
came up with a solution to get an appropriate estimate of one’s car using Machine
Learning techniques, which will save a lot of time and money. Car price prediction has
been a high-interest research area, as it requires noticeable effort and the knowledge of a
field expert. A considerable number of distinct attributes are examined for reliable and
accurate prediction. The production of cars has been steadily increasing in the past
decade, with over 70 million passenger cars being produced in the year 2016. This has
given rise to the used car market, which on its own has become a booming industry. The
recent advent of online portals has facilitated the need for both the customer and the
seller to be better informed about the trends and patterns that determine the value of a
used car in the market. To build a model for predicting the price of used cars, we
applied one of the machine learning techniques, i.e., Linear Regression. In linear
regression, there can be multiple independent variables, but one and only one dependent
variable, whose actual and predicted values are compared to assess the accuracy of the results. Our
paper proposes a system where price is the dependent variable to be predicted, and this
price is derived from factors like kilometers driven, car purchase year, car company, car
model, and the fuel type.
Keywords: Car Price Prediction, Linear Regression, Machine Learning, dependent
variable etc.
LIST OF FIGURES
LIST OF FIGURES PAGE NO
1.1 Machine Learning 1
1.2 Machine Learning & Traditional Programming 2
1.3 Types of Machine Learning 3
1.3.1 Data Set of Supervised Learning 3
1.3.1.2 Types of Supervised Learning 4
1.3.2 Unsupervised 5
1.3.2.1 Types of Unsupervised Learning 6
1.3.4 Reinforcement Learning 7
1.4 Linear Regression 8
1.7 Architecture of Linear Regression 14
3.8.1 Google colab 22
4.2.1 Use Case Diagram -UML 30
4.2.2 Sequence Diagram –UML 32
4.2.3 Class Diagram –UML 33
4.2.4 System Design-UML 35
4.2.5 State Chart Diagram –UML 37
7.1 Selenium IDE Testing Platform 74
7.2 Log & Reference using Selenium IDE 74
7.3 Register page of Web Application -UI 75
7.4 Login page of Web Application -UI 75
7.5 Home page of Web Application-UI 76
7.6 Displaying available car companies -UI 76
7.7 Displaying suitable car models -UI 77
7.8 Displaying available years -UI 77
7.9 Displaying available Fuel Types- UI 78
7.10 Displaying Predicted Price -UI 78
LIST OF TABLES
6.2 Test Case for Web Application 62
6.2.1 Launching web application 62
6.2.2 Registration of user details 64
6.2.3 Login Positive test case 65
6.2.4 Login Negative test case 66
6.2.5 Displaying Attributes 66
6.2.6 Selecting Attributes 68
6.2.7 Selecting attributes for correct attributes 69
6.2.8 Selecting attributes for incorrect attributes 70
6.2.9 Home button Test case 71
6.2.10 Logout button Test case 72
1. INTRODUCTION
1.1 MACHINE LEARNING
Machine Learning is the field of study that gives computers the capability to learn without
being explicitly programmed. ML is one of the most exciting technologies that one would have
ever come across. As is evident from the name, it gives the computer the ability that makes it more
similar to humans: the ability to learn. Machine learning is actively being used today, perhaps in
many more places than one would expect.
Figure 1.1 Machine Learning
1.2 What is Machine Learning?
Arthur Samuel, a pioneer in the field of artificial intelligence and computer gaming, coined
the term “Machine Learning”. He defined machine learning as a “field of study that gives
computers the capability to learn without being explicitly programmed”. In layman’s
terms, Machine Learning (ML) can be explained as automating and improving the learning
process of computers based on their experiences without being explicitly programmed, i.e. without
any human assistance. The process starts with feeding good quality data and then training our
machines (computers) by building machine learning models using the data and different
algorithms. The choice of algorithms depends on what type of data we have and what kind of
task we are trying to automate. Example: Training of students during exams. While preparing
for the exams, students don’t actually cram the subject but try to learn it with complete
understanding. Before the examination, they feed their machine (brain) with a good amount of
high-quality data (questions and answers from different books or teachers’ notes, or online video
lectures).
KESHAV MEMORIAL INSTITUTE OF TECHNOLOGY Page 1
Figure 1.2 Machine Learning & Traditional Programming
In effect, they are training their brain with inputs as well as outputs, i.e. what kind of approach or
logic they need to solve different kinds of questions. Each time they solve practice test
papers, they measure their performance (accuracy/score) by comparing their answers with the answer key
given. Gradually, the performance keeps on increasing, and they gain more confidence in the adopted
approach. That is how models are actually built: we train the machine with data (both inputs and outputs
are given to the model), and when the time comes we test it on data (with inputs only) and obtain our
model scores by comparing its answers with the actual outputs, which were not fed to it during
training. Researchers are working with assiduous efforts to improve algorithms and techniques so
that these models perform even better.
1.2.1 Basic Difference between ML and Traditional Programming
Traditional Programming: We feed in DATA (Input) + PROGRAM (logic), run it on
the machine, and get the output.
Machine Learning: We feed in DATA (Input) + Output, run it on the machine during
training and the machine creates its own program (logic), which can be evaluated while
testing.
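The contrast above can be illustrated with a minimal sketch. The rule `price_rule`, its coefficients and the toy data are made up for illustration and are not taken from the project; the point is only that the "machine learning" half recovers the logic from input/output pairs instead of having it written by hand.

```python
# Traditional programming: the rule (logic) is written by hand.
def price_rule(km_driven):
    # Hypothetical hand-coded rule: start at 10.0 and lose 0.05 per 1000 km.
    return 10.0 - 0.05 * km_driven


# Machine learning: the rule is estimated from (input, output) pairs.
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy training data (made up): thousands of km driven -> price.
km = [10, 20, 40, 60, 80]
price = [price_rule(k) for k in km]  # outputs supplied as labels

slope, intercept = fit_line(km, price)
print(round(slope, 4), round(intercept, 4))
```

Because the toy labels were generated without noise, the fitted slope and intercept match the hand-written rule; with real data they would only approximate it.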
1.3 Types of Machine Learning
A machine is said to be learning from past experiences (data fed in) with respect
to some class of tasks if its Performance in a given Task improves with the Experience.
For example, assume that a machine has to predict whether a customer will buy a specific
product, let’s say “Antivirus”, this year or not. The machine will do it by looking at the
previous knowledge/past experiences, i.e. the data of products that the customer had
bought every year, and if he buys Antivirus every year, then there is a high probability
that the customer is going to buy an antivirus this year as well. This is how machine
learning works at the basic conceptual level.
Figure 1.3 Types of Machine Learning
1.3.1 Supervised Learning
Supervised learning is when the model is trained on a labeled dataset. A labeled
dataset is one that has both input and output parameters. In this type of learning, both the training
and validation datasets are labeled, as shown in the figures below.
Example
Figure 1.3.1 Data Set
Both the above figures show labeled data sets, as follows:
Figure A: It is a dataset of a shopping store that is useful in predicting whether a
customer will purchase a particular product under consideration or not based on his/ her
gender, age, and salary.
Input: Gender, Age, Salary
Output: Purchased i.e. 0 or 1; 1 means yes the customer will purchase and 0 means that
the customer won’t purchase it.
Figure B: It is a Meteorological dataset that serves the purpose of predicting wind speed
based on different parameters.
Input: Dew Point, Temperature, Pressure, Relative Humidity, Wind Direction
Output: Wind Speed
1.3.1 Types of Supervised Learning:
Figure 1.3.1 Types of Supervised Learning
A. Classification:
It is a Supervised Learning task where the output has defined labels (discrete
values). For example, in Figure A above, Output – Purchased has defined labels, i.e. 0 or 1;
1 means the customer will purchase, and 0 means that the customer won’t purchase. The
goal here is to predict discrete values belonging to a particular class and evaluate them on
the basis of accuracy.
It can be either binary or multi-class classification. In binary classification, the model
predicts either 0 or 1 (yes or no), but in the case of multi-class classification, the model
predicts one of more than two classes. Example: Gmail classifies mails into more than two classes, like
social, promotions, updates, and forums.
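As a minimal sketch of binary classification evaluated by accuracy, the snippet below uses a 1-nearest-neighbour rule on made-up (age, salary) rows. The data, the function names and the choice of nearest neighbour are assumptions for illustration only; the project itself does not use this classifier.

```python
def predict_1nn(train_X, train_y, point):
    """Return the label of the training row closest to `point`."""
    best_label, best_dist = None, float("inf")
    for row, label in zip(train_X, train_y):
        # Squared Euclidean distance is enough for ranking neighbours.
        dist = sum((a - b) ** 2 for a, b in zip(row, point))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up rows: (age in years, salary in thousands) -> purchased (0 or 1).
train_X = [(22, 20), (25, 30), (30, 35), (45, 80), (50, 90), (55, 85)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(24, 25), (48, 88), (35, 40)]
test_y = [0, 1, 1]

predictions = [predict_1nn(train_X, train_y, p) for p in test_X]
score = accuracy(test_y, predictions)
print(predictions, score)
```

The third test row is deliberately placed near the "0" group while being labeled 1, so the sketch also shows what a misclassification does to the accuracy score.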
B. Regression:
It is a Supervised Learning task where the output has a continuous value.
For example, in Figure B above, Output – Wind Speed does not take discrete values
but is continuous in a particular range. The goal here is to predict a value as close
to the actual output value as our model can, and then evaluation is done by calculating the
error value. The smaller the error, the greater the accuracy of our regression model.
Examples of Supervised Learning Algorithms:
Linear Regression
Logistic Regression
Nearest Neighbor
Gaussian Naive Bayes
Decision Trees
Support Vector Machine (SVM)
Random Forest
1.3.2 Unsupervised Learning:
Unsupervised machine learning analyzes and clusters unlabeled datasets using
machine learning algorithms. These algorithms find hidden patterns in data without any
human intervention, i.e., we don’t give output to our model. The training model has only
input parameter values and discovers the groups or patterns on its own. The dataset in
Figure A is mall data that contains information about the clients who subscribe to the mall.
Once subscribed, they are provided a membership card, and the mall has complete
information about the customer and his/her every purchase. Now, using this data and
unsupervised learning techniques, the mall can easily group clients based on the
parameters we are feeding in.
Figure 1.3.2 Unsupervised Learning
The input to the unsupervised learning models is as follows:
Unstructured data: May contain noisy (meaningless) data, missing values, or unknown
data
1.3.2.1 Types of Unsupervised Learning are as follows:
Figure 1.3.2.1 Types of Unsupervised Learning
Clustering: Broadly, this technique is applied to group data based on different patterns,
such as similarities or differences, that our machine model finds. These algorithms are used to
process raw, unclassified data objects into groups. For example, in the above figure, we
have not given output parameter values, so this technique will be used to group clients
based on the input parameters provided by our data.
Association: This is a rule-based ML technique that finds useful
relations between parameters of a large data set. This technique is mainly used for
market basket analysis, which helps to better understand the relationship between different
products. For example, shopping stores use algorithms based on this technique to find out the
relationship between the sale of one product and the sale of another based on customer
behavior. For instance, if a customer buys milk, then he may also buy bread, eggs, or butter. Once
trained well, such models can be used by stores to increase their sales by planning different offers.
Some algorithms:
K-Means Clustering
DBSCAN – Density-Based Spatial Clustering of Applications with Noise
BIRCH – Balanced Iterative Reducing and Clustering using Hierarchies
Hierarchical Clustering
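To make the clustering idea concrete, here is a minimal sketch of K-Means (Lloyd's algorithm) on one-dimensional data. The spend values, the starting centroids and the function name `kmeans_1d` are illustrative assumptions, not part of the project; no labels are supplied, and the two groups emerge from the data alone.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data with given starting centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        for k, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = sum(members) / len(members)
    return centroids, clusters


# Made-up yearly spend (in thousands) for eight mall clients; no labels given.
spend = [2, 3, 4, 5, 40, 42, 45, 47]
centers, groups = kmeans_1d(spend, centroids=[0, 10])
print(centers, groups)
```

Real implementations usually pick the starting centroids at random and stop when the assignments no longer change; fixed starting points and a fixed iteration count are used here only to keep the sketch deterministic.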
1.3.3 Semi-supervised Learning:
As the name suggests, it lies between Supervised and Unsupervised
techniques. We use these techniques when we are dealing with data of which a small part is
labeled and the remaining large portion is unlabeled. We can use the unsupervised
techniques to predict labels and then feed these labels to supervised techniques.
This technique is mostly applicable in the case of image data sets, where usually not all
images are labeled.
1.3.4 Reinforcement Learning:
In this technique, the model keeps on increasing its performance using Reward
Feedback to learn the behavior or pattern. These algorithms are specific to a particular
problem, e.g. the Google self-driving car, or AlphaGo, where a bot competes with humans and
even itself to become a better and better performer at the game of Go. Each time we feed in data,
the model learns and adds the data to its knowledge, which is its training data. So, the more it
learns, the better it gets trained and hence experienced.
Figure 1.3.4 Reinforcement Learning
The agent observes the input.
The agent performs an action by making some decisions.
After its performance, the agent receives a reward and is reinforced accordingly, and the model
stores the state-action pair of information.
Some algorithms:
Temporal Difference (TD)
Q-Learning
Deep Adversarial Networks
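The reward-driven loop described above can be sketched with tabular Q-learning on a tiny corridor. The environment, the constants `ALPHA` and `GAMMA`, and the exhaustive sweep over every state-action pair (used instead of an exploring agent so that the result is deterministic) are all assumptions made for illustration.

```python
ACTIONS = ("left", "right")
N_STATES = 4          # states 0..3; state 3 is the goal (terminal)
ALPHA, GAMMA = 0.5, 0.9


def step(state, action):
    """Deterministic corridor: reward 1.0 only when the goal is reached."""
    nxt = max(0, state - 1) if action == "left" else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


# Q-table: one value per state-action pair, initialised to zero.
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for _ in range(50):
    for s in range(N_STATES - 1):        # the terminal state needs no update
        for a in ACTIONS:
            nxt, reward = step(s, a)
            if nxt == N_STATES - 1:
                best_next = 0.0          # no future value after the goal
            else:
                best_next = max(q[(nxt, b)] for b in ACTIONS)
            # Q-learning update: move Q(s, a) toward reward + gamma * max Q(s', .)
            q[(s, a)] += ALPHA * (reward + GAMMA * best_next - q[(s, a)])

# Greedy policy: in each state pick the action with the highest Q-value.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

Each entry of `q` is the stored state-action information mentioned in the steps above; the learned policy moves right in every state because that is the only way to collect the reward.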
1.4 Linear Regression
In statistics, linear regression is a linear approach for modelling the relationship
between a scalar response and one or more explanatory variables (also known as
dependent and independent variables). The case of one explanatory variable is called
simple linear regression; for more than one, the process is called multiple linear
regression. This term is distinct from multivariate linear regression, where multiple
correlated dependent variables are predicted, rather than a single scalar variable.
In linear regression, the relationships are modeled using linear predictor functions whose
unknown model parameters are estimated from the data. Such models are called linear
models. Most commonly, the conditional mean of the response given the values of the
explanatory variables (or predictors) is assumed to be an affine function of those values;
less commonly, the conditional median or some other quantile is used. Like all forms of
regression analysis, linear regression focuses on the conditional probability distribution
of the response given the values of the predictors, rather than on the joint probability
distribution of all of these variables, which is the domain of multivariate analysis.
Linear regression was the first type of regression analysis to be studied rigorously, and to
be used extensively in practical applications. This is because models which depend
linearly on their unknown parameters are easier to fit than models which are non-linearly
related to their parameters and because the statistical properties of the resulting estimators
are easier to determine.
Figure 1.4 Linear Regression
Linear regression has many practical uses. Most applications fall into one of the
following two broad categories:
If the goal is prediction, forecasting, or error reduction, linear
regression can be used to fit a predictive model to an observed data set of values of
the response and explanatory variables. After developing such a model, if additional
values of the explanatory variables are collected without an accompanying response
value, the fitted model can be used to make a prediction of the response.
If the goal is to explain variation in the response variable that can be attributed to
variation in the explanatory variables, linear regression analysis can be applied to
quantify the strength of the relationship between the response and the explanatory
variables, and in particular to determine whether some explanatory variables may
have no linear relationship with the response at all, or to identify which subsets of
explanatory variables may contain redundant information about the response.
Linear regression models are often fitted using the least squares approach, but they may
also be fitted in other ways, such as by minimizing the "lack of fit" in some other norm
(as with least absolute deviations regression), or by minimizing a penalized version of the
least squares cost function as in ridge regression (L2-norm penalty) and lasso (L1-norm
penalty). Conversely, the least squares approach can be used to fit models that are not
linear models. Thus, although the terms "least squares" and "linear model" are closely
linked, they are not synonymous.
Given a data set $\{y_i, x_{i1}, \ldots, x_{ip}\}_{i=1}^{n}$ of n statistical units, a linear regression model assumes
that the relationship between the dependent variable y and the p-vector of regressors x is
linear. This relationship is modeled through a disturbance term or error variable ε — an
unobserved random variable that adds "noise" to the linear relationship between the
dependent variable and regressors. Thus the model takes the form
$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i = \mathbf{x}_i^{T}\boldsymbol{\beta} + \varepsilon_i, \qquad i = 1, \ldots, n$$
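where T denotes the transpose, so that $\mathbf{x}_i^{T}\boldsymbol{\beta}$ is the inner product between vectors $\mathbf{x}_i$ and $\boldsymbol{\beta}$.
Often these n equations are stacked together and written in matrix notation as
$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}.$$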
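Written out, the stacked quantities in the matrix form take the following shape. The leading column of ones in X, which carries the intercept $\beta_0$, is the usual convention and is assumed here rather than stated in the text.

```latex
\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix}
1 & x_{11} & \cdots & x_{1p} \\
1 & x_{21} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & \cdots & x_{np}
\end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad
\boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}
```

Row i of X multiplied by $\boldsymbol{\beta}$ reproduces the right-hand side of the i-th scalar equation, so the matrix form is simply the n equations written at once.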
The very simplest case of a single scalar predictor variable x and a single scalar response
variable y is known as simple linear regression. The extension to multiple and/or vector-
valued predictor variables (denoted with a capital X) is known as multiple linear
regression, also known as multivariable linear regression (not to be confused with
multivariate linear regression).
Multiple linear regression is a generalization of simple linear regression to the case of
more than one independent variable, and a special case of general linear models,
restricted to one dependent variable. The basic model for multiple linear regression is
$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i$$
for each observation i = 1, ..., n.
In the formula above we consider n observations of one dependent variable and p
independent variables. Thus, $Y_i$ is the ith observation of the dependent variable, and $x_{ij}$ is the ith
observation of the jth independent variable, j = 1, 2, ..., p. The values $\beta_j$ represent
parameters to be estimated, and $\varepsilon_i$ is the ith independent identically distributed normal
error.
In the more general multivariate linear regression, there is one equation of the above
form for each of m > 1 dependent variables that share the same set of explanatory
variables and hence are estimated simultaneously with each other:
$$Y_{ij} = \beta_{0j} + \beta_{1j} x_{i1} + \cdots + \beta_{pj} x_{ip} + \varepsilon_{ij}$$
for all observations indexed as i = 1,.... , n and for all dependent variables indexed as j =
1, ...., m.
Nearly all real-world regression models involve multiple predictors, and basic
descriptions of linear regression are often phrased in terms of the multiple regression
model. Note, however, that in these cases the response variable y is still a scalar. Another
term, multivariate linear regression, refers to cases where y is a vector, i.e., the same as
general linear regression.
1.4.1 Type of loss in a linear model:
L1 loss: This is the absolute difference between the predicted and actual values. Averaged
over all N observations, it is called the mean absolute error (MAE).
The model calculates the absolute error for every observation and adds them up to find the
total L1 loss. The formula for the MAE is shown below.
$$MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$
where $\hat{y}_i$ is the predicted value of y for the ith observation
and $y_i$ is the actual value of y.
L2 loss: In this loss, we take the average of the squared differences between the predicted
and actual values. It is also known as the mean squared error (MSE). The formula for the L2
loss is shown below.
$$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
where $\hat{y}_i$ is the predicted value of y for the ith observation
and $y_i$ is the actual value of y.
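RMSE: The root mean squared error is the square root of the L2 loss, i.e. of the MSE, so it
expresses the error in the same units as y. The formula for the RMSE is shown below.
$$RMSE = \sqrt{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$$
where $\hat{y}_i$ is the predicted value of y for the ith observation
and $y_i$ is the actual value of y.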
R-squared: It measures how well the line fitted by the model matches the actual values
of the data. The value typically ranges from 0 to 1, and a value close to 1 indicates a
well-fitted line. The formula is shown below.
$$R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}$$
where $\hat{y}_i$ is the predicted value of y, $y_i$ is the actual value of y,
and $\bar{y}$ is the mean value of y.
Note: When the data contain outliers, the L1 loss is the safer choice, because the L2 loss
squares each error and so gives outliers a much larger loss value. Alternatively, we can
first remove the outliers and then use the L2 loss.
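The four measures above can be computed directly from their definitions. This is a minimal sketch in plain Python; the function names and the sample actual/predicted values are assumptions made for illustration only.

```python
import math


def mae(actual, predicted):
    """Mean absolute error (average L1 loss)."""
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error (average L2 loss)."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))


def r_squared(actual, predicted):
    """1 minus the ratio of residual sum of squares to total sum of squares."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


# Hypothetical values, chosen only to make the arithmetic easy to follow.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
print(mae(actual, predicted), mse(actual, predicted))
print(rmse(actual, predicted), r_squared(actual, predicted))
```

Two of the four predictions are off by one unit, so both the MAE and the MSE come to 0.5 here; on data with larger errors the MSE grows faster than the MAE, which is the outlier sensitivity described in the note above.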
Learning Rate:
Alpha (α) is the learning rate in the gradient descent update formula. Its function is to
control the step size with which gradient descent moves towards the minimum. The value
of alpha should be chosen carefully: if it is too large the updates can overshoot the
minimum, and if it is too small they take a long time to reach it.
$$\theta_{new} = \theta_{old} - \alpha \frac{\partial L}{\partial \theta_{old}}$$
1.4.2 Gradient Descent:
To update the θ1 and θ2 values so as to reduce the cost function (minimizing the RMSE
value) and achieve the best-fit line, the model uses gradient descent. The idea is to start
with random θ1 and θ2 values and then iteratively update them until the cost reaches its
minimum.
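The update loop can be sketched as follows. This is an illustrative example, not the project's implementation: it assumes the MSE as the cost (which has the same minimiser as the RMSE and a simpler gradient), starts from zero rather than random values so that the run is reproducible, and uses made-up data and hyperparameters.

```python
def gradient_descent(xs, ys, alpha=0.05, epochs=2000):
    """Fit y ~ theta1 + theta2 * x by batch gradient descent on the MSE."""
    theta1, theta2 = 0.0, 0.0  # assumption: fixed start instead of random values
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every observation under the current parameters.
        errors = [theta1 + theta2 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to theta1 and theta2.
        grad1 = 2.0 / n * sum(errors)
        grad2 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        # theta_new = theta_old - alpha * dL/dtheta_old
        theta1 -= alpha * grad1
        theta2 -= alpha * grad2
    return theta1, theta2


# Hypothetical data generated from y = 1 + 2x, so the expected fit is known.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta1, theta2 = gradient_descent(xs, ys)
print(theta1, theta2)
```

With these settings the parameters settle close to the intercept 1 and slope 2 used to generate the data. A much larger alpha would make the updates diverge on this data, and a much smaller one would need many more epochs.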
1.4.3 One Hot Encoding:
Most machine learning algorithms cannot work with categorical data directly, so it needs
to be converted into numerical data. Datasets often contain columns with categorical
features (string values); for example, a Gender column will have categories such as Male
and Female. These labels have no specific order of preference, but once they are encoded
as numbers a machine learning model can misinterpret them as having some sort of
hierarchy.
One approach to this problem is label encoding, where we assign a numerical value to
each label, for example Male and Female mapped to 0 and 1. But this can add bias to our
model, as it will start giving higher preference to the Female label because 1 > 0, whereas
ideally both labels are equally important in the dataset. To deal with this issue we use the
One Hot Encoding technique.
In this technique, separate columns are prepared for the Male and Female labels. So,
wherever the label is Male, the value will be 1 in the Male column and 0 in the Female
column, and vice versa. Let’s understand with an example: consider data in which fruits,
their corresponding categorical values and their prices are given.
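A minimal sketch of one-hot encoding in plain Python is given here. The fruit names, prices and the helper name `one_hot_encode` are hypothetical and stand in for the example data, which is not reproduced in this report.

```python
def one_hot_encode(rows, column):
    """Replace `column` in each row with one 0/1 column per distinct category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical fruit data, used only to illustrate the technique.
fruits = [
    {"fruit": "apple", "price": 5},
    {"fruit": "mango", "price": 10},
    {"fruit": "apple", "price": 15},
    {"fruit": "orange", "price": 20},
]
encoded = one_hot_encode(fruits, "fruit")
for row in encoded:
    print(row)
```

Each row ends up with exactly one of the `fruit_*` columns set to 1, so no fruit is ranked above another the way it would be under label encoding.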
1.5 Objective & Problem Statement
Objective of the Project - The goal of this project is to create an efficient and
effective model that predicts the price of a used car as accurately as possible using the
Linear Regression algorithm. The prediction is based on attributes such as:
Brand or type of the car one prefers, like Ford or Hyundai
Model of the car, namely Ford Figo or Hyundai Creta
Year of manufacturing, like 2020 or 2021
Type of fuel, namely Petrol or Diesel
Number of kilometers the car has travelled
Problem Statement - It is easy for any company to price its new cars based on the
manufacturing and marketing costs involved. But when it comes to a used car, it is quite
difficult to define a price because it is influenced by various parameters like car brand,
year of manufacture etc. The goal of our project is to predict the best price for a
pre-owned car in the Indian market, based on previous data on sold cars, using Linear
Regression.
1.6 Purpose of Project
The used car market is an ever-rising industry, which has almost doubled its market
value in the last few years. The emergence of online portals such as CarDekho, Quikr,
Carwale, Cars24, and many others has facilitated the need for both the customer and the
seller to be better informed about the trends and patterns that determine the value of a
used car in the market. Machine learning algorithms can be used to predict the retail
value of a car, based on a certain set of features. The purpose of this project is to provide
car price prediction using machine learning without any human interference.
People buy and sell cars every day, yet there are limited facilities and applications
for getting an appropriate price for one’s car. This application can be used to get an
estimated value of the car.
1.7 Architecture Diagram
Fig 1.7 – Architecture of Linear Regression (M.L)
1.8 Project Goal
We are required to model the price of cars with the available independent
variables. It will be used by the management to understand how exactly the prices vary
with the independent variables. They can accordingly manipulate the design of the cars,
the business strategy etc. to meet certain price levels. Further, the model will be a good
way for management to understand the pricing dynamics of a new market.
CHAPTER -2
2. SYSTEM REQUIREMENT SPECIFICATIONS
2.1 What is SRS?
Software Requirement Specification (SRS) is the starting point of the software
development activity. As systems grew more complex, it became evident that the goals of
the entire system could not be easily comprehended. Hence the need for the requirement
phase arose. A software project is initiated by the client's needs. The SRS is the means
of translating the ideas in the minds of the clients (the input) into a formal document
(the output of the requirement phase).
The SRS phase consists of two basic activities:
Problem/Requirement Analysis:
This activity is the harder and more nebulous of the two. It deals with understanding
the problem, the goals and the constraints.
Requirement Specification:
Here, the focus is on specifying what has been found during analysis. Issues such as
representation, specification languages and tools, and checking the specifications are
addressed during this activity.
The requirement phase terminates with the production of the validated SRS
document. Producing the SRS document is the basic goal of this phase.
2.1.1 Role of SRS:
The purpose of the Software Requirement Specification is to reduce the
communication gap between the clients and the developers. The Software Requirement
Specification is the medium through which the client and user needs are
accurately specified. It forms the basis of software development. A good SRS should
satisfy all the parties involved in the system.
2.2 Requirements Specification Document
A Software Requirements Specification (SRS) is a document that describes the
nature of a project, software or application. In simple words, an SRS document is a manual
of a project, provided it is prepared before you kick-start the project or application. This
document is also known as an SRS report or software document. A software
document is primarily prepared for a project, software or any kind of application.
There is a set of guidelines to be followed while preparing the software requirement
specification document. These cover the purpose, scope, functional and non-functional
requirements, and software and hardware requirements of the project. In addition, the
document contains information about the environmental conditions required, safety and
security requirements, software quality attributes of the project etc.
The purpose of the SRS (Software Requirement Specification) document is to describe the
external behavior of the application or software being developed. It defines the operations,
performance, interfaces and quality assurance requirements of the application or
software. The complete software requirements for the system are captured by the SRS.
This section introduces the requirement specification document for Car Price Prediction
using Linear Regression, which lists functional as well as non-functional requirements.
2.2 Functional Requirements
For documenting the functional requirements, the set of functionalities supported by
the system are to be specified. A function can be specified by identifying the state at
which data is to be input to the system, its input data domain, the output domain, and the
type of processing to be carried on the input data to obtain the output data. Functional
requirements define specific behavior or function of the application. Following are the
functional requirements:
FR1) After registration, the user's details should be stored in MySQL.
FR2) Entering login details should show the user’s data.
FR3) The login page should redirect to the next page (home).
FR4) The attributes should be shown after redirecting to the home page.
FR5) After the attributes are entered, the price prediction should be shown.
2.3 Non-Functional Requirements
A non-functional requirement is a requirement that specifies criteria that can be used
to judge the operation of a system, rather than specific behaviors. In particular, these are
the constraints the system must work within. Following are the non-functional
requirements:
NFR 1) The system must work properly, without bugs.
NFR 2) There should not be any lag in showing the price.
NFR 3) The database should access the proper user data.
NFR 4) Attributes must be displayed properly to the user.
2.3.1 Performance:
The performance of the developed application can be assessed by measurement.
Measuring enables you to identify how the performance of your application
stands in relation to your defined performance goals and helps you to identify the
bottlenecks that affect your application performance. It helps you identify whether your
application is moving toward or away from your performance goals. Defining what you
will measure, that is, your metrics, and defining the objectives for each metric is a
critical part of your testing plan.
Performance objectives include the following:
Response time, latency, throughput and resource utilization.
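As a rough sketch of how a response-time metric could be measured against a goal, the snippet below times repeated calls to a prediction function. The `predict_price` stand-in, its input field and the 0.5 second goal are all hypothetical; they are not taken from the project.

```python
import time


def predict_price(features):
    """Hypothetical stand-in for the real model; returns a dummy price."""
    return 100000.0 + 1000.0 * features["year_offset"]


def measure_latency(func, arg, runs=100):
    """Return the mean wall-clock time of func(arg), in seconds, over `runs` calls."""
    start = time.perf_counter()
    for _ in range(runs):
        func(arg)
    elapsed = time.perf_counter() - start
    return elapsed / runs


mean_seconds = measure_latency(predict_price, {"year_offset": 3})
within_goal = mean_seconds < 0.5  # assumed response-time goal of 0.5 s
print(f"mean latency: {mean_seconds:.6f} s, within goal: {within_goal}")
```

Averaging over many runs smooths out timer noise; in a real test plan the same measurement would be taken on the deployed application rather than on a stand-in function.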
2.4 Software Requirements
Operating System : Windows 10/11 or macOS.
Platform : Google Colab, PyCharm IDE
Programming Language : Python, SQL
2.5 Hardware Requirements
Processor : Intel Core i3 and above.
Hard Disk : 1 TB or above.
RAM : 4 GB or above.
Internet : 1 Mbps or above (Wireless).
What is STLC?
The process of testing software in a well-planned and systematic way is known
as the software testing life cycle (STLC). Different organizations have different
phases in their STLC; however, a generic STLC for the waterfall
development model consists of the following phases:
1. Requirements Analysis
2. Test Planning
3. Test Analysis
4. Test Design
5. Test Construction and Verification
6. Test Execution and Bug Reporting
7. Final Testing and Implementation
8. Post Implementation
2.6 Requirements Analysis
In this phase testers analyse the customer requirements and work with developers
during the design phase to see which requirements are testable and how they are going to
test those requirements. It is very important to start testing activities from the
requirements phase itself, because the cost of fixing a defect is much lower if it is found
in the requirements phase rather than in later phases. In this phase all the planning about
testing is done: what needs to be tested, how the testing will be done, the test strategy to
be followed, what the test environment will be, what test methodologies will be
followed, hardware and software availability, resources, risks etc. A high-level test plan
document is created which includes all the planning inputs mentioned above, and it is
circulated to the stakeholders.
2.7 Test Construction and Verification
In this phase testers prepare more test cases, keeping in mind the positive and
negative scenarios, end-user scenarios etc. All the test cases and automation scripts need
to be completed in this phase and reviewed by the stakeholders. The test plan
document should also be finalized and verified by reviewers.
2.8 Test Execution and Bug Reporting
Once unit testing is done by the developers and the test team gets the test build, the
test cases are executed and defects are reported in the bug tracking tool. After the test
execution is complete and all the defects are reported, test execution reports are created
and circulated to project stakeholders. After the developers fix the bugs raised by the
testers, they give another build with the fixes to the testers, who do re-testing and
regression testing to ensure that each defect has been fixed and has not affected any other
area of the software. Testing is an iterative process, i.e. if a defect is found and fixed,
testing needs to be done again after every defect fix. After the testers are assured that the
defects have been fixed and no more critical defects remain in the software, the build is
given for final testing.
2.9 Final Testing and Implementation
In this phase the final testing is done for the software; non-functional testing such as
stress, load and performance testing is performed in this phase. The software is also
verified in a production-like environment. Final test execution reports and
documents are prepared in this phase.
2.10 Post Implementation
In this phase the test environment is cleaned up and restored to its default state, the
process review meetings are held and the lessons learnt are documented. A document is
prepared to help cope with similar problems in future releases.
Phase | Activities | Outcome
Planning | Create high level test plan | Test plan, Refined Specification
Analysis | Create detailed test plan, Functional Validation Matrix, test cases | Revised Test Plan, Functional Validation Matrix, test cases
Design | Test cases are revised, select which test cases to automate | Revised test cases, test data sets, risk assessment sheet
Construction | Scripting of test cases to automate | Test procedures/Scripts, Drivers, test results, Bug reports
Testing cycles | Complete testing cycles | Test results, Bug reports
Final testing | Execute remaining stress and performance tests, complete documentation | Test results and different metrics on test efforts
Post implementation | Evaluate testing processes | Plan for improvement of testing process
Table 3.7 – Activities and Outcomes of each phase in SDLC
2.11 Technologies Used:
2.11.1 Google Colab
Colaboratory, or "Colab" for short, is a product from Google Research. Colab allows
anybody to write and execute arbitrary Python code through the browser, and is especially
well suited to machine learning, data analysis and education. More technically, Colab is a
hosted Jupyter notebook service that requires no setup to use, while providing access free
of charge to computing resources including GPUs.
Is Google Colab like Jupyter Notebook?
Google Colab's major differentiator from Jupyter Notebook is that it is cloud-based and
Jupyter is not. This means that if you work in Google Colab, you do not have to worry
about downloading and installing anything to your hardware.
Fig 3.8.1 – Google colab
2.11.2 PyCharm IDE
PyCharm is a dedicated Python Integrated Development Environment (IDE)
providing a wide range of essential tools for Python developers, tightly integrated to
create a convenient environment for productive Python, web, and data science
development.
JetBrains s.r.o. (formerly IntelliJ Software s.r.o.) is a Czech software development
company which makes tools for software developers and project managers. The company
offers integrated development environments (IDEs) for the programming languages Java,
Groovy, Kotlin, Ruby, Python, PHP, C, Objective-C, C++, C#, F#, Go, JavaScript, and
the domain-specific language SQL.
2.11.3 SQL
SQL (Structured Query Language) is a powerful and standard query language for
relational database systems. We use SQL to perform CRUD (Create, Read, Update,
Delete) operations on databases along with other various operations. SQL has evolved a
lot in the past decade.
RDBMS
RDBMS stands for Relational Database Management System. RDBMS is the basis for
SQL, and for all modern database systems such as MS SQL Server, IBM DB2, Oracle,
MySQL, and Microsoft Access. The data in an RDBMS is stored in database objects called
tables. A table is a collection of related data entries and it consists of columns and rows.
Although SQL is an ANSI/ISO standard, there are different versions of the SQL
language. However, to be compliant with the ANSI standard, they all support at least the
major commands (such as SELECT, UPDATE, DELETE, INSERT, WHERE) in a
similar manner.
MySQL, the most popular Open Source SQL database management system, is
developed, distributed, and supported by Oracle Corporation.
MySQL is a database management system.
A database is a structured collection of data. It may be anything from a simple shopping
list to a picture gallery or the vast amounts of information in a corporate network. To add,
access, and process data stored in a computer database, you need a database management
system such as MySQL Server. Since computers are very good at handling large amounts
of data, database management systems play a central role in computing, as standalone
utilities, or as parts of other applications.
Using SQL in Your Web Site
To build a web site that shows data from a database, you will need:
An RDBMS database program (e.g. MS Access, SQL Server, MySQL)
To use a server-side scripting language, like PHP or Python
To use SQL to get the data you want
To use HTML / CSS to style the page
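The CRUD operations described in this section can be sketched from Python as follows. The project stores its data in MySQL; this example instead uses Python's built-in sqlite3 module with an in-memory database so that it runs without a server, and the `users` table, its columns and the sample record are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

# Create: insert a row, passing values as parameters rather than string-formatting them.
cur.execute("INSERT INTO users (name, email) VALUES (?, ?)", ("Asha", "asha@example.com"))

# Read: fetch the row back.
cur.execute("SELECT name, email FROM users WHERE name = ?", ("Asha",))
row_after_insert = cur.fetchone()

# Update: change the stored email address.
cur.execute("UPDATE users SET email = ? WHERE name = ?", ("asha@new.example.com", "Asha"))
cur.execute("SELECT email FROM users WHERE name = ?", ("Asha",))
email_after_update = cur.fetchone()[0]

# Delete: remove the row and count what is left.
cur.execute("DELETE FROM users WHERE name = ?", ("Asha",))
cur.execute("SELECT COUNT(*) FROM users")
rows_left = cur.fetchone()[0]

conn.commit()
conn.close()
print(row_after_insert, email_after_update, rows_left)
```

The same INSERT, SELECT, UPDATE and DELETE statements carry over to MySQL, although a MySQL driver would be needed in place of sqlite3 and its parameter placeholder style may differ.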
2.11.4 Flask
Flask is a micro web framework written in Python. It is classified as a micro
framework because it does not require particular tools or libraries. It has no database
abstraction layer, form validation, or any other components where pre-existing third-party
libraries provide common functions.
CHAPTER -3
3. LITERATURE SURVEY
3.1 Paper work
Overfitting and underfitting come into the picture when we create our statistical
models. The models might be too biased towards the training data and might not perform
well on the test dataset. This is called overfitting. Likewise, the models might not take into
consideration all the variance present in the population and so perform poorly on a test
dataset. This is called underfitting. A balance needs to be achieved between these two,
which leads to the concept of the bias-variance tradeoff. Pierre Geurts has introduced and
explained how the bias-variance tradeoff is achieved in both regression and classification.
The selection of variables/attributes plays a vital role in influencing both the bias and
variance of the statistical model. Robert Tibshirani proposed a new method called Lasso,
which minimizes the residual sum of squares subject to a bound on the sum of the
absolute values of the coefficients. This returns a subset of attributes which
need to be included in multiple regression to get the minimal error rate. Similarly,
decision trees suffer from overfitting if they are not pruned/shrunk. Trevor Hastie and
Daryl Pregibon have explained the concept of pruning in their research paper. Moreover,
hypothesis testing using ANOVA is needed to verify whether the different groups of
errors really differ from each other. This is explained by TK Kim and Tae Kyun in their
paper. A post-hoc test needs to be performed along with ANOVA if the number of
groups exceeds two.
Tukey’s Test has been explored by Haynes W. in his research paper. Using these
techniques, we will create, train and test the effectiveness of our statistical models.
The paper "Predicting the price of Used Car Using Machine Learning Techniques"
investigates the application of supervised machine learning techniques to predict
the price of used cars in Mauritius. The predictions are based on historical data collected from
daily newspapers. Different techniques like multiple linear regression analysis, k-nearest
neighbors, naïve Bayes and decision trees have been used to make the predictions.
The paper "Car Price Prediction Using Machine Learning Techniques" examines a
considerable number of distinct attributes for reliable and accurate prediction. To build
a model for predicting the price of used cars in Bosnia and Herzegovina, the authors
applied three machine learning techniques (Artificial Neural Network, Support Vector
Machine and Random Forest).
The paper "Price Evaluation model in second hand car system based on BP neural
networks" proposes a price evaluation model based on big data analysis,
which takes advantage of widely circulated vehicle data and a large number of
vehicle transaction data to analyze the price data for each type of vehicle by using the
optimized BP neural network algorithm. It aims to establish a second-hand car price
evaluation model to get the price that best matches the car.
3.2 PROPOSED MODEL
Null Hypothesis
Even though pruning reduces the magnitude of overfitting, regression trees still suffer
from overfitting after pruning. This leads to the following hypothesis.
Hypothesis: Multiple and Lasso Regressions are better at predicting price than the
Regression Tree.
Training and Testing Data
The data is split into training (70% - 563 records) and testing (30% - 241 records) data
sets through random sampling (seed was set to 2786).
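A seeded 70/30 random split of this kind can be sketched with the standard library as shown here. The function name and the placeholder records are assumptions; also, Python's random generator with seed 2786 will not reproduce the exact split reported above unless the original split was made with the same tool.

```python
import random


def train_test_split(records, train_fraction=0.7, seed=2786):
    """Shuffle a copy of `records` with a fixed seed and cut it into two parts."""
    rng = random.Random(seed)      # fixed seed makes the split repeatable
    shuffled = list(records)       # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder record ids: 563 + 241 = 804 records in total.
records = list(range(804))
train, test = train_test_split(records)
print(len(train), len(test))
```

Running the function again with the same seed returns the same two subsets, which is what makes results on the test set comparable between experiments.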
Linear Regression
In statistics, linear regression is a linear approach for modelling the relationship between
a scalar response and one or more explanatory variables (also known as dependent and
independent variables). The case of one explanatory variable is called simple linear
regression; for more than one, the process is called multiple linear regression. This term
is distinct from multivariate linear regression, where multiple correlated dependent
variables are predicted, rather than a single scalar variable.
3.3 Related work
Researchers often predict the prices of products using previous data. Pudaruth did so
for cars in Mauritius, and these cars were not new but second-hand. He used multiple
linear regression, k-nearest neighbors, naïve Bayes and decision tree algorithms to
predict the prices. The comparison of prediction
results from these techniques showed that the prices from these methods are closely
comparable. However, it was found that the decision tree algorithm and the naïve Bayes
method were unable to classify and predict numeric values. Pudaruth’s research also
concluded that a limited number of instances in the data set does not offer high prediction
accuracy.
A multivariate regression model helps in classifying and predicting values of numeric
format. Kuiper used this model to predict the price of 2005 General Motors (GM) cars. The
price prediction of cars does not require any special knowledge, so data available
online, such as the data on www.pakwheels.com, is enough to predict prices. Kuiper
likewise performed car price prediction and introduced variable selection techniques which
helped in finding which variables are more relevant for inclusion in the model. He
encouraged students to use different models and to find out how checking model
assumptions works. Another similar study by Listiani uses Support Vector Machines (SVM)
to predict the prices of leased cars. This research showed that SVM is far more accurate in
predicting prices than multiple linear regression when a very large dataset
is available. SVM also handles high-dimensional data better and avoids both the
under-fitting and over-fitting issues. A genetic algorithm is used by Listiani to find
important features for SVM. However, the technique does not show, in terms of variance
and mean standard deviation, why SVM is better than simple multiple regression.
CHAPTER -4
4. SYSTEM DESIGN
4.1 Introduction to UML
The Unified Modeling Language allows the software engineer to express an analysis
model using a modeling notation that is governed by a set of syntactic, semantic and
pragmatic rules. A UML system is represented using five different views that describe
the system from distinctly different perspectives. Each view is defined by a set of
diagrams, as follows:
1. User Model View
This view represents the system from the users’ perspective. The analysis
representation describes a usage scenario from the end-users’ perspective.
2. Structural Model View
In this model, the data and functionality are viewed from inside the system. This
model view models the static structures.
3. Behavioral Model View
It represents the dynamic or behavioral aspects of the system, depicting the
interactions or collaborations between the various structural elements described in the
user model and structural model views.
4. Implementation Model View
In this view, the structural and behavioral aspects of the system are represented
as they are to be built.
5. Environmental Model View
In this view, the structural and behavioral aspects of the environment in which
the system is to be implemented are represented.
4.2 UML Diagrams
4.2.1 Use Case Diagram
To model a system, the most important aspect is to capture its dynamic behavior.
Dynamic behavior means the behavior of the system when it is
running/operating.
So static behavior alone is not sufficient to model a system; dynamic
behavior is more important than static behavior. In UML there are five diagrams
available to model dynamic nature, and the use case diagram is one of them. Since
the use case diagram is dynamic in nature, there should
be some internal or external factors for making the interaction.
These internal and external agents are known as actors. So use case diagrams
consist of actors, use cases and their relationships. The diagram is used to
model the system/subsystem of an application. A single use case diagram
captures a particular functionality of a system. So to model the entire system,
a number of use case diagrams are used.
Use case diagrams are used to gather the requirements of a system, including
internal and external influences. These requirements are mostly design
requirements. So when a system is analysed to gather its functionalities, use
cases are prepared and actors are identified. In brief, the purposes of use case
diagrams are as follows:
a. Used to gather requirements of a system.
b. Used to get an outside view of a system.
c. Identify external and internal factors influencing the system.
d. Show the interactions among the requirements and actors.
Fig 4.2.1 – Use Case Diagram
4.2.2 Sequence Diagram
Sequence diagrams describe interactions among classes in terms of an exchange
of messages over time. They're also called event diagrams. A sequence diagram is a good
way to visualize and validate various runtime scenarios. These can help to predict how a
system will behave and to discover responsibilities a class may need to have in the
process of modelling a new system.
The aim of a sequence diagram is to define event sequences, which would have a desired
outcome. The focus is more on the order in which messages occur than on the message
per se. However, the majority of sequence diagrams will communicate what messages
are sent and the order in which they tend to occur.
Basic Sequence Diagram Notations
Class Roles or Participants
Class roles describe the way an object will behave in context. Use the UML object
symbol to illustrate class roles, but don't list object attributes.
Activation or Execution Occurrence
Activation boxes represent the time an object needs to complete a task. When an object
is busy executing a process or waiting for a reply message, use a thin grey rectangle
placed vertically on its lifeline.
Fig 4.2.2 – Sequence Diagram
4.2.3 Class Diagram
Class diagrams are the main building blocks of every object-oriented method. The
class diagram can be used to show the classes, relationships, interfaces, associations, and
collaborations. UML is standardized in class diagrams. Since classes are the building
blocks of an application that is based on OOP, the class diagram has an appropriate
structure to represent the classes, inheritance, relationships, and everything that OOP
has in its context. It describes various kinds of objects and the static relationships
between them. Each class is represented by a rectangle subdivided into three compartments:
name, attributes and operations.
The main purposes of using class diagrams are:
1. This is the only UML diagram which can appropriately depict various aspects of
the OOP concept.
2. Proper design and analysis of an application can be faster and more efficient.
3. It is the base for the deployment and component diagrams.
Figure 4.2.3 Class Diagram
4.2.4 System Design
A software module is the lowest level of design granularity in the system.
Depending on the software development approach, there may be one or more modules per
system. This section should provide enough detailed information about logic and data
necessary to completely write source code for all modules in the system (and/or integrate
COTS software programs).
If there are many modules or if the module documentation is extensive, place it in an
appendix or reference a separate document. Add additional diagrams and information, if
necessary, to describe each module, its functionality, and its hierarchy. Industry-standard
module specification practices should be followed. Include the following information in
the detailed module designs:
A narrative description of each module, its function(s), the conditions under which
it is used (called or scheduled for execution), its overall processing, logic,
interfaces to other modules, interfaces to external systems, security requirements,
etc.; explain any algorithms used by the module in detail
For COTS packages, specify any call routines or bridging programs to integrate the
package with the system and/or other COTS packages (for example, Dynamic Link
Libraries)
Data elements, record structures, and file structures associated with module input
and output
Graphical representation of the module processing, logic, flow of control, and
algorithms, using an accepted diagramming approach (for example, structure
charts, action diagrams, flowcharts, etc.)
Data entry and data output graphics; define or reference associated data elements;
if the project is large and complex or if the detailed module designs will be
incorporated into a separate document, then it may be appropriate to repeat the
screen information in this section
Report layout
Figure 4.2.4 System Design
4.2.5 State Chart Diagram
The name of the diagram itself clarifies its purpose: it describes the different
states of a component in a system. The states are specific to a component or object
of the system.
A Statechart diagram describes a state machine. A state machine can be defined as a
machine that defines the different states of an object, where these states are controlled by
external or internal events.
An activity diagram, explained in the next chapter, is a special kind of Statechart diagram.
Because a Statechart diagram defines states, it is used to model the lifetime of an object.
4.2.5.1 How to Draw a Statechart Diagram?
A Statechart diagram is used to describe the states of different objects over their
life cycles. Emphasis is placed on the state changes triggered by internal or external events.
These states must be analyzed carefully so that they can be implemented accurately.
Statechart diagrams are very important for describing states. A state can be identified
as the condition of an object when a particular event occurs.
Before drawing a Statechart diagram we should clarify the following points −
Identify the important objects to be analyzed.
Identify the states.
Identify the events.
The following is an example of a Statechart diagram in which the state of an Order object is
analyzed.
The first state is an idle state from which the process starts. The next states are reached on
events such as send request, confirm request, and dispatch order. These events are
responsible for the state changes of the order object.
During its life cycle, an object (here the order object) goes through the following states,
and there may be some abnormal exits. An abnormal exit may occur due to a
problem in the system. When the entire life cycle is complete, it is considered a
complete transaction, as shown in the following figure. The initial and final states of the
object are also shown in the following figure.
Figure 4.2.5 State Chart Diagram
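The Order life cycle described above can be sketched as a small table-driven state machine. The state names, the event names, and the `aborted` state used for an abnormal exit are assumptions chosen to mirror the prose; they are not read from the figure.

```python
# Hypothetical state and event names modelled on the Order example in the text.
TRANSITIONS = {
    ("idle", "send_request"): "request_sent",
    ("request_sent", "confirm_request"): "confirmed",
    ("confirmed", "dispatch_order"): "dispatched",
}


class Order:
    def __init__(self):
        self.state = "idle"  # initial state

    def handle(self, event):
        """Apply an event; an event not allowed in the current state is an abnormal exit."""
        key = (self.state, event)
        if key in TRANSITIONS:
            self.state = TRANSITIONS[key]
        else:
            self.state = "aborted"  # abnormal exit
        return self.state


order = Order()
for event in ("send_request", "confirm_request", "dispatch_order"):
    order.handle(event)
print(order.state)  # dispatched

failed = Order()
failed.handle("dispatch_order")  # not valid while the order is idle
print(failed.state)  # aborted
```

Keeping the transitions in one dictionary means that each arrow in the diagram corresponds to exactly one entry, which makes the code easy to check against the drawing.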
CHAPTER -5
5. IMPLEMENTATION
5.1 Pseudo Code
Step 1: Import the required packages.
Step 2: Download the dataset and link it to Google Colab.
Step 3: Read the dataset and perform operations on data.
Step 4: Data cleaning.
Step 5: Data Preprocessing.
Step 6: Save the cleaned car dataset after performing operations on the data.
Step 7: Start training the Machine learning Model.
Step 8: Split features and target as x and y respectively.
Step 9: Split the new data into 80% of Training data and 20% of Testing data.
Step 10: Train the model on the Training data and evaluate it on the Testing data.
Step 11: Apply a one hot encoder and column transformer to the categorical features.
Step 12: Applying Linear Regression to the model.
Step 13: Fit the Linear Regression Model.
Step 14: If the accuracy is good, use the model for prediction; otherwise fit the model again
using other random states.
Step 15: Dump the Linear Regression model to a file using pickle.
Step 16: Open PyCharm and copy the cleaned car.csv and LinearRegressionModel.pkl
files into the project.
Step 17: Read the model and dataset, and make predictions from the web page using Python
and Flask.
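Steps 9 to 14 amount to a search over random train/test splits: fit on 80% of the rows, score on the remaining 20%, and keep the random state with the best score. The sketch below illustrates that loop using only the standard library, with a one-feature least-squares fit standing in for the scikit-learn pipeline. The toy data and the helper names `split`, `fit_line`, and `r2` are assumptions for illustration and are not the project's code.

```python
import random

# Toy data (hypothetical): car age in years -> price in thousands.
ages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
prices = [520, 480, 455, 400, 390, 340, 300, 290, 240, 210]
rows = list(zip(ages, prices))


def split(rows, test_size, seed):
    """Shuffle with a fixed seed and split into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(train):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2(test, slope, intercept):
    """Coefficient of determination on the held-out rows."""
    mean_y = sum(y for _, y in test) / len(test)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in test)
    ss_tot = sum((y - mean_y) ** 2 for _, y in test)
    return 1 - ss_res / ss_tot


# Try several random states and remember the score of each split.
scores = []
for seed in range(50):
    train, test = split(rows, test_size=0.2, seed=seed)
    slope, intercept = fit_line(train)
    scores.append(r2(test, slope, intercept))

best_seed = max(range(len(scores)), key=scores.__getitem__)
print(best_seed, round(scores[best_seed], 3))
```

Because the split is chosen by its test score, the best score tends to overstate how well the model generalises, so it is better read as an upper bound than as a typical result.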
5.2 Google Colab Dataset Implementation:
import pandas as pd
car=pd.read_csv("https://raw.githubusercontent.com/rajtilakls2510/car_price_predictor/master/quikr_car.csv")
car.shape
(892, 6)
car.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 6 columns):
# Column Non-Null Count Dtype
0 name 892 non-null object
1 company 892 non-null object
2 year 892 non-null object
3 Price 892 non-null object
4 kms_driven 840 non-null object
5 fuel_type 837 non-null object
dtypes: object(6)
memory usage: 41.9+ KB
car['year'].unique()
pipe=make_pipeline(column_trans,lr)
pipe.fit(x_train,y_train)
y_pred=pipe.predict(x_test)
scores.append(r2_score(y_test,y_pred))
import numpy as np
np.argmax(scores)
906
scores[np.argmax(scores)]
0.7768125045875028
#Training the model using highest r2_score
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=np.argmax(scores))
lr=LinearRegression()
pipe=make_pipeline(column_trans,lr)
pipe.fit(x_train,y_train)
y_pred=pipe.predict(x_test)
r2_score(y_test,y_pred)
0.8456515104452564
#predicting the price by taking input features
pipe.predict(pd.DataFrame([['Maruti Suzuki Swift','Maruti',2019,100,'Petrol']],
columns=['name','company','year','kms_driven','fuel_type']))
#prediction
array([459113.49353657])
# Dump the fitted pipeline to LinearRegressionModel.pkl with pickle for use by the web application
import pickle
with open('LinearRegressionModel.pkl','wb') as f:
    pickle.dump(pipe,f)
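The pickle file written here is what the web application later reads back. A minimal round-trip sketch follows, using a plain dictionary as a stand-in for the fitted pipeline and a temporary directory so that nothing is left on disk. The dictionary contents are made-up values, not results from the project.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline: any picklable Python object works the same way.
model = {"slope": -34.5, "intercept": 552.0}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "LinearRegressionModel.pkl")

    # Write: the context manager closes the file even if dump() raises.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Read the object back, as an application would at start-up.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)  # True
```

Unpickling can execute arbitrary code, so a `.pkl` file should only be loaded when it comes from a trusted source.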
<script>
function load_car_models(company_id,car_model_id)
{
var company= document.getElementById(company_id);
var car_model= document.getElementById(car_model_id);
car_model.value="";
car_model.innerHTML="";
{% for company in companies %}
if(company.value == "{{company}}" )
{
{% for model in car_models %}
{% if company in model %}
var newOption = document.createElement("option");
newOption.value="{{ model }}";
newOption.innerHTML="{{ model }}";
car_model.options.add(newOption);
{% endif %}
{% endfor %}
}
{% endfor %}
}
function form_handler(event)
{
event.preventDefault();
}
function send_data()
{
document.querySelector('form').addEventListener('submit', form_handler);
var fd= new FormData(document.querySelector('form'));
var xhr=new XMLHttpRequest();
xhr.open('POST', '/predict', true);
document.getElementById("prediction").innerHTML="wait! predicting price...";
xhr.onreadystatechange= function()
{
if(xhr.readyState == XMLHttpRequest.DONE)
{
document.getElementById("prediction").innerHTML="The Predicted Price is: "+
xhr.responseText + " Rs/-";
}
}
xhr.onload=function(){};
xhr.send(fd);
}
</script>
<!-- Optional JavaScript -->
<!-- jQuery first, then Popper.js, then Bootstrap JS -->
<script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/popper.js@1.14.3/dist/umd/popper.min.js" integrity="sha384-ZMP7rVo3mIykV+2+9J3UJ46jBk0WLaUAdn689aCwoqbBJiSnjAK/l8WvCWPIPm49" crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/bootstrap@4.1.3/dist/js/bootstrap.min.js" integrity="sha384-ChfqqxuZUCnJSK3+MXmPNIyE6ZbWh2IMqE241rYiqJxyMiZ6OW/JmZQ5stwEULTy" crossorigin="anonymous"></script>
</body>
</html>
1. App.py
import pandas as pd
#from flask import Flask, render_template, request, url_for,redirect,session
import pickle
import numpy as np
from flask import *
import flask_login
import os
from num2words import num2words
import mysql.connector
# Load the trained pipeline and the cleaned dataset once at start-up
model=pickle.load(open("LinearRegressionModel.pkl",'rb'))
car=pd.read_csv("cleaned car.csv")
app=Flask(__name__)
app.secret_key=os.urandom(24)
# MySQL connection and cursor
conn=mysql.connector.connect(
    host='localhost',
    user='root',
    password='Password123@',
    port='3306',
    database='database'
)
mycursor=conn.cursor()
@app.route('/')
def login():
    # Users with an active session go straight to the home page
    if 'user_id' in session:
        return redirect('/home')
    else:
        return render_template('login.html')
@app.route('/register')
def register():
    return render_template('register.html')