University Politehnica of Bucharest
Faculty of Automatic Control and Computers
Computer Science Department
Using Machine Learning to Generate
Predictions Based on the Information
Extracted from Automobile Ads
Stere Caciandone and Costin-Gabriel CHIRU
costin.chiru@cs.pub.ro
Introduction
• Purpose: an application that uses ML to estimate
the fair resale price of cars based on ads
extracted from popular car-sale websites.
• Background: the Romanian market is dominated by
previously owned cars (e.g., Jan-Feb 2014:
30,600 previously owned cars and only 8,770
new cars): VW 26%, Opel 18%, Ford 12%, BMW
• Application: better inform the people involved
in the process (owners and potential buyers) whether
the asking price reflects the car's value
07.09.2016 AIMSA 2016 1
Similar Approaches
• ML used for various tasks: determine the future
value of goods (especially gold, oil and stocks),
forecast the weather, predict the outcome of
different sport events, etc.
• Predict the resale value of cars:
– Chen: investigated the US market to predict the
price of the Toyota Corolla using linear regression; 9.2% error
– Pudaruth, 2014: vehicles from Mauritius; Multiple
Linear Regression, K-Nearest Neighbors, Decision Trees
and Naïve Bayes → small data, poor quality = bad
results
– Voß & Lessmann, 2013: compared linear with non-linear
learning algorithms → better to use random forest
Methodology
• Extract car-sale data from ads posted on
autovit.ro (the largest online ad platform for car resale in
Romania)
• Analyze the ads to determine the links between the price
of a car and its features using ML:
– multiple linear regression and
– random forest.
• Use feature selection to obtain the best
predictions.
• Show the results in a web interface that lets the user:
– see the car price predicted using ML (for buyers)
– query the system about the price for a particular car
model (for sellers)
– receive links to “good” ads at his/her e-mail address
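The slides do not say what makes an ad “good”. As a minimal sketch, assume an ad qualifies when its asking price is at least a chosen fraction below the model's prediction. The function names, the `link` and `price` keys, the 10% threshold and the stand-in predictor are all hypothetical.

```python
def is_good_ad(asking_price, predicted_price, min_discount=0.10):
    """Return True when the asking price is at least `min_discount` below the prediction.

    The 10% default is an illustrative assumption, not a value from the slides.
    """
    if predicted_price <= 0:
        raise ValueError("predicted_price must be positive")
    return asking_price <= predicted_price * (1.0 - min_discount)


def select_good_ads(ads, predict, min_discount=0.10):
    """Collect the links of ads priced well below the model's prediction."""
    return [
        ad["link"]
        for ad in ads
        if is_good_ad(ad["price"], predict(ad), min_discount)
    ]


# Hypothetical ads and a stand-in predictor that always returns 10,000.
ads = [
    {"link": "ad-1", "price": 8000.0},
    {"link": "ad-2", "price": 12000.0},
]


def flat_predictor(ad):
    return 10000.0


good_links = select_good_ads(ads, flat_predictor)
```

In a real system the `predict` argument would be whichever trained model is in use, and the selected links would then be e-mailed to the user.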
The Data
• Mined the content of the autovit.ro website using the
Scrapy web crawler → over 16,000 car-sale ads
• Cleaned the data by removing:
– ads that did not contain all the necessary data (e.g.
mileage in kilometers, horsepower)
– outliers (ads for damaged cars sold for
parts or ads containing typos)
• In the end the database consisted of 15,500 ads, each
entry holding the following information about the car:
price, year of manufacture, mileage, horsepower,
engine capacity, fuel type, model, transmission, Euro
emission standard, color, number of doors, the link to the page
containing the image of the car and the link to
the sale ad
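The two cleaning rules above can be sketched as simple filters. This is only an illustration: the field names and the plausibility thresholds are assumptions, since the slides do not state how outliers were detected.

```python
# Hypothetical field names for the attributes every ad must provide.
REQUIRED_FIELDS = ("price", "year", "mileage_km", "horsepower")


def has_required_fields(ad):
    """True when none of the required attributes is missing."""
    return all(ad.get(field) is not None for field in REQUIRED_FIELDS)


def is_plausible(ad, min_price=500, max_mileage_km=1_000_000):
    """Crude outlier check; both thresholds are illustrative assumptions."""
    return ad["price"] >= min_price and 0 <= ad["mileage_km"] <= max_mileage_km


def clean_ads(ads):
    """Drop incomplete ads first, then drop implausible ones."""
    return [ad for ad in ads if has_required_fields(ad) and is_plausible(ad)]


# Made-up sample ads, not data from the study.
raw_ads = [
    {"price": 7500, "year": 2008, "mileage_km": 180000, "horsepower": 140},
    {"price": 7000, "year": 2009, "mileage_km": None, "horsepower": 105},      # missing mileage
    {"price": 150, "year": 2004, "mileage_km": 250000, "horsepower": 130},     # likely sold for parts
    {"price": 9000, "year": 2010, "mileage_km": 15000000, "horsepower": 140},  # likely a typo
]
cleaned = clean_ads(raw_ads)
```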
Case Study – VW Passat (1)
• First, we needed to identify the relevant features of a
car
• Then, build the model using these features
• Finally, evaluate the predictive quality of the model
• Used VW Passat (820 ads) for these steps.
• Analyzed: the age of the car, mileage, horsepower,
engine capacity, fuel type (petrol or diesel) and the
type of transmission (automatic, manual):
• Price = 16240 + Age * (-974.86) + No. km * (-0.04) +
Horsepower * 54.29 + Diesel fuel * 887.53 +
Petrol fuel * (-887.53) + Manual transm. * (-646.88) +
Automatic transm. * 424.31
• Average accuracy was 70.9% with 10-fold
cross-validation
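As a worked example, the regression formula above can be evaluated directly. The sketch assumes the fuel and transmission terms are 0/1 indicator variables and that age is in years and mileage in kilometers; the function name and the sample car are hypothetical.

```python
def predict_passat_price(age_years, mileage_km, horsepower, is_diesel, is_automatic):
    """Evaluate the initial VW Passat linear model with the slide's coefficients.

    Assumes fuel and transmission enter as 0/1 indicators, so exactly one
    fuel term and one transmission term contributes to the price.
    """
    price = 16240.0
    price += age_years * -974.86
    price += mileage_km * -0.04
    price += horsepower * 54.29
    price += 887.53 if is_diesel else -887.53      # diesel vs. petrol
    price += 424.31 if is_automatic else -646.88   # automatic vs. manual
    return price


# Hypothetical car: 5 years old, 100,000 km, 140 hp, diesel, manual gearbox.
estimate = predict_passat_price(5, 100_000, 140, is_diesel=True, is_automatic=False)
```

For this sample car the terms add up to 16240 - 4874.30 - 4000 + 7600.60 + 887.53 - 646.88 = 15206.95.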
Case Study – VW Passat (2)
Case Study – VW Passat (3)
• Improve the model's accuracy through
feature selection and feature engineering:
– Price = 23 055 + Age * (-1108.55) + No. km * (-0.04)
→ 67.8% accuracy
– Add horsepower → 71.2% accuracy
– Add transmission type → 71.4% accuracy:
Price = 15 040 + Age * (-1047.05) + No. km * (-0.038) +
Horsepower * 49.3 + Automatic transm. * 974.07 +
Manual transm. * 19.01
• Best model: used only 4 features (age, mileage,
horsepower and type of transmission), but with
log(age) and log(mileage): 82.3% accuracy
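The log transformation in the best model can be sketched as follows: build a feature row with log(age) and log(mileage), then fit ordinary least squares. This is a minimal illustration on synthetic, noise-free data generated from made-up weights, not the authors' data or coefficients; it also assumes age and mileage are strictly positive so the logarithm is defined.

```python
import itertools
import math


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_least_squares(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    n_cols = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n_cols)] for i in range(n_cols)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(n_cols)]
    return solve_linear_system(xtx, xty)


def engineered_features(age_years, mileage_km, horsepower, is_automatic):
    """Intercept, log(age), log(mileage), horsepower, automatic-gearbox indicator."""
    return [1.0, math.log(age_years), math.log(mileage_km),
            float(horsepower), 1.0 if is_automatic else 0.0]


def predict(weights, features):
    return sum(w * f for w, f in zip(weights, features))


# Synthetic ads built from made-up weights (an assumption for illustration only).
TRUE_WEIGHTS = [30000.0, -4000.0, -1500.0, 40.0, 900.0]
grid = itertools.product([1, 3, 6, 10], [20000, 90000, 200000], [105, 140, 170], [False, True])
rows = [engineered_features(*combo) for combo in grid]
prices = [predict(TRUE_WEIGHTS, row) for row in rows]
fitted_weights = fit_least_squares(rows, prices)
```

Because the synthetic prices are exactly linear in the transformed features, the fit recovers the made-up weights. On real ads the fit would not be exact.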
Case Study – VW Passat (4)
• For model validation → apply the obtained
models to other brands / models of cars
(again using 10-fold cross-validation):
– VW Golf (945 ads): initial: 77%, improved: 86.6%
– Opel Astra (632 ads): initial: 75.1%, improved:
81.1%
– Audi A4 (509 ads): initial: 78.2%, improved: 87%
– Ford Focus (474 ads): initial: 69.8%, improved:
82.7%
– Skoda Octavia (398 ads): initial: 65.8%, improved:
75.1%
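A 10-fold cross-validation loop of the kind mentioned above can be sketched as below. The slides do not define "accuracy"; here it is assumed to be 100% minus the mean absolute percentage error. The one-feature model, the fold assignment and the synthetic data are likewise illustrative assumptions.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx); fold i holds out every k-th sample starting at i."""
    for fold in range(k):
        test_idx = list(range(fold, n_samples, k))
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        yield train_idx, test_idx


def fit_simple_regression(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def accuracy_percent(actual, predicted):
    """Assumed metric: 100 * (1 - mean absolute percentage error)."""
    errors = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 100.0 * (1.0 - sum(errors) / len(errors))


def cross_validated_accuracy(xs, ys, k=10):
    """Average the held-out accuracy over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        intercept, slope = fit_simple_regression(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx]
        )
        predicted = [intercept + slope * xs[i] for i in test_idx]
        scores.append(accuracy_percent([ys[i] for i in test_idx], predicted))
    return sum(scores) / len(scores)


# Synthetic ages and prices with a small deterministic perturbation (not real ads).
ages = [1 + (i % 12) for i in range(60)]
prices = [20000.0 - 1000.0 * age + 200.0 * ((i % 7) - 3) for i, age in enumerate(ages)]
score = cross_validated_accuracy(ages, prices, k=10)
```

Real ads would normally be shuffled before being split into folds; the fixed stride here just keeps the sketch deterministic.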
Results
Improved Multiple Linear Regression Algorithm
Brand, model      Average   Brand, model      Average   Brand, model       Average
Audi A3           88.5%     BMW X5            78.4%     Skoda Fabia        90.4%
Audi A4           87%       BMW X6            72.1%     Opel Astra         81.1%
Audi A5           77.3%     Ford Focus        82.7%     Opel Corsa         89.6%
Audi A6           84.2%     Ford Fiesta       83.1%     Mercedes C class   85.6%
BMW 3 Series      76.3%     Ford Mondeo       78.7%     Mercedes S class   82.1%
BMW 5 Series      74.2%     Skoda Octavia     77.2%     Renault Megane     76.2%
VW Golf           86.6%     VW Touareg        83.9%     Dacia Logan        72.1%
VW Passat         82.3%     VW Polo           87.9%     Dacia Sandero      68.1%

Random Forest Regression Algorithm – best model used all the features
Brand   Audi    BMW     Ford    Dacia   Mercedes   VW      Skoda   Renault
Avg.    93.3%   92.2%   90.8%   89.3%   92.3%      93.3%   90.9%   91.2%
Tested the models for brands / models having at least 50 + 8 * m entries, where m is the number of features (m = 4, so 82 ads in our case)
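The sample-size rule is simple arithmetic: with the four features of the best linear model, 50 + 8 * 4 = 82. A small sketch, with hypothetical function names:

```python
def min_required_ads(num_features):
    """Minimum sample size from the 50 + 8 * m rule quoted on the slide."""
    return 50 + 8 * num_features


def has_enough_ads(ad_count, num_features=4):
    """True when a brand / model has enough ads to be modelled."""
    return ad_count >= min_required_ads(num_features)


threshold = min_required_ads(4)  # four features: age, mileage, horsepower, transmission
```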
Conclusions
• We obtained good results by applying multiple linear
regression to car models with a large number of ads (over
200) → prediction accuracy between 80% and 90%, depending
on the brand / model
• For car models with only slightly more than the minimum number
of data samples required (82 ads), the average prediction
accuracy was at least 70%.
• For car models with fewer than 82 ads, the algorithm did
not achieve satisfactory results, confirming the formula
suggested by Tabachnick & Fidell, 2013
• The most important part was the choice of
features and observing the type of relationship between the
independent and dependent variables
• Predictions obtained by random forest regression were
much better (over 90%), confirming Voß & Lessmann, 2013
Questions
Thank you very much!

More Related Content

What's hot

Predicting house prices_Regression
Predicting house prices_RegressionPredicting house prices_Regression
Predicting house prices_Regression
Sruti Jain
 
Project on disease prediction
Project on disease predictionProject on disease prediction
Project on disease prediction
KOYELMAJUMDAR1
 
Traffic sign recognition
Traffic sign recognitionTraffic sign recognition
Traffic sign recognition
AKR Education
 
The 7 steps of Machine Learning
The 7 steps of Machine LearningThe 7 steps of Machine Learning
The 7 steps of Machine Learning
Waziri Shebogholo
 
Disease Prediction by Machine Learning Over Big Data From Healthcare Communities
Disease Prediction by Machine Learning Over Big Data From Healthcare CommunitiesDisease Prediction by Machine Learning Over Big Data From Healthcare Communities
Disease Prediction by Machine Learning Over Big Data From Healthcare Communities
Khulna University of Engineering & Tecnology
 
Driver Drowsiness Detection report
Driver Drowsiness Detection reportDriver Drowsiness Detection report
Driver Drowsiness Detection report
PurvanshJain1
 
Data Analytics Project Presentation
Data Analytics Project PresentationData Analytics Project Presentation
Data Analytics Project Presentation
Rohit Vaze
 
Credit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning AlgorithmsCredit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning Algorithms
ankit panigrahy
 
Machine Learning Project
Machine Learning ProjectMachine Learning Project
Machine Learning Project
Abhishek Singh
 
Stock Market Prediction using Machine Learning
Stock Market Prediction using Machine LearningStock Market Prediction using Machine Learning
Stock Market Prediction using Machine Learning
Aravind Balaji
 
Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.
SUJIT SHIBAPRASAD MAITY
 
Face recognition technology
Face recognition technologyFace recognition technology
Face recognition technology
ranjit banshpal
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
Benjamin Bengfort
 
Detection of heart diseases by data mining
Detection of heart diseases by data miningDetection of heart diseases by data mining
Detection of heart diseases by data mining
Abheepsa Pattnaik
 
Facial Emotion Recognition: A Deep Learning approach
Facial Emotion Recognition: A Deep Learning approachFacial Emotion Recognition: A Deep Learning approach
Facial Emotion Recognition: A Deep Learning approach
AshwinRachha
 
Introduction to Deep Learning
Introduction to Deep Learning Introduction to Deep Learning
Introduction to Deep Learning
Salesforce Engineering
 
Exploratory Data Analysis using Python
Exploratory Data Analysis using PythonExploratory Data Analysis using Python
Exploratory Data Analysis using Python
Shirin Mojarad, Ph.D.
 
PREDICTION OF DIABETES MELLITUS USING MACHINE LEARNING TECHNIQUES
PREDICTION OF DIABETES MELLITUS USING MACHINE LEARNING TECHNIQUESPREDICTION OF DIABETES MELLITUS USING MACHINE LEARNING TECHNIQUES
PREDICTION OF DIABETES MELLITUS USING MACHINE LEARNING TECHNIQUES
IAEME Publication
 
Restaurant Revenue Prediction using Machine Learning
Restaurant Revenue Prediction using Machine LearningRestaurant Revenue Prediction using Machine Learning
Restaurant Revenue Prediction using Machine Learning
researchinventy
 
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Simplilearn
 

What's hot (20)

Predicting house prices_Regression
Predicting house prices_RegressionPredicting house prices_Regression
Predicting house prices_Regression
 
Project on disease prediction
Project on disease predictionProject on disease prediction
Project on disease prediction
 
Traffic sign recognition
Traffic sign recognitionTraffic sign recognition
Traffic sign recognition
 
The 7 steps of Machine Learning
The 7 steps of Machine LearningThe 7 steps of Machine Learning
The 7 steps of Machine Learning
 
Disease Prediction by Machine Learning Over Big Data From Healthcare Communities
Disease Prediction by Machine Learning Over Big Data From Healthcare CommunitiesDisease Prediction by Machine Learning Over Big Data From Healthcare Communities
Disease Prediction by Machine Learning Over Big Data From Healthcare Communities
 
Driver Drowsiness Detection report
Driver Drowsiness Detection reportDriver Drowsiness Detection report
Driver Drowsiness Detection report
 
Data Analytics Project Presentation
Data Analytics Project PresentationData Analytics Project Presentation
Data Analytics Project Presentation
 
Credit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning AlgorithmsCredit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning Algorithms
 
Machine Learning Project
Machine Learning ProjectMachine Learning Project
Machine Learning Project
 
Stock Market Prediction using Machine Learning
Stock Market Prediction using Machine LearningStock Market Prediction using Machine Learning
Stock Market Prediction using Machine Learning
 
Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.
 
Face recognition technology
Face recognition technologyFace recognition technology
Face recognition technology
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Detection of heart diseases by data mining
Detection of heart diseases by data miningDetection of heart diseases by data mining
Detection of heart diseases by data mining
 
Facial Emotion Recognition: A Deep Learning approach
Facial Emotion Recognition: A Deep Learning approachFacial Emotion Recognition: A Deep Learning approach
Facial Emotion Recognition: A Deep Learning approach
 
Introduction to Deep Learning
Introduction to Deep Learning Introduction to Deep Learning
Introduction to Deep Learning
 
Exploratory Data Analysis using Python
Exploratory Data Analysis using PythonExploratory Data Analysis using Python
Exploratory Data Analysis using Python
 
PREDICTION OF DIABETES MELLITUS USING MACHINE LEARNING TECHNIQUES
PREDICTION OF DIABETES MELLITUS USING MACHINE LEARNING TECHNIQUESPREDICTION OF DIABETES MELLITUS USING MACHINE LEARNING TECHNIQUES
PREDICTION OF DIABETES MELLITUS USING MACHINE LEARNING TECHNIQUES
 
Restaurant Revenue Prediction using Machine Learning
Restaurant Revenue Prediction using Machine LearningRestaurant Revenue Prediction using Machine Learning
Restaurant Revenue Prediction using Machine Learning
 
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
 

Similar to Using machine learning to generate predictions based on the information extracted from automobile ads

B.Tech Project Presentation for Students
B.Tech Project Presentation for StudentsB.Tech Project Presentation for Students
B.Tech Project Presentation for Students
Shivam Verma
 
Semantic search in databases
Semantic search in databasesSemantic search in databases
Semantic search in databases
Tomáš Drenčák
 
Intro to Marketing Communications - BMW
Intro to Marketing Communications - BMWIntro to Marketing Communications - BMW
Intro to Marketing Communications - BMW
Hoang Le
 
ETALAB-ACCENTURE PROJECT
ETALAB-ACCENTURE PROJECTETALAB-ACCENTURE PROJECT
ETALAB-ACCENTURE PROJECT
Ojas Karandikar
 
Automotive-new
Automotive-newAutomotive-new
Automotive-new
Marija Nasevska
 
UNIQUE CUSTOMER EXPERIENCE OF ADAS/AD
UNIQUE CUSTOMER EXPERIENCE OF ADAS/ADUNIQUE CUSTOMER EXPERIENCE OF ADAS/AD
UNIQUE CUSTOMER EXPERIENCE OF ADAS/AD
iQHub
 
TESLA Group 7 Final
TESLA Group 7 FinalTESLA Group 7 Final
TESLA Group 7 Final
Mark Moskvitine
 
"The Future of the Automotive Industry", Automotive Session, POSCO EVI Forum
"The Future of the Automotive Industry", Automotive Session, POSCO EVI Forum"The Future of the Automotive Industry", Automotive Session, POSCO EVI Forum
"The Future of the Automotive Industry", Automotive Session, POSCO EVI Forum
Yonki Hyungkeun PARK
 
IRJET- Automobile Resale System using Machine Learning
IRJET- Automobile Resale System using Machine LearningIRJET- Automobile Resale System using Machine Learning
IRJET- Automobile Resale System using Machine Learning
IRJET Journal
 
Audi smart drive
Audi smart driveAudi smart drive
Audi smart drive
Anish Lamba
 
CARVANA - Predicting the purchase quality in car
CARVANA - Predicting the  purchase quality in carCARVANA - Predicting the  purchase quality in car
CARVANA - Predicting the purchase quality in car
ShankarPrasaadRajama
 
The ODB System
The ODB SystemThe ODB System
The ODB System
CLT Valuebased Services
 
Europe EV Powertrain Testing Services Market 2026
Europe EV Powertrain Testing Services Market 2026Europe EV Powertrain Testing Services Market 2026
Europe EV Powertrain Testing Services Market 2026
TechSci Research
 
Audi at ces 2016 press release
Audi at ces 2016 press releaseAudi at ces 2016 press release
Audi at ces 2016 press release
RushLane
 
Presentation Driving Safety
Presentation Driving SafetyPresentation Driving Safety
Presentation Driving Safety
SCOUT Group of Companies
 
Strategic Analysis of Global Low-cost Truck Maket: A Brief Summary
Strategic Analysis of Global Low-cost Truck Maket: A Brief SummaryStrategic Analysis of Global Low-cost Truck Maket: A Brief Summary
Strategic Analysis of Global Low-cost Truck Maket: A Brief Summary
Sandeep Kar
 
Oliver Wyman - FUTURE AUTOMOTIVE INDUSTRY STRUCTURE UNTIL 2030.pdf
Oliver Wyman - FUTURE AUTOMOTIVE INDUSTRY STRUCTURE UNTIL 2030.pdfOliver Wyman - FUTURE AUTOMOTIVE INDUSTRY STRUCTURE UNTIL 2030.pdf
Oliver Wyman - FUTURE AUTOMOTIVE INDUSTRY STRUCTURE UNTIL 2030.pdf
ssuser075877
 
2013 Simulated Car Racing @ GECCO-2013
2013 Simulated Car Racing @ GECCO-20132013 Simulated Car Racing @ GECCO-2013
2013 Simulated Car Racing @ GECCO-2013
Daniele Loiacono
 
Darden School of Business Tesla Strategic Analysis
Darden School of Business   Tesla Strategic AnalysisDarden School of Business   Tesla Strategic Analysis
Darden School of Business Tesla Strategic Analysis
José Ángel Álvarez Fuente
 
Brief Summary_ND32_18
Brief Summary_ND32_18Brief Summary_ND32_18
Brief Summary_ND32_18
Sandeep Kar
 

Similar to Using machine learning to generate predictions based on the information extracted from automobile ads (20)

B.Tech Project Presentation for Students
B.Tech Project Presentation for StudentsB.Tech Project Presentation for Students
B.Tech Project Presentation for Students
 
Semantic search in databases
Semantic search in databasesSemantic search in databases
Semantic search in databases
 
Intro to Marketing Communications - BMW
Intro to Marketing Communications - BMWIntro to Marketing Communications - BMW
Intro to Marketing Communications - BMW
 
ETALAB-ACCENTURE PROJECT
ETALAB-ACCENTURE PROJECTETALAB-ACCENTURE PROJECT
ETALAB-ACCENTURE PROJECT
 
Automotive-new
Automotive-newAutomotive-new
Automotive-new
 
UNIQUE CUSTOMER EXPERIENCE OF ADAS/AD
UNIQUE CUSTOMER EXPERIENCE OF ADAS/ADUNIQUE CUSTOMER EXPERIENCE OF ADAS/AD
UNIQUE CUSTOMER EXPERIENCE OF ADAS/AD
 
TESLA Group 7 Final
TESLA Group 7 FinalTESLA Group 7 Final
TESLA Group 7 Final
 
"The Future of the Automotive Industry", Automotive Session, POSCO EVI Forum
"The Future of the Automotive Industry", Automotive Session, POSCO EVI Forum"The Future of the Automotive Industry", Automotive Session, POSCO EVI Forum
"The Future of the Automotive Industry", Automotive Session, POSCO EVI Forum
 
IRJET- Automobile Resale System using Machine Learning
IRJET- Automobile Resale System using Machine LearningIRJET- Automobile Resale System using Machine Learning
IRJET- Automobile Resale System using Machine Learning
 
Audi smart drive
Audi smart driveAudi smart drive
Audi smart drive
 
CARVANA - Predicting the purchase quality in car
CARVANA - Predicting the  purchase quality in carCARVANA - Predicting the  purchase quality in car
CARVANA - Predicting the purchase quality in car
 
The ODB System
The ODB SystemThe ODB System
The ODB System
 
Europe EV Powertrain Testing Services Market 2026
Europe EV Powertrain Testing Services Market 2026Europe EV Powertrain Testing Services Market 2026
Europe EV Powertrain Testing Services Market 2026
 
Audi at ces 2016 press release
Audi at ces 2016 press releaseAudi at ces 2016 press release
Audi at ces 2016 press release
 
Presentation Driving Safety
Presentation Driving SafetyPresentation Driving Safety
Presentation Driving Safety
 
Strategic Analysis of Global Low-cost Truck Maket: A Brief Summary
Strategic Analysis of Global Low-cost Truck Maket: A Brief SummaryStrategic Analysis of Global Low-cost Truck Maket: A Brief Summary
Strategic Analysis of Global Low-cost Truck Maket: A Brief Summary
 
Oliver Wyman - FUTURE AUTOMOTIVE INDUSTRY STRUCTURE UNTIL 2030.pdf
Oliver Wyman - FUTURE AUTOMOTIVE INDUSTRY STRUCTURE UNTIL 2030.pdfOliver Wyman - FUTURE AUTOMOTIVE INDUSTRY STRUCTURE UNTIL 2030.pdf
Oliver Wyman - FUTURE AUTOMOTIVE INDUSTRY STRUCTURE UNTIL 2030.pdf
 
2013 Simulated Car Racing @ GECCO-2013
2013 Simulated Car Racing @ GECCO-20132013 Simulated Car Racing @ GECCO-2013
2013 Simulated Car Racing @ GECCO-2013
 
Darden School of Business Tesla Strategic Analysis
Darden School of Business   Tesla Strategic AnalysisDarden School of Business   Tesla Strategic Analysis
Darden School of Business Tesla Strategic Analysis
 
Brief Summary_ND32_18
Brief Summary_ND32_18Brief Summary_ND32_18
Brief Summary_ND32_18
 

More from University Politehnica Bucharest

PhD Thesis - Influence of Repetitions on Discourse and Semantic Analysis
PhD Thesis - Influence of Repetitions on Discourse and Semantic AnalysisPhD Thesis - Influence of Repetitions on Discourse and Semantic Analysis
PhD Thesis - Influence of Repetitions on Discourse and Semantic Analysis
University Politehnica Bucharest
 
Time series analysis for sales prediction
Time series analysis for sales predictionTime series analysis for sales prediction
Time series analysis for sales prediction
University Politehnica Bucharest
 
Identification and Classification of the Most Important Moments in Students’ ...
Identification and Classification of the Most Important Moments in Students’ ...Identification and Classification of the Most Important Moments in Students’ ...
Identification and Classification of the Most Important Moments in Students’ ...
University Politehnica Bucharest
 
Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...
Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...
Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...
University Politehnica Bucharest
 
Identifying cyclic words with the help of google
Identifying cyclic words with the help of googleIdentifying cyclic words with the help of google
Identifying cyclic words with the help of google
University Politehnica Bucharest
 
Expression of Political Opinions in Press
Expression of Political Opinions in PressExpression of Political Opinions in Press
Expression of Political Opinions in Press
University Politehnica Bucharest
 
Determine the time period when a text was written using time series analysis
Determine the time period when a text was written using time series analysisDetermine the time period when a text was written using time series analysis
Determine the time period when a text was written using time series analysis
University Politehnica Bucharest
 
Hearthstone helper using optical character recognition techniques for cards d...
Hearthstone helper using optical character recognition techniques for cards d...Hearthstone helper using optical character recognition techniques for cards d...
Hearthstone helper using optical character recognition techniques for cards d...
University Politehnica Bucharest
 
Movie recommender system using the user's psychological profile
Movie recommender system using the user's psychological profileMovie recommender system using the user's psychological profile
Movie recommender system using the user's psychological profile
University Politehnica Bucharest
 
Tracing the paths between concepts in large bio medical corpora
Tracing the paths between concepts in large bio medical corporaTracing the paths between concepts in large bio medical corpora
Tracing the paths between concepts in large bio medical corpora
University Politehnica Bucharest
 
The collection and analysis of public data - Bucharest case study
The collection and analysis of public data - Bucharest case studyThe collection and analysis of public data - Bucharest case study
The collection and analysis of public data - Bucharest case study
University Politehnica Bucharest
 
Archaisms and neologisms identification in texts
Archaisms and neologisms identification in textsArchaisms and neologisms identification in texts
Archaisms and neologisms identification in texts
University Politehnica Bucharest
 
Unsupervised system for automatic grading of bachelor and master thesis
Unsupervised system for automatic grading of bachelor and master thesisUnsupervised system for automatic grading of bachelor and master thesis
Unsupervised system for automatic grading of bachelor and master thesis
University Politehnica Bucharest
 
Tweets topic modelling across different countries prezentarea
Tweets topic modelling across different countries   prezentareaTweets topic modelling across different countries   prezentarea
Tweets topic modelling across different countries prezentarea
University Politehnica Bucharest
 
Sentiment based text segmentation
Sentiment based text segmentationSentiment based text segmentation
Sentiment based text segmentation
University Politehnica Bucharest
 
Creativity detection in texts
Creativity detection in textsCreativity detection in texts
Creativity detection in texts
University Politehnica Bucharest
 
Nlp based heuristics for assessing participants in cscl chats
Nlp based heuristics for assessing participants in cscl chatsNlp based heuristics for assessing participants in cscl chats
Nlp based heuristics for assessing participants in cscl chats
University Politehnica Bucharest
 
Detecting discourse creativity in chat conversations
Detecting discourse creativity in chat conversationsDetecting discourse creativity in chat conversations
Detecting discourse creativity in chat conversations
University Politehnica Bucharest
 
Metaphor detection
Metaphor detectionMetaphor detection
2012 Presidential Elections on Twitter - An Analysis of How the US and French...
2012 Presidential Elections on Twitter - An Analysis of How the US and French...2012 Presidential Elections on Twitter - An Analysis of How the US and French...
2012 Presidential Elections on Twitter - An Analysis of How the US and French...
University Politehnica Bucharest
 

More from University Politehnica Bucharest (20)

PhD Thesis - Influence of Repetitions on Discourse and Semantic Analysis
PhD Thesis - Influence of Repetitions on Discourse and Semantic AnalysisPhD Thesis - Influence of Repetitions on Discourse and Semantic Analysis
PhD Thesis - Influence of Repetitions on Discourse and Semantic Analysis
 
Time series analysis for sales prediction
Time series analysis for sales predictionTime series analysis for sales prediction
Time series analysis for sales prediction
 
Identification and Classification of the Most Important Moments in Students’ ...
Identification and Classification of the Most Important Moments in Students’ ...Identification and Classification of the Most Important Moments in Students’ ...
Identification and Classification of the Most Important Moments in Students’ ...
 
Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...
Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...
Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...
 
Identifying cyclic words with the help of google
Identifying cyclic words with the help of googleIdentifying cyclic words with the help of google
Identifying cyclic words with the help of google
 
Expression of Political Opinions in Press
Expression of Political Opinions in PressExpression of Political Opinions in Press
Expression of Political Opinions in Press
 
Determine the time period when a text was written using time series analysis
Determine the time period when a text was written using time series analysisDetermine the time period when a text was written using time series analysis
Determine the time period when a text was written using time series analysis
 
Hearthstone helper using optical character recognition techniques for cards d...
Hearthstone helper using optical character recognition techniques for cards d...Hearthstone helper using optical character recognition techniques for cards d...
Hearthstone helper using optical character recognition techniques for cards d...
 
Movie recommender system using the user's psychological profile
Movie recommender system using the user's psychological profileMovie recommender system using the user's psychological profile
Movie recommender system using the user's psychological profile
 
Tracing the paths between concepts in large bio medical corpora
Tracing the paths between concepts in large bio medical corporaTracing the paths between concepts in large bio medical corpora
Tracing the paths between concepts in large bio medical corpora
 
The collection and analysis of public data - Bucharest case study
The collection and analysis of public data - Bucharest case studyThe collection and analysis of public data - Bucharest case study
The collection and analysis of public data - Bucharest case study
 
Archaisms and neologisms identification in texts
Archaisms and neologisms identification in textsArchaisms and neologisms identification in texts
Archaisms and neologisms identification in texts
 
Unsupervised system for automatic grading of bachelor and master thesis
Unsupervised system for automatic grading of bachelor and master thesisUnsupervised system for automatic grading of bachelor and master thesis
Unsupervised system for automatic grading of bachelor and master thesis
 
Tweets topic modelling across different countries prezentarea
Tweets topic modelling across different countries   prezentareaTweets topic modelling across different countries   prezentarea
Tweets topic modelling across different countries prezentarea
 
Sentiment based text segmentation
Sentiment based text segmentationSentiment based text segmentation
Sentiment based text segmentation
 
Creativity detection in texts
Creativity detection in textsCreativity detection in texts
Creativity detection in texts
 
Nlp based heuristics for assessing participants in cscl chats
Nlp based heuristics for assessing participants in cscl chatsNlp based heuristics for assessing participants in cscl chats
Nlp based heuristics for assessing participants in cscl chats
 
Detecting discourse creativity in chat conversations
Detecting discourse creativity in chat conversationsDetecting discourse creativity in chat conversations
Detecting discourse creativity in chat conversations
 
Metaphor detection
Metaphor detectionMetaphor detection
Metaphor detection
 
2012 Presidential Elections on Twitter - An Analysis of How the US and French...
2012 Presidential Elections on Twitter - An Analysis of How the US and French...2012 Presidential Elections on Twitter - An Analysis of How the US and French...
2012 Presidential Elections on Twitter - An Analysis of How the US and French...
 

Using machine learning to generate predictions based on the information extracted from automobile ads

  • 1. Using Machine Learning to Generate Predictions Based on the Information Extracted from Automobile Ads
      Author and scientific supervisor: Stere Caciandone and Costin-Gabriel CHIRU, costin.chiru@cs.pub.ro
      Politehnica University of Bucharest, Faculty of Automatic Control and Computers, Department of Computer Science
  • 2. Introduction
      • Purpose: an application that uses ML to estimate the fair resale price of cars, based on ads extracted from popular car-sale websites.
      • Background: the Romanian market is dominated by previously owned cars (e.g. Jan-Feb 2014: 30,600 previously owned cars and only 8,770 new cars): VW 26%, Opel 18%, Ford 12%, BMW
      • Application: better inform the people involved in a sale (owners and potential buyers) whether the price reflects the car's value.
      07.09.2016 AIMSA 2016
  • 3. Similar Approaches
      • ML is used for various prediction tasks: determining the future value of goods (especially gold, oil and stocks), forecasting the weather, predicting the outcome of sports events, etc.
      • Predicting the resale value of cars:
          – Chen: investigated the US market to predict the price of the Toyota Corolla; linear regression; 9.2% error.
          – Pudaruth, 2014: vehicles from Mauritius; Multiple Linear Regression, K-Nearest Neighbors, Decision Trees and Naïve Bayes; a small, poor-quality data set led to bad results.
          – Voß & Lessmann, 2013: compared linear with non-linear learning algorithms and found that random forest is the better choice.
  • 4. Methodology
      • Extract the car-sale data from ads posted on autovit.ro (the largest online ad platform for car resale in Romania).
      • Analyze the ads to determine the links between the price of a car and its features using ML:
          – multiple linear regression and
          – random forest.
      • Use feature selection to obtain the best predictions.
      • Show the results in a web interface that allows the user:
          – to see the car price predicted using ML (for buyers)
          – to query the system about the price of a particular car model (for sellers)
          – to send the links to "good" ads to his/her e-mail account
  • 5. The Data
      • Mined the content of the autovit.ro website using the Scrapy web crawler, yielding over 16,000 car sale ads.
      • Cleaned the data by removing:
          – ads that did not contain all the necessary data (e.g. number of kilometers, horsepower)
          – outliers (ads selling damaged cars for parts, or ads containing typing errors)
      • In the end the database consisted of 15,500 ads, each entry holding the following information about the car: price, year of manufacture, mileage, horsepower, engine capacity, fuel type, model, transmission, Euro emission standard, color, number of doors, the link to the page containing the image of the car, and the link to the sale ad itself.
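The cleaning step described on this slide can be sketched as below. This is a minimal illustration, not the authors' code: the field names (`price`, `year`, `mileage_km`, `horsepower`) and the outlier thresholds are assumptions, since the slides give neither the scraped schema nor the exact outlier criteria.

```python
# Hypothetical field names; the slides do not list the scraped schema.
REQUIRED_FIELDS = ("price", "year", "mileage_km", "horsepower")


def is_complete(ad):
    """True when every required field is present and not empty."""
    return all(ad.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def is_plausible(ad, min_price=500, max_mileage_km=1_000_000):
    """Crude outlier filter; both thresholds are illustrative assumptions."""
    return ad["price"] >= min_price and 0 <= ad["mileage_km"] <= max_mileage_km


def clean_ads(ads):
    """Keep only complete ads that also pass the plausibility filter."""
    return [ad for ad in ads if is_complete(ad) and is_plausible(ad)]


# Made-up sample ads, used only to exercise the filters.
raw_ads = [
    {"price": 7500, "year": 2008, "mileage_km": 185_000, "horsepower": 140},
    {"price": 6900, "year": 2007, "mileage_km": None, "horsepower": 105},    # missing mileage
    {"price": 300, "year": 2005, "mileage_km": 240_000, "horsepower": 130},  # likely sold for parts
]
cleaned = clean_ads(raw_ads)
```

Checking completeness before plausibility matters here: the plausibility test compares numbers, so it must only run on ads whose fields are known to be present.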
  • 6. Case Study – VW Passat (1)
      • First, we needed to identify the relevant features of a car.
      • Then, to build the model using these features.
      • Finally, to evaluate the predictive quality of the model.
      • Used VW Passat (820 ads) for these steps.
      • Analyzed: the age of the car, mileage, horsepower, engine capacity, fuel type (petrol or diesel) and the type of transmission (automatic or manual):
      • Price = 16240 + Age * (-974.86) + No. km. * (-0.04) + Horsepower * 54.29 + Diesel fuel * 887.53 + Petrol fuel * (-887.53) + Manual transmission * (-646.88) + Automatic transmission * 424.31
      • Average accuracy was 70.9% under 10-fold cross-validation.
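To make the regression equation on this slide concrete, the sketch below evaluates it for one car. The coefficients are copied from the slide; the function name, the string encoding of fuel and transmission, and the sample car are assumptions made for illustration. The fuel and transmission terms are treated as mutually exclusive 0/1 indicators, which is how the slide's categories (petrol or diesel, automatic or manual) read.

```python
def passat_price(age_years, mileage_km, horsepower, fuel, transmission):
    """Evaluate the initial VW Passat regression equation from the slide.

    fuel is "diesel" or "petrol"; transmission is "manual" or "automatic".
    """
    price = 16240.0
    price += -974.86 * age_years
    price += -0.04 * mileage_km
    price += 54.29 * horsepower
    # Indicator terms: exactly one fuel and one transmission term applies.
    price += 887.53 if fuel == "diesel" else -887.53
    price += -646.88 if transmission == "manual" else 424.31
    return price


# Hypothetical car: 5 years old, 100,000 km, 140 hp, diesel, manual gearbox.
example = passat_price(5, 100_000, 140, "diesel", "manual")
```

For this hypothetical car the equation gives about 15,207, in whatever currency the ads used (the slides do not state it).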
  • 7. Case Study – VW Passat (2)
  • 8. Case Study – VW Passat (3)
      • Improve the model's accuracy through feature selection and feature engineering:
          – Price = 23055 + Age * (-1108.55) + No. km. * (-0.04): 67.8% accuracy
          – Adding horsepower: 71.2% accuracy
          – Adding transmission type: 71.4% accuracy:
            Price = 15040 + Age * (-1047.05) + No. km. * (-0.038) + Horsepower * 49.3 + Automatic transmission * 974.07 + Manual transmission * 19.01
      • Best model: used only 4 features (age, mileage, horsepower and type of transmission), with age and mileage replaced by logarithm(age) and logarithm(mileage): 82.3% accuracy
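The gain from the logarithmic features can be illustrated with a small sketch. The data below is synthetic (generated so that price falls with log(age)); it is not taken from the paper, and the single-feature closed-form fit is a simplification of the multiple regression used in the slides. It only shows why a linear model fits better once a curved relationship has been straightened by a log transform.

```python
import math


def fit_simple_ols(xs, ys):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sse(xs, ys, intercept, slope):
    """Sum of squared residuals of the fitted line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Synthetic data (an assumption, not from the paper): price = 20000 - 5000 * ln(age).
ages = [1, 2, 4, 8, 16]
prices = [20000 - 5000 * math.log(a) for a in ages]

# Fit once on the raw feature and once on the log-transformed feature.
log_ages = [math.log(a) for a in ages]  # requires age > 0
raw_intercept, raw_slope = fit_simple_ols(ages, prices)
intercept, slope = fit_simple_ols(log_ages, prices)

raw_error = sse(ages, prices, raw_intercept, raw_slope)
log_error = sse(log_ages, prices, intercept, slope)
```

On this synthetic data the log-feature fit recovers the generating coefficients and leaves essentially no residual, while the raw-age fit does not. Note that the transform requires strictly positive ages and mileages, so brand-new cars with age 0 would need separate handling.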
  • 9. Case Study – VW Passat (4)
      • For model validation, the obtained models were applied to other brands / models of cars (again using 10-fold cross-validation):
          – VW Golf (945 ads): initial: 77%, improved: 86.6%
          – Opel Astra (632 ads): initial: 75.1%, improved: 81.1%
          – Audi A4 (509 ads): initial: 78.2%, improved: 87%
          – Ford Focus (474 ads): initial: 69.8%, improved: 82.7%
          – Skoda Octavia (398 ads): initial: 65.8%, improved: 75.1%
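The 10-fold cross-validation used for these figures can be sketched as follows. The slides do not define "accuracy"; the helper below assumes it means 100% minus the mean absolute percentage error, which is only one plausible reading. The function names are hypothetical.

```python
def k_fold_splits(n_samples, k=10):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def accuracy_percent(actual, predicted):
    """ASSUMPTION: accuracy = 100 - mean absolute percentage error."""
    errors = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 100.0 * (1.0 - sum(errors) / len(errors))


# 25 hypothetical ads split into 10 folds.
splits = list(k_fold_splits(25, k=10))
```

In each round the model would be fitted on the `train` indices and scored on the `test` indices, and the ten scores averaged. In practice the ads should be shuffled before splitting so that each fold is a representative sample.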
  • 10. Results
      Improved Multiple Linear Regression Algorithm:

      Brand, model     Average | Brand, model     Average | Brand, model       Average
      Audi A3          88.5%   | BMW X5           78.4%   | Skoda Fabia        90.4%
      Audi A4          87%     | BMW X6           72.1%   | Opel Astra         81.1%
      Audi A5          77.3%   | Ford Focus       82.7%   | Opel Corsa         89.6%
      Audi A6          84.2%   | Ford Fiesta      83.1%   | Mercedes C class   85.6%
      BMW 3 Series     76.3%   | Ford Mondeo      78.7%   | Mercedes S class   82.1%
      BMW 5 Series     74.2%   | Skoda Octavia    77.2%   | Renault Megane     76.2%
      VW Golf          86.6%   | VW Touareg       83.9%   | Dacia Logan        72.1%
      VW Passat        82.3%   | VW Polo          87.9%   | Dacia Sandero      68.1%

      Random Forest Regression Algorithm (best model used all the features):

      Brand   Audi    BMW     Ford    Dacia   Mercedes   VW      Skoda   Renault
      Avg.    93.3%   92.2%   90.8%   89.3%   92.3%      93.3%   90.9%   91.2%

      The models were tested only for brands / models having at least 50 + 8 * m entries, where m is the number of features (82 in our case, i.e. m = 4).
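The minimum sample size rule quoted on this slide (50 + 8 * m) is simple enough to check directly. In the sketch below the function name is hypothetical, the first five ad counts are the ones reported in the validation slide, and the last entry is a made-up model added only to show the filter rejecting something.

```python
def min_samples(num_features):
    """Minimum number of ads for a regression with num_features predictors."""
    return 50 + 8 * num_features


# Ad counts from the validation slide, plus one made-up small model.
ad_counts = {
    "VW Golf": 945,
    "Opel Astra": 632,
    "Audi A4": 509,
    "Ford Focus": 474,
    "Skoda Octavia": 398,
    "Hypothetical rare model": 60,
}

threshold = min_samples(4)  # the best model uses 4 features
eligible = sorted(name for name, count in ad_counts.items() if count >= threshold)
```

With 4 features the threshold is 82 ads, which matches the figure given on the slide.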
  • 11. Conclusions
      • We obtained good results by applying multiple linear regression to car models with a large number of ads (over 200): prediction accuracy between 80% and 90%, depending on the brand / model.
      • For car models with only slightly more than the minimum required number of data samples (82 ads), the average prediction accuracy was at least 70%.
      • For car models with fewer than 82 ads, the algorithm did not achieve satisfactory results, confirming the formula suggested by Tabachnick & Fidell, 2013.
      • The most important part was choosing the features and observing the type of relationship between the independent and dependent variables.
      • Predictions made by random forest regression were much better (over 90%), confirming Voß & Lessmann, 2013.
  • 12. Questions
      Thank you very much!

Editor's Notes

  1. Tabachnick & Fidell, 2013