Big Data:
Predicting Rent in London by
Machine Learning
Manabu Mukohyoshi
Motivation
• Interested in Machine Learning
• Wide range of Machine Learning applications
in use
• Data-driven cities: City slicker - Data are slowly
changing the way cities operate (The
Economist)
Initial ideas
• Predict fires to dispatch ambulances efficiently
• Predict crimes to dispatch police cars efficiently
• Predict energy consumption (gas, electricity, etc.)
• Predict increase of waste using population
• Predict emission of carbon dioxide
• Predict the rise of rents and house prices using
economics and population data
• Map Londoners’ health on to the map of London
• Predict happiness by region
• Predict congestion
Number of Fires by Ward
Number of Fires by Borough
Number of Fires by hour
# of Fires
First Arrival Time
First Arrival Time and Fire Stations
Initial ideas
• London Datastore has a variety of data
– Mostly statistics
– Not a lot of individual data
• What to learn?
The Idea
• Rent Prediction in London by Machine
Learning
– Can retrieve individual rent data from Zoopla
– Rent keeps changing and it is hard to know if the
rent is right for the place
• For landlords, it can be a standard to decide rent
• For tenants, it can be a standard to judge rent
• For Zoopla, it can attract more customers
Data Source
• Zoopla (about 45,000 examples)
– Latitude, Longitude, # of bedrooms, # of bathrooms, #
of floors, # of receptions, property type, price
• Walkscore
– Scores an address by how walkable it is (how close
it is to grocery stores, restaurants, cafes, etc.)
• MapIt
– Converts latitude/longitude to ward and borough
codes
Data Source
• London Datastore
– Ward profile
• Mean Age, Population density, % Not Born in UK, General Fertility
Rate, Male life expectancy, Female life expectancy, % children in
year 6 who are obese, Rate of All Ambulance Incidents per 1,000
population, Employment rate (16-74), Median House Price,
Number of properties sold, % Households Social Rented, %
Households Private Rented, % dwellings in council tax bands A or
B, % dwellings in council tax bands C, D or E, % dwellings in
council tax bands F, G or H, Claimant Rate of Income Support, %
with no qualifications, % with Level 4 qualifications and above,
Crime rate, Deliberate Fires, Cars per household, Average Public
Transport Accessibility score, Turnout at Mayoral election - 2012
– Borough profile
• Total carbon emissions, Teenage conception rate, Life satisfaction
score, Worthwhileness score, Happiness score, Anxiety score
Steps to solve
1. Collect and combine data
2. Preprocess data
3. Try different algorithms of machine learning
on the collected data
4. Tune the parameters of ML algorithms
5. Evaluate the results and algorithms
Step 1: Collect and Combine Data
1. Download listings data using Zoopla API
2. Get Walkscore using the API
3. Convert Longitude/Latitude to ward and
borough code using self-hosted MapIt
4. Merge the ward and borough profiles downloaded
from London Datastore into the listings data
MapIt: UK
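The merge in item 4 can be sketched as a left join keyed on the ward and borough codes. This is only an illustration in plain Python: the field names, codes and values below are hypothetical, not taken from the Zoopla or London Datastore files, and on real data a DataFrame merge in pandas would be the more usual tool.

```python
# Hypothetical sample rows; field names, codes and values are illustrative only.
listings = [
    {"listing_id": 1, "num_bedrooms": 2, "price_pw": 350.0,
     "ward_code": "W1", "borough_code": "B1"},
    {"listing_id": 2, "num_bedrooms": 1, "price_pw": 280.0,
     "ward_code": "W2", "borough_code": "B1"},
    {"listing_id": 3, "num_bedrooms": 3, "price_pw": 600.0,
     "ward_code": "W9", "borough_code": "B2"},
]
ward_profiles = {
    "W1": {"mean_age": 34.1, "crime_rate": 80.2},
    "W2": {"mean_age": 36.5, "crime_rate": 65.0},
}
borough_profiles = {
    "B1": {"happiness_score": 7.2},
    "B2": {"happiness_score": 7.4},
}


def merge_profiles(listings, ward_profiles, borough_profiles):
    """Left-join ward and borough profile columns onto each listing."""
    ward_cols = sorted({c for p in ward_profiles.values() for c in p})
    borough_cols = sorted({c for p in borough_profiles.values() for c in p})
    merged = []
    for row in listings:
        out = dict(row)
        ward = ward_profiles.get(row["ward_code"], {})
        borough = borough_profiles.get(row["borough_code"], {})
        # A listing whose ward or borough has no profile gets None,
        # which a later imputation step can fill in.
        for col in ward_cols:
            out[col] = ward.get(col)
        for col in borough_cols:
            out[col] = borough.get(col)
        merged.append(out)
    return merged


merged = merge_profiles(listings, ward_profiles, borough_profiles)
for row in merged:
    print(row)
```

Every listing is kept even when its ward is missing from the profile table (listing 3 here), so no rent examples are lost at this stage.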
Step 2: Preprocess Data
• Scale (bias elimination)
• Encode categorical features
• Impute
– Replace n/a or blank values with the mean
• Shuffle
• Split into training dataset and test dataset
(cross validation)
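A minimal sketch of these preprocessing steps with scikit-learn follows. The toy arrays are hypothetical, and the ordering is an assumption of this sketch rather than the presenter's exact procedure: the data is shuffled and split first so that the imputation means and scaling statistics are learned from the training rows only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: latitude, longitude, number of bedrooms (with gaps).
numeric = np.array([
    [51.50, -0.12, 2.0],
    [51.55, -0.10, np.nan],
    [51.48, -0.20, 1.0],
    [51.60, -0.05, 3.0],
    [51.52, -0.15, np.nan],
    [51.46, -0.08, 2.0],
    [51.58, -0.18, 4.0],
    [51.51, -0.11, 1.0],
])
property_type = np.array(
    [["flat"], ["house"], ["flat"], ["house"],
     ["studio"], ["flat"], ["house"], ["flat"]]
)
rent_pw = np.array([350.0, 500.0, 280.0, 650.0, 240.0, 360.0, 800.0, 300.0])

# Shuffle and split into training and test sets.
num_train, num_test, type_train, type_test, y_train, y_test = train_test_split(
    numeric, property_type, rent_pw, test_size=0.25, shuffle=True, random_state=0
)

# Impute missing numeric values with the column mean, then scale.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
# One-hot encode the categorical feature; unseen categories become all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")

X_train = np.hstack([
    numeric_steps.fit_transform(num_train),
    encoder.fit_transform(type_train).toarray(),
])
X_test = np.hstack([
    numeric_steps.transform(num_test),
    encoder.transform(type_test).toarray(),
])
print(X_train.shape, X_test.shape)
```

Fitting the transformers on the training split and only applying them to the test split keeps information about the test rows out of the model.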
Step 3: Try Different Algorithms
Name                                      Average MSE
1.11.2.1. Random Forests                  0.241214063
1.11.4. Gradient Tree Boosting            0.273875445
1.11.1. Bagging meta-estimator            0.296172365
1.11.2.2. Extremely Randomized Trees      0.296710726
1.6.3. KNeighborsRegressor uniform        0.306133182
1.6.3. KNeighborsRegressor distance       0.319488307
1.10. DecisionTreeRegressor               0.336486662
1.10. ExtraTreeRegressor                  0.40337387
1.4.2 SVR poly                            0.429585937
1.4.2 NuSVR poly                          0.434766842
1.11.3. AdaBoost                          0.443524744
1.4.2 SVR rbf                             0.476364995
1.1.9.1. Bayesian Ridge Regression        0.567228078
1.1.4. Elastic Net                        0.56727658
1.1.2. Ridge Regression                   0.567611415
1.1.1. Ordinary Least Squares             0.567641956
1.1.11. Stochastic Gradient Descent       0.573168168
1.1.8. Orthogonal Matching Pursuit        0.576630178
1.1.14.3. Theil-Sen estimator             0.5875179
1.4.2 SVR linear                          0.642531415
1.4.2 LinearSVR                           0.667162534
1.1.14.2. RANSAC                          0.705499997
1.1.13. Passive Aggressive Algorithms     0.726516853
1.1.3. Lasso                              0.899948627
1.1.7. LARS Lasso                         0.899948627
1.4.2 SVR sigmoid                         0.937398784
1.8. Cross decomposition PLSRegression    1.662293485
1.6.3. NearestCentroid                    1.701974047
1.8. Cross decomposition PLSCanonical     10.72550448
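A comparison like the table above can be sketched by cross-validating several scikit-learn regressors on the same data and averaging the MSE per model. The synthetic data, the three models chosen and the 10-fold setting are assumptions made for illustration; the numbers this prints are unrelated to the results in the table.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the listings data (not the real dataset).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=200)

models = {
    "Random Forests": RandomForestRegressor(n_estimators=50, random_state=0),
    "KNeighborsRegressor uniform": KNeighborsRegressor(n_neighbors=5, weights="uniform"),
    "Ordinary Least Squares": LinearRegression(),
}

average_mse = {}
for name, model in models.items():
    # scikit-learn scorers follow "greater is better", so MSE is reported
    # negated; flip the sign to get the usual positive MSE.
    scores = cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error")
    average_mse[name] = -scores.mean()

# Sort ascending: lower MSE is better.
for name, mse in sorted(average_mse.items(), key=lambda item: item[1]):
    print(f"{name:30s} {mse:.4f}")
```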
Step 4: Tune Parameters of Algs.
• Grid Search
– Exhaustively search the possible combinations of
parameters
– Takes too much time on my computer
• Random Search
– Takes less time
– Result is similar to grid search
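The cost difference between the two strategies comes from how many parameter combinations each one tries. The sketch below only counts candidates for a hypothetical search space (the parameter names and values are illustrative); in scikit-learn the corresponding tools are GridSearchCV and RandomizedSearchCV, which also fit and score a model for every candidate.

```python
import itertools
import random

# Hypothetical search space for a support vector regressor.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
    "degree": [2, 3, 4],
}


def grid_candidates(grid):
    """Every combination of the listed values (exhaustive grid search)."""
    keys = sorted(grid)
    return [
        dict(zip(keys, values))
        for values in itertools.product(*(grid[key] for key in keys))
    ]


def random_candidates(grid, n_iter, seed=0):
    """n_iter combinations drawn at random (repeats are possible)."""
    rng = random.Random(seed)
    return [
        {key: rng.choice(values) for key, values in sorted(grid.items())}
        for _ in range(n_iter)
    ]


grid = grid_candidates(param_grid)
sampled = random_candidates(param_grid, n_iter=10)
print(len(grid), "candidates for grid search")      # 4 * 4 * 3 = 48
print(len(sampled), "candidates for random search")
```

Each candidate is fitted once per cross-validation fold, so the gap in running time grows with both the size of the grid and the number of folds.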
Let’s see tuning parameters…
Support Vector Regression
KNN
Step 5: Evaluate
• Feature Importance
• Final MSE for 4 selected algorithms
• Compare rents with Zoopla Estimate
Feature Importance:
Random Forest
Feature Importance:
GBR
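Feature importances of this kind can be read from a fitted tree ensemble in scikit-learn through its feature_importances_ attribute, which both RandomForestRegressor and GradientBoostingRegressor expose. The sketch below uses synthetic data with hypothetical feature names, so the ranking it prints says nothing about the real listings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends strongly on num_bedrooms,
# weakly on latitude, and not at all on the other two columns.
rng = np.random.default_rng(0)
feature_names = ["latitude", "longitude", "num_bedrooms", "noise"]
X = rng.uniform(0.0, 1.0, size=(300, 4))
y = 5.0 * X[:, 2] + 1.0 * X[:, 0] + 0.05 * rng.normal(size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Importances are non-negative and sum to 1 across features.
importances = dict(zip(feature_names, forest.feature_importances_))
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
for name, value in ranked:
    print(f"{name:15s} {value:.3f}")
```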
MSE on Cross Validation and new listings data
[Chart: MSE (y-axis, 0 to 2) for cross-validation folds 1 to 10 and for new data (x-axis), with one series each for KNN, GBR, RF and SVR, plus the standard deviation]
Final Result
Algorithm                 Final Result (MSE)   MSE from Step 3   Fitting Time (42003 examples)   Predicting Time (3582 examples)
Random Forest             0.108435602          0.241214063       53.12 sec                       2.03 sec
Gradient Tree Boosting    0.117256254          0.273875445       149.18 sec                      0.45 sec
Support Vector Machine    0.143577993          0.429585937       3192.02 sec                     4.54 sec
K-Nearest Neighbors       0.217186025          0.306133182       3.97 sec                        3.82 sec
Actual rent and predicted rent (Random Forest)
[Chart: predicted and actual rent in £ (y-axis, 0 to 5000) per listing (x-axis, ticks from 1 to 3529)]
Compare rents with Zoopla Estimate (1/2)
Zoopla Estimate
Actual Rent
Predicted Rent by Random Forests
£381.5120443 pw = £1653 pcm
(pw x 52 / 12 = pcm)
Compare rents with Zoopla Estimate (2/2)
Zoopla Estimate
Actual Rent
Predicted Rent by Random Forests
£1488.237929 pw = £6449 pcm
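The weekly-to-monthly conversions on these two slides follow directly from the stated formula (pw x 52 / 12 = pcm). A small check, with the function name chosen here for illustration:

```python
def weekly_to_monthly(rent_pw):
    """Convert rent per week (pw) to rent per calendar month (pcm)."""
    return rent_pw * 52 / 12


# The two Random Forest predictions quoted on the slides, in pounds per week.
for rent_pw in (381.5120443, 1488.237929):
    rent_pcm = weekly_to_monthly(rent_pw)
    print(f"{rent_pw:.2f} pw = {rent_pcm:.0f} pcm")
```

Rounded to whole pounds, this gives 1653 and 6449 per calendar month, matching the figures on the slides.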
Conclusion
• Random Forest works the best for this
problem
• Data quality influences the prediction results
more than the parameters of the machine learning
algorithms do
• Could not compare all the predicted rents with the
Zoopla estimates, but some predictions were closer
to the actual rents than the Zoopla estimates
Future Work
• Add more room-specific information, such
as room size and age
• Make an app to predict rent by inputting an
address, # of bedrooms, # of bathrooms, # of
floors and property type
Challenges
• Collect Data
– Time consuming
– Hard to find a good dataset
• Statistics
– Possible to use machine learning without knowing
math/statistics
– Need to know them in order to understand deeply
what ML algorithms do or to tune the parameters
efficiently
What I learned
• Python
• Scikit-learn / Tableau / Google Maps API /
Walkscore API / Coordinate systems (MapIt
API)
• How to apply machine learning algorithms
• Collecting a good dataset is more important
than algorithms
References
• Walkscore
– https://www.walkscore.com
• MapIt
– http://mapit.poplus.org
• Google Maps API
– https://developers.google.com/maps/documentation/javascript/
• Scikit-learn
– http://scikit-learn.org/stable/
• London Datastore
– http://data.london.gov.uk
• Tableau
– http://www.tableau.com
References
• Zoopla
– http://www.zoopla.co.uk
– Examples from Zoopla
• http://www.zoopla.co.uk/property/101-greyhound-road/london/n17-6xr/15262720
• http://www.zoopla.co.uk/to-rent/details/36920785#5yJdKDM4BovT5eu6.97
• http://www.zoopla.co.uk/property/28-cato-street/london/w1h-5jj/28909969
• http://www.zoopla.co.uk/to-rent/details/37005409?search_identifier=0f64a06eeb798647935af065dcaf87c4#V6Xmr062sEqY198c.97
References
• Data-driven cities: City slicker - Data are slowly changing
the way cities operate (The Economist)
– http://www.economist.com/news/britain/21629533-data-are-slowly-changing-way-cities-operate-city-slicker
• CS7641.TNL.MATLAB. Supervised Learning Workflow and
Algorithms
– http://wiki.omscs.org/confluence/display/CS7641ML/CS7641.TNL.MATLAB.+Supervised+Learning+Workflow+and+Algorithms
• Coursera: Machine Learning by Andrew Ng
– https://www.coursera.org/course/ml
• Questions?
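MSE
• The MSE (mean squared error) is the average of the
squared vertical distances between the data points
and the fitted line (representing predictions made
by the model).
• The RMSE, the square root of the MSE, is roughly the
distance, on average, of a data point from the fitted
line, measured along a vertical line.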
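The relationship between MSE and RMSE can be checked on a few numbers. The rents below are hypothetical weekly values chosen to make the arithmetic easy to follow.

```python
import math


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the MSE, expressed in the same unit as the target."""
    return math.sqrt(mean_squared_error(actual, predicted))


# Hypothetical weekly rents in pounds.
actual = [300.0, 450.0, 500.0, 380.0]
predicted = [310.0, 430.0, 520.0, 380.0]

mse = mean_squared_error(actual, predicted)        # (100 + 400 + 400 + 0) / 4
rmse = root_mean_squared_error(actual, predicted)
print(mse, rmse)
```

Here the MSE is 225 (in squared pounds) and the RMSE is 15 pounds, which is why the RMSE is usually the easier figure to interpret.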
Cross Validation
https://chrisjmccormick.wordpress.com/2013/07/31/k-fold-cross-validation-with-matlab-code/
Random Forests
http://provectus.com/blog/news/research_paper_for_load_forecast
Gradient Tree Boosting
http://provectus.com/blog/news/research_paper_for_load_forecast
Support Vector Machine
http://docs.opencv.org/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html
K-Nearest Neighbors
http://bdewilde.github.io/blog/blogger/2012/10/26/classification-of-hand-written-digits-3/
What is Machine Learning?
• Supervised learning
– Classification
– Regression
[Diagram: Fitting/Training takes # of bedrooms, lat/long and rent as input; Predicting takes # of bedrooms, lat/long as input and outputs Predicted Rent]


Editor's Notes

  1. Red is higher, green is lower
  2. What is Walkscore
  3. Sorting by MSE. Predicting rent. Lower is better. Duplicate the same kind of algorithms
  4. Degree as polynomial
  5. Show other graphs of K
  6. Walkscore is useful. London profile data is not so useful because it is incorporated in lat/long
  7. Highlight the new data