THE HACK ON JERSEY CITY CONDO PRICES
explore trends in public data
Yiqun “Yi” Wang / NYC Data Science Academy / Code for JC / March 2015
NYC Data Science Academy, student demo day, machine learning with R, Vivian Zhang, yiqun wan
  1. THE HACK ON JERSEY CITY CONDO PRICES: explore trends in public data
     Yiqun “Yi” Wang / NYC Data Science Academy / Code for JC / March 2015
  2. Outline of the Project
     • Data
       - Tax assessment data
       - Third party data sources to join
       - Data janitor and collection works
     • Relationship Exploration
       - Building attributes exploration
       - Individual units price exploration
     • Model for Prices
       - 5 competing models
       - Cross validation
  3. Tax Assessment Data
     NJ MOD IV System
     • Covers all individual properties
     • Downloadable in batch as text files
     Key columns:
     - Property address
     - Property class
     - Property size
     - Year built
     - Owner address
     - Owner name
     - Last sold price
     - Last sold date
     - Qualifier (can parse out condo floor # and unit #)
  4. Tax Assessment Data
     • 62,270 total property records, as of Feb 2015
       -- filter down to --
     • 3,867 individual condo units (of 19 selected mid/high-rise buildings)

         # step 1.1 tax data load from NJ MOD IV system
         url <- "http://tax1.co.monmouth.nj.us/download/0906monm204610.zip"
         download.file(url, "0906monm204610.zip", quiet = FALSE)
         closeAllConnections()
         unzip("0906monm204610.zip")
         taxdata <- read.csv(file = "0906monm204610.csv")
  5. Tax Assessment Data is Dirty!

         # address cleanup: abbreviate street types, then strip punctuation and spaces
         taxdata$Property.Location <- gsub("STREET", "ST", taxdata$Property.Location)
         taxdata$Property.Location <- gsub("BOULEVARD", "BLVD", taxdata$Property.Location)
         taxdata$Property.Location <- gsub("[[:punct:]]", "", taxdata$Property.Location)
         taxdata$Property.Location <- gsub("[[:space:]]", "", taxdata$Property.Location)

         # drop bad prices, bad units, bad sf, bad records, retail condos
         # (is.na() replaces is.null(), which is always FALSE for an existing column)
         taxdata <- taxdata[taxdata$Sale.Price > 10000, ]
         taxdata <- taxdata[taxdata$Sale.Price <= 10000000, ]
         taxdata <- taxdata[!(taxdata$Qual == "" | is.na(taxdata$Qual)), ]
         taxdata <- taxdata[!(taxdata$Sq..Ft. == "" | is.na(taxdata$Sq..Ft.) |
                              taxdata$Sq..Ft. <= 400 | taxdata$Sq..Ft. >= 3000), ]
         taxdata <- taxdata[!(is.na(taxdata$Map.Page)), ]
         taxdata <- taxdata[!(taxdata$Building.Class == "C"), ]
         taxdata <- taxdata[!(substr(taxdata$Qual, 4, 4) %in% c("R", "L", "U")), ]
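To make the effect of the address cleanup concrete, here is a minimal sketch in base R that applies the same four `gsub()` steps to a few made-up addresses. The `raw` vector and the `clean_address()` helper are hypothetical names introduced only for this illustration; they are not part of the original project.

```r
# Hypothetical raw addresses, invented purely for illustration.
raw <- c("389 WASHINGTON STREET", "88 MORGAN BOULEVARD", "1 2ND ST.", "700 GROVE STREET")

# Same four substitutions as on the slide, wrapped in a helper (hypothetical name).
clean_address <- function(x) {
  x <- gsub("STREET", "ST", x)        # abbreviate street type
  x <- gsub("BOULEVARD", "BLVD", x)   # abbreviate boulevard
  x <- gsub("[[:punct:]]", "", x)     # drop punctuation
  x <- gsub("[[:space:]]", "", x)     # drop all whitespace
  x
}

address_clean <- clean_address(raw)
print(address_clean)
```

Because the pattern `"STREET"` matches anywhere in the string, a name that merely contains those letters would also be shortened; a word-boundary pattern such as `"\\bSTREET\\b"` is one possible way to tighten this.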
  6. Tax Assessment Data has Hidden Treasures
     • From “Qual” we can parse out floor number and unit number

         # two buildings use a shifted qualifier layout
         taxdata$Floor <- ifelse(taxdata$AddressClean == "389WASHINGTONST" |
                                 taxdata$AddressClean == "174WASHINGTONST",
                                 substr(taxdata$Qual, 3, 4), substr(taxdata$Qual, 2, 3))
         # penthouse ("PH") units are assigned the building's top floor
         # (assumes BuildingNumberOfStories has been joined onto taxdata)
         taxdata$Floor <- ifelse(taxdata$Floor == "PH",
                                 taxdata$BuildingNumberOfStories, taxdata$Floor)
         taxdata$Unit <- ifelse(taxdata$AddressClean == "389WASHINGTONST" |
                                taxdata$AddressClean == "174WASHINGTONST",
                                substr(taxdata$Qual, 5, 5), substr(taxdata$Qual, 4, 5))
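The qualifier parsing can be tried on a toy table. This is only a sketch: the `units` data frame, its `Stories` column, and the qualifier strings below are invented for illustration, and the real MOD IV qualifier layout may differ from what is assumed here.

```r
# Hypothetical rows; qualifier strings and story counts are invented.
units <- data.frame(
  AddressClean = c("700GROVEST", "389WASHINGTONST", "700GROVEST"),
  Qual         = c("C1204", "C0123", "CPH01"),
  Stories      = c(15, 30, 15),
  stringsAsFactors = FALSE
)

# The two buildings named on the slide use character positions shifted by one.
shifted   <- units$AddressClean %in% c("389WASHINGTONST", "174WASHINGTONST")
floor_chr <- ifelse(shifted, substr(units$Qual, 3, 4), substr(units$Qual, 2, 3))
units$Unit <- ifelse(shifted, substr(units$Qual, 5, 5), substr(units$Qual, 4, 5))

# Penthouse ("PH") units get the building's story count; others are parsed as numbers.
is_ph <- floor_chr == "PH"
units$Floor <- units$Stories
units$Floor[!is_ph] <- as.numeric(floor_chr[!is_ph])

print(units)
```

Converting only the non-penthouse entries avoids the coercion warning that `as.numeric("PH")` would otherwise raise.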
  7. Third Party Data Sources To Join
     • Condo building attributes:
       - http://livingonthehudson.com
       - http://www.jcboe.org
       - http://www.zillow.com
       - http://www.streeteasy.com
       - http://buyersadvisors.com
     • Location primeness:
       - http://walkscore.com
     • Building Geocode / Transit location:
       - http://maps.google.com/maps/api/geocode/
     • Census tract level demographics:
       - http://geomap.ffiec.gov
  8. Map Out All the Buildings

         # step 1.4.3 map the buildings out
         library(ggplot2)
         library(ggmap)
         bldmaap <- ggmap(get_googlemap(center = 'Grove Street, Jersey City, NJ',
                                        zoom = 14, maptype = 'roadmap'),
                          extent = 'device') +
           geom_point(data = bldgeoc, aes(x = lon, y = lat), colour = 'darkblue',
                      alpha = 0.7, na.rm = TRUE, size = 5)
         bldmaap
         ggsave(filename = "map.png", plot = last_plot(), width = 3, height = 3)
  9. Building Attributes
     For each condo building:
     • Address / Lat / Lon
     • Number of Units
     • Number of Stories
     • Year Built
     • Walk Score
     • Census Tract Median Household Income
     • Distance to Water
     • Distance to PATH (subway) Station
  10. Building Data Table
  11. Building Scoring System – PLSR is superior

          # step 1.6 come up with a building primeness score using PCA/PLSR
          library(pls)
          bld.pcr <- pcr(BuildingPPSF ~ OrderYearBuilt + OrderWalkScore +
                           OrderMedianHouseholdIncome + OrderDPTH + OrderDWATER,
                         1, data = blddata, validation = "CV")
          bld.pls <- plsr(BuildingPPSF ~ OrderYearBuilt + OrderWalkScore +
                            OrderMedianHouseholdIncome + OrderDPTH + OrderDWATER,
                          1, data = blddata, validation = "CV")
          blddata$BuildingScore <- predict(bld.pls, newdata = blddata)

      PCA (pcr) TRAINING: % variance explained
                        1 comps
          X               35.29
          BuildingPPSF    22.10

      PLSR TRAINING: % variance explained
                        1 comps
          X               29.58
          BuildingPPSF    62.48
  12. Building Scoring System – PLSR is superior
  13. Some Cross Checking on Buildings

          BuildingName             calc.unit.count   stated.unit.count
          700 Grove                      226               237
          77 Hudson                      407               420
          Clermont Cove                   97                NA
          Crystal Point                  257               269
          Fulton's Landing               106               105
          Gulls Cove                     301               432
          Liberty Terrace                116               118
          Mandalay on the Hudson         250               269
          Montgomery Greene              102               113
          Pier House                      99               180
          Portofino                      264                NA
          Shore Club North               211               220
          Shore Club South               214               220
          Sugar House                     48                65
          The A Condominiums             238               250
          The James Monroe               364                NA
          Trump Plaza                    391               445
          Waldo Lofts                     80                82
          Zephyr Lofts                    96               102

      Among 16 buildings with known units:
      • 3,527 total units
      • 3,142 units covered
      • 89% coverage
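The coverage figures on this slide can be reproduced from the table itself. The sketch below is a base R recomputation using the counts exactly as listed; the vector names `calc` and `stated` are shorthand introduced here, not names from the original code.

```r
# Unit counts copied from the cross-check table, in the order listed.
calc   <- c(226, 407,  97, 257, 106, 301, 116, 250, 102,  99,
            264, 211, 214,  48, 238, 364, 391,  80,  96)
stated <- c(237, 420,  NA, 269, 105, 432, 118, 269, 113, 180,
             NA, 220, 220,  65, 250,  NA, 445,  82, 102)

known       <- !is.na(stated)       # buildings with a stated unit count
total_units <- sum(stated[known])   # stated units in those buildings
covered     <- sum(calc[known])     # units found in the tax data
coverage    <- covered / total_units

cat(sum(known), "buildings,", total_units, "stated units,",
    covered, "covered,", round(100 * coverage), "% coverage\n")
cat("units across all 19 buildings:", sum(calc), "\n")
```

Summing `calc` over all 19 buildings gives 3,867, which agrees with the condo unit count quoted on the earlier Tax Assessment Data slide.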
  14. Condo Unit Attributes
      For each condo unit:
      • Square Footage
      • Sale Price
      • Sale Date
      • Floor
      • Unit Number
      • Building Score
  15. Model the date dimension – price index

          # step 2.3 checking condo price per sf over time (price index)
          library(plyr)  # ddply
          library(zoo)   # rollmean
          aggu <- ddply(.data = findata[findata$Sale.YrQtr >= "1999 Q1" &
                                        findata$Sale.YrQtr <= "2014 Q4" &
                                        !is.na(findata$Sale.YrQtr), ],
                        .variables = 'Sale.YrQtr', summarize,
                        calc.avg.ppsf = mean(SalePrice / SqFt, na.rm = TRUE))
          # trailing 2-, 4- and 8-quarter averages, padded with leading NAs
          aggu$calc.avg.ppsf.r2q <- append(rollmean(aggu$calc.avg.ppsf, 2), rep(NA, 1), after = 0)
          aggu$calc.avg.ppsf.r4q <- append(rollmean(aggu$calc.avg.ppsf, 4), rep(NA, 3), after = 0)
          aggu$calc.avg.ppsf.r8q <- append(rollmean(aggu$calc.avg.ppsf, 8), rep(NA, 7), after = 0)
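Padding `rollmean()` output with leading `NA`s turns it into a trailing moving average: each quarter's smoothed value averages that quarter and the ones before it. As a rough illustration, the same result can be sketched in base R with `stats::filter()`; this is a swapped-in technique rather than the slide's zoo-based code, and the series `ppsf` and the helper `trailing_mean()` are hypothetical.

```r
# Hypothetical quarterly average price-per-square-foot series.
ppsf <- c(100, 110, 120, 130, 140)

# Trailing k-period mean; the first k - 1 entries are NA (hypothetical helper).
trailing_mean <- function(x, k) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

r2q <- trailing_mean(ppsf, 2)
r4q <- trailing_mean(ppsf, 4)
print(data.frame(ppsf, r2q, r4q))
```

With `sides = 1` the filter only looks backwards, so no future quarters leak into the index value for a given quarter.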
  16. Last Missing Variable: The View from Units
      Manually entered, using public-domain floor plan data and listing data, and by consulting broker friends
      Three categories:
      • 2 – Great View
      • 1 – Some View
      • 0 – Nothing Special
      In the future, can look for text descriptions in listings:
      - “Manhattan View”
      - “Bay View”
      - “Corner”
      - etc.
  17. Box Plot: Does View Matter?
  18. Model the Price!
      • Simple linear regression – one variable at a time
      • Multi-linear regression
      • Model Tree (Weka)
      • Generalized Boosted Regression Models (gbm)
      • Random Forest
      • Cross-validation of all the models
  19. Simple linear regression

          # step 3.1 bi-variate linear model
          findata <- read.csv("findata.csv")
          modelLMSqFt <- lm(PPSF ~ SqFt, data = findata)
          summary(modelLMSqFt)            #adjR2=0.0010, p<.05 OKAY NOT SIGNIFICANT
          modelLMFloor <- lm(PPSF ~ Floor, data = findata)
          summary(modelLMFloor)           #adjR2=0.1709, p<.05 GOOD
          modelLMBuildingScore <- lm(PPSF ~ BuildingScore, data = findata)
          summary(modelLMBuildingScore)   #adjR2=0.3584, p<.05 GOOD
          modelLMView <- lm(PPSF ~ View, data = findata)
          summary(modelLMView)            #adjR2=0.0959, p<.05 GOOD
          modelLMIndex <- lm(PPSF ~ Index, data = findata)
          summary(modelLMIndex)           #adjR2=0.1327, p<.05 GOOD
  20. Multi-linear regression

          # step 3.2 multi-variate linear model
          library(caret)  # RMSE
          modelLM <- lm(PPSF ~ SqFt + Floor + View + Index + BuildingScore, data = findata)
          summary(modelLM)   #adjR2=0.4602
          PPSFHatLM <- predict(modelLM, findata)
          RMSE(PPSFHatLM, findata$PPSF, na.rm = TRUE)   #99.17106
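All the models in this deck are compared by RMSE, the square root of the mean squared difference between predicted and observed price per square foot. A minimal base R sketch of that calculation follows; the function name `rmse` and the example numbers are made up for illustration, and the slides themselves call `RMSE()` from the caret package.

```r
# Root mean squared error between predictions and observations.
rmse <- function(pred, obs, na.rm = FALSE) {
  sqrt(mean((pred - obs)^2, na.rm = na.rm))
}

# Hypothetical predicted and observed price-per-square-foot values.
pred <- c(500, 600, 700)
obs  <- c(510, 590, 730)
example_rmse <- rmse(pred, obs)
print(example_rmse)
```

Because errors are squared before averaging, a single large miss (the 30-dollar error here) weighs more heavily than several small ones.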
  21. Model Tree

          # step 3.3 model tree
          library(RWeka)  # M5P
          modelMT <- M5P(PPSF ~ SqFt + Floor + View + Index + BuildingScore, data = findata)
          summary(modelMT)
          findata$PPSFHatMT <- predict(modelMT, findata)
          RMSE(findata$PPSFHatMT, findata$PPSF, na.rm = TRUE)   #86.63803
  22. gbm

          # step 3.4 gbm
          library(gbm)
          findata <- read.csv("findata.csv")
          modelGBM <- gbm(PPSF ~ SqFt + Floor + View + Index + BuildingScore,
                          data = findata, distribution = "gaussian", n.trees = 10000)
          summary(modelGBM)
          findata$PPSFHatGBM <- predict(modelGBM, newdata = findata, n.trees = 10000)
          RMSE(findata$PPSFHatGBM, findata$PPSF)   #90.60208
  23. Random Forest

          # step 3.5 random forest
          library(randomForest)
          findata <- read.csv("findata.csv")
          findata <- findata[!is.na(findata$Index), ]   # randomForest does not accept NAs by default
          modelRF <- randomForest(PPSF ~ SqFt + Floor + View + Index + BuildingScore, data = findata)
          summary(modelRF)
          findata$PPSFHatRF <- predict(modelRF, newdata = findata)
          RMSE(findata$PPSFHatRF, findata$PPSF)   #64.49773
  24. Cross Validation of All Models

          # step 4.1 data partition
          library(caret)  # createDataPartition, RMSE
          in_train <- createDataPartition(findata$PPSF, p = 0.75, list = FALSE)
          findata_train <- findata[in_train, ]
          findata_test <- findata[-in_train, ]

          # one round of 10-fold cross validation; k is only the repetition index
          rmse_cv <- function(k, train) {
            m <- nrow(train)
            num <- sample(1:10, m, replace = TRUE)   # random fold label for each row
            rmse <- numeric(10)
            for (i in 1:10) {
              data.t <- train[num != i, ]   # training folds
              data.v <- train[num == i, ]   # validation fold
              # <MODEL> is a placeholder for each model-fitting function in turn
              model <- <MODEL>(PPSF ~ SqFt + Floor + View + Index + BuildingScore, data = data.t)
              pred <- predict(model, newdata = data.v)
              rmse[i] <- RMSE(pred, data.v$PPSF)
            }
            return(mean(rmse))
          }
          # repeat the 10-fold procedure 100 times
          rmse <- sapply(1:100, rmse_cv, findata_train)
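Since the slide leaves the model as a `<MODEL>` placeholder, here is a self-contained sketch of the same k-fold idea using only base R and `lm()`. Everything in it is an assumption made for illustration: the simulated data frame `sim`, its coefficients, and the helper `cv_rmse()` are invented, and fold labels are balanced with `rep()` rather than drawn with replacement as on the slide.

```r
set.seed(1)

# Simulated stand-in for the condo data (values are invented).
n   <- 200
sim <- data.frame(SqFt  = runif(n, 500, 2000),
                  Floor = sample(1:30, n, replace = TRUE))
sim$PPSF <- 400 + 5 * sim$Floor - 0.02 * sim$SqFt + rnorm(n, sd = 50)

# One round of k-fold cross validation for a linear model (hypothetical helper).
cv_rmse <- function(data, k = 10) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))  # balanced fold labels
  fold_rmse <- numeric(k)
  for (i in seq_len(k)) {
    train <- data[folds != i, ]   # fit on the other k - 1 folds
    valid <- data[folds == i, ]   # score on the held-out fold
    model <- lm(PPSF ~ SqFt + Floor, data = train)
    pred  <- predict(model, newdata = valid)
    fold_rmse[i] <- sqrt(mean((pred - valid$PPSF)^2))
  }
  mean(fold_rmse)
}

cv_result <- cv_rmse(sim)
print(cv_result)
```

Swapping `lm()` for another fitting function is the only change needed to compare models, provided its `predict()` method accepts `newdata` in the same way.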
  25. Cross Validation Result

          RMSE                     Total Universe   Cross Validation
          Multi-linear                  99.17            99.55
          Model Tree (Weka M5P)         86.63            78.46
          GBM                           90.60            91.67
          RandomForest                  64.49            82.19
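One way to read this table is to look at the gap between the cross-validated RMSE and the RMSE on the full data set. The short sketch below computes that gap from the reported numbers; the data frame `results` and its column names are labels chosen here for illustration.

```r
# RMSE values copied from the results table.
results <- data.frame(
  model = c("Multi-linear", "Model Tree (Weka M5P)", "GBM", "RandomForest"),
  total = c(99.17, 86.63, 90.60, 64.49),   # fit and scored on the total universe
  cv    = c(99.55, 78.46, 91.67, 82.19),   # cross-validated
  stringsAsFactors = FALSE
)

results$gap <- results$cv - results$total        # positive gap suggests overfitting
best_cv     <- results$model[which.min(results$cv)]
largest_gap <- results$model[which.max(results$gap)]

print(results)
cat("lowest cross-validated RMSE:", best_cv, "\n")
cat("largest in-sample vs CV gap:", largest_gap, "\n")
```

On these figures the random forest has the lowest in-sample RMSE but the largest gap, while the model tree has the lowest cross-validated RMSE. The model tree's cross-validated figure is below its in-sample figure, which may reflect the cross validation being run on the 75% training partition rather than on the full data.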
  26. Wish List Items…
      • More rigorous regression diagnostics
      • Tuning models better
      • Model blending
      • Compare with Zestimate
  27. THANK YOU!
      yiqun.wang@nyu.edu