Performed an association study and modeled the data using linear regression, LASSO and ridge regression, random forest, and gradient-boosted trees to predict house prices in Brooklyn over the next few years.
Contents
Problem Statement/Objective
Source
Exploratory Data Analysis
Characteristics that affect housing prices
Data Modification
Log Transformations
Normalizing and Standardizing the data
One-Hot Encoding: Categorical Data
Modeling
(a) Linear Regression
(b) Regularized regression: Lasso and Ridge
(c) Random Forest
(d) Boosted Tree (XGBoost)
Learning Curve
Recommendations
1) Recommendations for customers who are investors
2) Recommendations for builders
Problem Statement/Objective
Introduction: New York City's shortage of affordable housing has reached a crisis point. Especially in Brooklyn, a demand-supply gap has led to a continuous increase in house prices. Prices are determined by several factors, including the size of the apartment, the neighborhood, proximity to commercial hubs, and other amenities. Brooklyn housing sale prices saw a continuous increase from 2003 to 2017, and these variables played a major role in determining them.
Objective: To estimate sale prices for houses in Brooklyn over the next few years by applying regression techniques and visualization.
SEMMA Approach to Predictive Modeling
Every modeling project should follow the SEMMA approach (Sample, Explore, Modify, Model, Assess): once the business goal is defined, we work through the SEMMA steps in order. In our project we made sure to follow these steps and draw insights accordingly.
Source
The data was taken from the following Kaggle link:
https://www.kaggle.com/tianhwu/brooklynhomes2003to2017
The primary dataset for the housing sales data comes from the NYC Department of Finance site. In addition, the data for other important housing variables was obtained from the NYC Department of City Planning.
Exploratory Data Analysis
The initial data consisted of 390,883 rows and 109 variables in addition to the sale_price variable. To remove redundant variables, we analyzed which variables affect sale price and brought the number of independent variables down to 27.
Exploration of the data was done with data-visualization techniques in R. Bar plots were used for the categorical variables, while scatter plots and Q-Q plots proved useful for understanding how the continuous variables relate to sale price.
Removing missing values
Due to the unstructured nature of the data, many records had missing values and were improperly arranged. As the missing-value map showed, 17 categorical variables had many missing values. It is difficult to impute values for these columns when all predictors are blank in a given row, so we dropped those observations (22.3% of the data).
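The drop rule described above can be sketched as follows; the column names and values here are illustrative placeholders, not the actual dataset fields:

```python
# Minimal sketch of dropping rows whose predictor fields are all blank
# (hypothetical column names, not the real dataset schema).
rows = [
    {"land_sqft": 2000, "gross_sqft": 1800, "sale_price": 550000},
    {"land_sqft": None, "gross_sqft": None, "sale_price": 610000},  # all predictors blank
    {"land_sqft": 2400, "gross_sqft": 2100, "sale_price": 700000},
]
predictors = ["land_sqft", "gross_sqft"]

# Keep a row only if at least one predictor is populated.
clean = [r for r in rows if not all(r[p] is None for p in predictors)]
print(len(clean))  # 2 rows survive
```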
Characteristics that affect housing prices
In terms of the housing characteristics used in our regression models, we wanted a well-rounded list spanning three broad categories: property characteristics, community characteristics, and proximity characteristics. We wanted to include more than just structural variables in our hedonic regression models.
Property Characteristics
Housing prices depend largely on the characteristics of the house. In general, larger homes with more floors and more area sell for higher prices. The numbers of bedrooms and bathrooms tend to increase the sale price, even after controlling for other physical, locational, and quality features. The area of the property is highly proportional to the sale price of the house. From our data, we plotted the different house characteristics against sale_price and drew some conclusions about their relationships. The housing-characteristic variables in our data are:
LandSqft, GrossSqft, BldgFront, LotSize, ResidentialUnits, CommercialUnits, NumFloors, NumBuildings, UnitsRes, GarageArea, YearAlter.
After performing EDA on these variables, we observed the following:
Housing price increases with land square footage (they are directly proportional), while it generally increases with gross square footage up to 8,000 sqft and then remains almost constant.
The number of residential units is mostly between 1 and 3, with very few properties above 4; sale_price generally increases with the number of residential units.
Most properties have 0 commercial units, and very few have more than 2; sale_price generally increases with the number of commercial units.
sale_price shows a very good linear relationship with the Assessed Total price, i.e., the sale price is in line with the assessed total.
Neighborhood Characteristics
In addition to the housing characteristics, the surroundings of a house also play an important role in predicting its price. While choosing a house to live in, every individual thinks about the surroundings and the locality; in general, houses close to shopping malls, schools, workplaces, restaurants, etc. are preferred over houses in deserted areas. From the available data, we plotted the different neighborhood characteristics against sale_price and drew some conclusions about their relationships.
The following neighborhood-characteristic variables are in our data:
neighborhood, tax_class, building_class, school district, council, police precinct, health center, easements.
We observed the following after performing EDA on the variables above:
From the graph it can be inferred that houses in the Windsor Terrace and Red Hook neighborhoods are costlier, while houses around Bath Beach are cheaper. Prices in these neighborhoods range from nominal values to very high values.
Tax class:
Houses categorized in tax classes 4, 2C, and 2B are priced very high, whereas houses in tax class 1 record comparatively low prices.
Police precinct:
Sale price is highest in Brooklyn North, covering police precincts 088, 090, and 094, while it is lowest in the Brooklyn South area.
Data Modification
Feature Engineering
Based on our observations, we found many outliers that would not contribute to the model. To remove these unnecessary values, data treatment and feature engineering were performed on the following variables.
Removing outliers
Variable: filtered (kept) values
Sale_price: $10,000 to $3,000,000
LandSqft: 1,000 to 7,000
GrossSqft: 100 to 10,000
BldgFront: less than 200
ResidentialUnits: below 8
NumBldgs: less than 4
NumFloors: less than 5
UnitsRes: less than 9
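The range filters listed above can be sketched as a simple predicate; the lower bounds for the "less than" variables and the column names are assumptions for illustration:

```python
# Illustrative outlier filter: each variable gets an allowed range, and a
# row is kept only if every filtered variable it carries is inside its range.
# Lower bounds of 0 for the "less than" variables are an assumption.
FILTERS = {
    "sale_price": (10_000, 3_000_000),
    "land_sqft": (1_000, 7_000),
    "gross_sqft": (100, 10_000),
    "bldg_front": (0, 200),
    "residential_units": (0, 8),
    "num_bldgs": (0, 4),
    "num_floors": (0, 5),
    "units_res": (0, 9),
}

def within_limits(row):
    """True if every filtered variable present in the row is inside its range."""
    return all(lo <= row[var] <= hi for var, (lo, hi) in FILTERS.items() if var in row)

row_ok = {"sale_price": 650_000, "land_sqft": 2_500, "gross_sqft": 1_900}
row_bad = {"sale_price": 5_000_000, "land_sqft": 2_500, "gross_sqft": 1_900}
print(within_limits(row_ok), within_limits(row_bad))  # True False
```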
Dealing with Categorical Variables
GarageArea: since most buildings do not have garages, we grouped the variable into 1 (has a garage) and 0 (no garage).
CommercialUnits: grouped into 1 (commercial units present) and 0 (no commercial units).
Year_built: 1 if built between 1850 and 1900, 2 if built between 1900 and 2000, and 3 if built after 2000.
Combining Attributes
YearAlter1 and YearAlter2: instead of keeping separate variables, we combined these into 0 (not altered), 1 (altered once), and 2 (altered twice).
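A sketch of these recodings, with assumed column names (the real dataset's field names may differ):

```python
# Illustrative recodings: garage flag, commercial-units flag, year-built era,
# and the two alteration years combined into a 0/1/2 count.
def recode(row):
    out = dict(row)
    out["garage"] = 1 if row["garage_area"] > 0 else 0           # has a garage?
    out["commercial"] = 1 if row["commercial_units"] > 0 else 0  # any commercial units?
    yb = row["year_built"]
    out["era"] = 1 if yb < 1900 else (2 if yb <= 2000 else 3)    # three build eras
    # Nonzero alteration years become flags; their sum counts alterations.
    out["times_altered"] = int(row["year_alter1"] > 0) + int(row["year_alter2"] > 0)
    return out

r = recode({"garage_area": 0, "commercial_units": 2, "year_built": 1955,
            "year_alter1": 1987, "year_alter2": 0})
print(r["garage"], r["commercial"], r["era"], r["times_altered"])  # 0 1 2 1
```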
Correlation check:
Next, we drew a correlation plot to check for strong relationships among the variables. A high correlation means that high values of one variable are associated with high values of the other, and low values with low values; such overlap distorts the effect the variables have on the outcome. After examining the correlation plot, we checked the VIF of the variables with high correlation.
We observed that ResidentialUnits has a high correlation with UnitsRes, UnitsTotal, and gross_sqft, and that UnitsRes also correlates highly with gross_sqft and UnitsTotal. The VIF of these variables was high. Using a VIF cutoff of 4, we removed ResidentialUnits and UnitsRes.
Initial correlation plot
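For two nearly collinear predictors, the VIF check reduces to VIF = 1 / (1 - r²), where r is their correlation (with one other predictor, the auxiliary regression's R² is just r²). A small sketch with made-up numbers, not the project's data:

```python
# Illustrative VIF check for a nearly collinear pair of predictors.
# Values above the cutoff of 4 flag redundancy.
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

units_res = [1, 2, 2, 3, 4, 5, 6, 8]                      # made-up
gross_sqft = [900, 1700, 1800, 2600, 3500, 4300, 5200, 7000]  # nearly proportional

r = pearson(units_res, gross_sqft)
vif = 1 / (1 - r * r)
print(round(r, 4), round(vif, 1))
if vif > 4:
    print("high VIF: drop one of the pair")
```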
Normalizing and Standardizing the data
Log Transformations
The main reason we use a log transformation is to reduce skewness in the data. There are other reasons as well:
• It makes patterns in the data easier to interpret.
• Some statistical analyses require the data to be normalized.
Skewness:
• A skewness of zero or near zero indicates a symmetric distribution.
• A negative skewness indicates left skew (tail to the left).
• A positive skewness indicates right skew (tail to the right).
Kurtosis:
• Kurtosis is a measure of how extreme the observations in a dataset are.
• The greater the kurtosis coefficient, the more peaked the distribution is around the mean.
• A greater coefficient also means fatter tails, which means an increase in tail risk (extreme results).
We have numerical variables such as land_sqft, gross_sqft, BldgFront, and AssessTot. When we checked the skewness of these numerical variables against a cutoff of 0.8, we saw that their skewness is high. Regression carries an assumption of multivariate normality, meaning it expects its variables to be (approximately) normal, so by having skewed data we violate that assumption. The kurtosis values are also high for these variables; we checked kurtosis against an absolute cutoff of 3.
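A small numpy sketch of this skewness/kurtosis screen on synthetic data, using the same cutoffs of 0.8 and 3:

```python
import numpy as np

def skewness(x):
    """Sample skewness: third standardized moment."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

def excess_kurtosis(x):
    """Sample excess kurtosis: fourth standardized moment minus 3."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean() - 3.0

rng = np.random.default_rng(1)
gross_sqft = rng.lognormal(mean=8, sigma=1.0, size=5000)  # right-skewed stand-in
symmetric = rng.normal(size=5000)                          # symmetric comparison

# Flag variables whose |skewness| exceeds 0.8 or whose |excess kurtosis| exceeds 3.
needs_transform = abs(skewness(gross_sqft)) > 0.8 or abs(excess_kurtosis(gross_sqft)) > 3
```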
Before Transformation:
We tried a Box-Cox transformation to transform the skewed numeric variables at the optimized value of lambda. We also transformed the variables using a log transformation: variables containing 0 were transformed using “log(1 + value)”, and variables not containing 0 anywhere in their range were transformed using “log(value)”.
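The zero-aware log rule can be written as a small helper (a numpy sketch; scipy.stats.boxcox would give the lambda-optimized Box-Cox variant mentioned above):

```python
import numpy as np

def log_transform(col):
    """log(1 + value) when the column contains zeros, plain log(value) otherwise."""
    col = np.asarray(col, dtype=float)
    return np.log1p(col) if (col == 0).any() else np.log(col)

# Illustrative columns: one with zeros, one strictly positive.
land_sqft = np.array([0.0, 1200.0, 4500.0])     # contains zeros -> log1p branch
gross_sqft = np.array([900.0, 2400.0, 8000.0])  # strictly positive -> plain log branch
```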
After Transformation:
Building front still has high kurtosis. We checked its high values and the probability of their being outliers.
Standardizing the data
Scale: the scale transform calculates the standard deviation of an attribute and divides each value by it.
Center: the center transform calculates the mean of an attribute and subtracts it from each value.
Standardize: scaling and centering together standardize the data, giving mean = 0 and SD = 1.
We used the ‘caret’ package in R, whose ‘preProcess’ function standardizes numerical attributes. We standardized the attributes below:
1. Gross Sqft
2. Building Front
3. Assess Total
4. Land Sqft
The ‘predict’ function in caret applies the transformation to the original dataset. After this step we merged the dependent variable (sale price) back with the transformed numerical data.
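caret's preProcess/predict pair amounts to learning the mean and SD from the training data once and then applying them everywhere; a numpy sketch of that split:

```python
import numpy as np

def fit_standardizer(train):
    """Learn center/scale from training data only (the preProcess step)."""
    mu = train.mean(axis=0)
    sd = train.std(axis=0)
    return lambda data: (data - mu) / sd   # the apply step, like caret's predict

rng = np.random.default_rng(2)
train = rng.normal(loc=50, scale=10, size=(300, 4))  # stand-in for the 4 numeric columns
scale = fit_standardizer(train)
train_std = scale(train)
```

Keeping the learned mu/sd in a closure ensures validation data is scaled with training statistics, not its own.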
One-Hot Encoding: Categorical Data
One-hot encoding creates new binary columns indicating the presence of each possible value from the original data. It transforms categorical variables into a format that works better with classification and regression algorithms. Regression models, in their purest form, treat all independent variables as numeric. In our case, several columns are categorical in nature: e.g. Neighborhood is categorical and should not simply be translated to numeric codes by assigning a number to each neighborhood.
Suppose the values in the original data are Red, Yellow, and Green. We create a separate column for each possible value; wherever the original value was Red, we put a 1 in the Red column.
We used the ‘dummyVars’ function in the caret package to transform categorical columns into binary columns; the ‘predict’ function in caret merges the new binary columns back into the dataset.
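In Python, pandas' get_dummies plays the role of dummyVars; the Red/Yellow/Green example from the text:

```python
import pandas as pd

# One column of categorical values -> one 0/1 column per distinct value.
df = pd.DataFrame({"Color": ["Red", "Yellow", "Green", "Red"]})
encoded = pd.get_dummies(df, columns=["Color"], dtype=int)
# Produces Color_Green, Color_Red, Color_Yellow indicator columns.
```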
Transforming the output variable
Without transformation: After transformation:
The dependent variable ‘SalePrice’ was skewed to the right and not normally distributed. The variable did not become normal after a log transformation, but after a square-root transformation the skewness is reduced and the density plot shows an approximately normal distribution, as shown above.
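A numpy sketch of why a square-root transform helps, using a synthetic right-skewed price variable (purely illustrative, not the project's data):

```python
import numpy as np

def skewness(x):
    """Sample skewness: third standardized moment."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

rng = np.random.default_rng(3)
sale_price = rng.lognormal(mean=13, sigma=0.8, size=4000)  # right-skewed stand-in for SalePrice

raw_skew = skewness(sale_price)
sqrt_skew = skewness(np.sqrt(sale_price))
# The square-root transform pulls the long right tail in, lowering the skewness.
```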
Modeling
Model Preparation:
We used two approaches for model data preparation:
1. Train-test data split
2. K-fold cross-validation
Model Tuning (Feedback Model Improvement):
We used Cook's distance to remove high-leverage points and fed the data back to improve model performance, creating the new dataset that we used for building the other models.
(a) Linear Regression:
We created a multiple linear regression model as a base model in R using all the retained numeric and categorical variables. As stated above, the retained variables include the scaled numeric variables and the categorical variables modified through one-hot encoding. We created a custom function in R to measure the adjusted R-squared on the validation (test) data.
We used the data obtained after removing the high-leverage points via Cook's distance as our new dataset, and split it 60:40 into training and validation sets. We used a ‘sqrt’ transformation for the dependent variable ‘sale_price’ as it was right-skewed, and squared the output of the prediction method when comparing it with the sale_price of the validation dataset.
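Cook's distance can be computed from the hat matrix of the OLS fit; below is a numpy sketch of the leverage-removal step on synthetic data (the 4/n cutoff used here is a common rule of thumb, not necessarily the exact one used in the project):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for an OLS fit with intercept."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])        # add intercept column
    H = Xd @ np.linalg.pinv(Xd.T @ Xd) @ Xd.T    # hat matrix
    h = np.diag(H)                               # leverages
    resid = y - H @ y
    p = Xd.shape[1]
    mse = (resid ** 2).sum() / (n - p)
    return (resid ** 2 / (p * mse)) * h / (1 - h) ** 2

rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)
x[0], y[0] = 8.0, -30.0          # plant one high-leverage outlier

d = cooks_distance(x.reshape(-1, 1), y)
keep = d < 4 / len(y)            # drop high-leverage rows, then refit on the rest
```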
Validity of linear regression model:
[Flowchart: input data -> linear regression model -> Cook's distance -> remove leverage points -> improved result]
As we can see from the adjoining figures:
1. The residuals form no visible pattern in the ‘residuals vs fitted’ graph, supporting the assumption of homoscedasticity.
2. The normal Q-Q plot supports the assumption that the variables are multivariate normal.
The linear regression model gave an adjusted R-squared of 68.87% on the validation dataset. Re-running the same model may change the result if the seed is not set.
(b) Regularized Regression: Lasso and Ridge:
We used regularized regression models, viz. Lasso (L1 regularization) and Ridge (L2 regularization), to check and improve the model. We used the ‘caret’ package in R to train the models and check validation performance, with the parameters below.
Lasso:
We created a sequence of lambdas from 0 to 1, increasing by 0.01. The alpha value is held constant at 1 for Lasso. We used the ‘sqrt’ function to transform the dependent variable in the model and squared the prediction results calculated on the validation dataset.
Method: Repeated CV
Number: 10
Repeats: 10
Metric: RMSE
Alpha: 1
Lambda: 0 to 1
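An equivalent search in Python is sklearn's GridSearchCV over a Lasso alpha grid (alpha plays the role of caret's lambda; 0 is excluded because sklearn's Lasso requires alpha > 0; the data here is synthetic):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 8))
beta = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])  # only 3 informative columns
y = X @ beta + rng.normal(scale=0.5, size=300)

# Grid of penalties from 0.01 to 1 by 0.01, scored by CV RMSE as in the report.
grid = GridSearchCV(
    Lasso(max_iter=10_000),
    {"alpha": np.arange(0.01, 1.01, 0.01)},
    cv=10,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)
best_alpha = grid.best_params_["alpha"]
coef = grid.best_estimator_.coef_   # L1 penalty zeroes out the uninformative columns
```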
Tuning Parameter:
As we can see from the above graph, lambda = 0.02 gave the minimum RMSE and MAE as well as the maximum R-squared value. The Lasso model gave an adjusted R-squared of 68.80% on the validation dataset.
Ridge:
We used lambda = 0.02, obtained from the Lasso tuning. The alpha value is held constant at 0 for Ridge. We used the ‘sqrt’ function to transform the dependent variable in the model and squared the prediction results calculated on the validation dataset.
The Ridge model gave an adjusted R-squared of 68.69% on the validation dataset.
(c) Random Forest:
We created a random forest model in R with a 60:40 training/validation split. We initially ran the model with ntree = 600 to check how performance changes with the number of trees.
Method: Repeated CV
Number: 10
Repeats: 10
Metric: RMSE
Alpha: 1
Lambda: 0.02
Parameter Selection:
1. Number of trees: the graph plots the number of trees against the associated error. The error rate saturates beyond ntree > 200, decreasing only slowly as more trees are added.
2. Number of columns in each iteration: by default, the random forest takes sqrt(number of independent columns) to create the trees in each iteration.
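A sklearn sketch of the same two knobs on synthetic data (n_estimators corresponds to ntree, and max_features="sqrt" to R's default mtry):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 9))
y = 3.0 * X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(scale=0.3, size=400)

rf = RandomForestRegressor(
    n_estimators=200,      # error rate largely saturates past a few hundred trees
    max_features="sqrt",   # sqrt(#predictors) columns tried at each split, as in R
    random_state=0,
)
rf.fit(X, y)
top_feature = int(np.argmax(rf.feature_importances_))  # variable importance ranking
```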
Variable Importance:
The image shows the various columns and their contributions to the model. The contributions indicate how much each column adds to node purity across the trees of the random forest. A high contribution does not necessarily mean a positive correlation with the dependent variable; the effect can be positive or negative.
The variable importance graph, together with the graphs created during EDA, helps determine the size and direction of each variable's impact on the dependent variable.
(d) Boosted Tree (XGBoost):
We used the xgboost package in R to create the boosted tree, with a 60:40 training/validation split. The boosted tree was created using the gradient boosting technique. We applied multiple techniques to fine-tune the model; the final model was an improved version of the baseline boosted model. After trying multiple parameter settings, the following parameters were finalized as they gave better results than the others.
Parameter: Value (Description)
booster: gbtree (tree-based model)
objective: reg:linear (linear regression objective)
colsample_bytree: 0.2 (subsample ratio of columns)
eta: 0.01 (learning rate)
min_child_weight: 2 (minimum number of instances needed in each node)
max_depth: 4 (maximum depth of a tree)
alpha: 0.3 (L1 regularization)
lambda: 0.8 (L2 regularization)
gamma: 0.01 (minimum loss reduction)
subsample: 0.8 (subsample ratio of training instances)
silent: TRUE (TRUE = silent mode, FALSE = print messages)
eval_metric: rmse (evaluation metric)
We used cross-validation to examine our model. We started with a large number (10,000) of trees and let the algorithm report the best number of decision trees for the final model. We used early stopping in xgboost since we were not sure how many trees we would need. Once we obtained the best number of iterations (5,856), we trained our model for that number of trees, keeping everything else constant and lowering the early-stopping point.
The boosted tree gave an adjusted R-squared of 70.37%.
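The project used xgboost in R; the same ask-for-too-many-trees-and-stop-early idea can be sketched with sklearn's GradientBoostingRegressor on synthetic data (the parameters here are illustrative, not the project's finalized values):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
X = rng.normal(size=(600, 6))
y = X[:, 0] ** 2 + 2.0 * X[:, 1] + rng.normal(scale=0.2, size=600)

gb = GradientBoostingRegressor(
    n_estimators=5000,         # deliberately far more rounds than needed
    learning_rate=0.01,        # plays the role of xgboost's eta
    max_depth=4,
    subsample=0.8,
    validation_fraction=0.2,   # held-out slice watched for early stopping
    n_iter_no_change=50,       # stop after 50 rounds without improvement
    random_state=0,
)
gb.fit(X, y)
rounds_used = gb.n_estimators_   # trees actually grown before early stopping fired
```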
Model Comparison
Model: R-squared (%)
Linear Regression Full Model: 68.15
LASSO Regression: 68.41
RIDGE Regression: 68.22
Random Forest: 69.73
Boosted Tree: 70.37
As we observe, the best model is the Boosted Tree, with an R-squared of 70.37%.
Learning Curve
The graph shows the learning curve for our best model. The train error increases with training data size, while the test error decreases gradually as the training data size grows.
The crossover point shows that data beyond 50% of the dataset would not have any significant impact on the model.
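The shape of such a learning curve can be reproduced with a small sklearn sketch on synthetic data: fit on growing fractions of the training set and track both errors:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)
X = rng.normal(size=(1000, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=1000)
Xtr, Xte, ytr, yte = X[:800], X[800:], y[:800], y[800:]

train_mse, test_mse = [], []
for frac in (0.01, 0.05, 0.1, 0.25, 0.5, 1.0):
    n = int(800 * frac)
    m = LinearRegression().fit(Xtr[:n], ytr[:n])
    train_mse.append(float(np.mean((m.predict(Xtr[:n]) - ytr[:n]) ** 2)))
    test_mse.append(float(np.mean((m.predict(Xte) - yte) ** 2)))
# With few samples the model memorizes (low train error, high test error);
# as data grows, the two curves converge toward the noise floor.
```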
Ways to improve the analysis
Availability of certain other variables, such as natural-disaster-prone areas, unemployment rate, crime rate, and ethnicity, could improve the analysis and provide more insights.
We also did an analysis of GDP data vs. sales price, as shown in the graph.
Recommendations
1) Recommendations for customers who are investors:
The assessed value is a clear indicator of the sales price, and the trend is positively linear. The investor can take this into account: for every $2,000 increase in the total assessment, the sales price increases by $750,000, which is around a 125% increase over the current value for every $2,000 increase in assessed total.
Investors should take into account that if a place is being developed residentially, for example if there is only 1 residential unit and 2 more units are being built there, the sales price can increase by 150% of the current value.
Notably, the investor does not necessarily need to spend extra on a building with a garage.
Also, buying into a building which has 1 floor now but plans to build 3 floors should fetch 250% of the current sales price.
[Figure: Learning Curve, train vs. test error against training-set fraction (x axis 0 to 0.7)]
2) Recommendations for Builders:
The sales price of a building increases sharply as the building front is increased from 12 feet to 40 feet and remains constant afterwards, so the optimum building front a builder can look for is 40 feet, which can fetch a sales price of $300,000.
The sale price increases with the number of floors within the range of 1 to 4, so a builder can optimize the price at 4 floors.
A builder does not necessarily need to spend resources on providing a garage to increase the output, since there is no relationship between the two.
Plot area and sale price follow an almost positive linear relationship in the 1,000-7,000 sq. ft. range, so the optimal floor area a builder can plan for is 7,000 sq. ft.
Also, since the price increases with residential units, a builder can look for an area with 6-7 residential units to maximize the sales price.
Appendix:
https://www.kaggle.com/tianhwu/brooklynhomes2003to2017
glossary_finance_NY.pdf
-------------------------------------------------------------------END-----------------------------------------------------------------------------