The Role of Data Science in Real Estate

Introductions
Account Executive at CARTO
CEO & Data Scientist at Geolytix
Solutions Engineer at CARTO

Which community are you here from?

Which use cases are you typically focused on?
● Investment Analysis
● Indoor Mapping
● Site Planning
● Market Analysis
● Trade Area Analysis
● Pricing Optimization

Industry Participants
Seniority
● 43% individual contributors
● 45% mid management
● 12% senior management
Region
● North America 47%
● EMEA 38.6%
● APAC 9.8%
● Latin America 4.5%

Data Science and GIS teams at organizations

How many Data Scientists actually know about spatial?

Which technologies and languages are preferred in the industry?

Where do SDS spend time?
● Discovering data useful for their analysis
● Evaluating and purchasing data
● ETLing the data into common structures
● Analyzing, doing feature extraction and modeling
30% 30% 20% 20%

An example: a Data Scientist needed demographics and zip code data for Portugal to perform a particular market analysis.

80% of participants believe that it is difficult or very difficult to hire Data Scientists with expertise in spatial analysis
1. Strong background in statistics
2. Extensive experience in coding skills relating to Data Science (Spark, SQL, Python, R, TensorFlow, PyTorch)
3. Experience developing production-quality data products using the results of quantitative research
4. Extensive experience in data visualization (in Python and R or other applications)
5. Effective application of Data Science workflows to business problems, and the ability to tell a story around results
6. Familiarity with data pipelines and ETL practices (Airflow, scheduled notebooks, Google Dataflow, etc.)
7. Familiarity with neural networks and deep learning (e.g. TensorFlow, PyTorch)
8. Experience working with distributed computing systems like Spark or Google BigQuery
9. Experience working with GIS software such as CARTO, QGIS, or ArcGIS

How difficult is it to find the right software and data?
47% of participants do not find it challenging to identify the right software & data to support Spatial Data Science projects

How will investment in Spatial Data Science initiatives expand?
68% of organizations are likely to increase their investment in Spatial Data Science in the next 2 years

New data & new use cases.
Let’s discuss!

Real Estate Meetup
Jaime Sánchez, Solutions & Customer Success at CARTO

The Sum of Our Parts
The Complete Journey
As an organization, we have defined 5 steps that, together, create a holistic Location Intelligence approach.
Our goal is to empower organizations as they traverse each of these 5 steps.

Spatial analysis in 5 key steps:
1. Data Ingestion & Management: Clean, geocode and visualize your data.
2. Data Enrichment: Using 3rd party datasets, ideally on standardized spatial aggregations, to reduce your time to insight.
3. Analysis: Clustering, outliers analysis, time series predictions, geographically weighted regression, and change of spatial support.
4. Solutions & Visualization: WebGL for big datasets, dashboards, widgets, apps.
5. Integration: Productize model into a web service API. Feed back into Data Warehouse, LoB systems, consumers, others.

1. Data Ingestion & Management
● Spatial database with multiple ways to connect
and manipulate your data
● Dynamic data in the cloud and multiple data sources: local and remote files, cloud storage, other databases, and more
● Fully managed database with automatic backups
and regular upgrades
● Enterprise data sharing and access across CARTO
Wide support for geospatial formats (inc. Shapefiles, KML, KMZ, GeoJSON, GPX, OSM, GeoPackage, GDB, CSV, Excel or OpenDocument).
Plug-ready database connectors (ArcGIS Server, DB connectors via APIs (MySQL, PostgreSQL, Microsoft SQL Server, Hive on request)).

2. Data Enrichment
● Save time in gathering spatial data, augmenting
your existing data with new location data
streams from across the globe
● Create locations from addresses and
understand travel time all from within CARTO
● Develop robust ETL processes and update
mechanisms so your data is always enriched
● Premium data to understand and analyze
deeper trends and behavior

3. Analysis
● Bring maps and data into your Data Science workflows and the Python data science ecosystem with CARTOframes
● Machine learning embedded in CARTO as simple SQL calls for clustering, outliers analysis, time series predictions, and geographically weighted regression
● Use the power of PostGIS and our APIs to productionize analysis workflows in your CARTO platform

4. Solutions & Visualization
● Develop and build custom applications with a
full suite of frontend libraries.
● Work with CARTO’s Professional Services and
Support team as and when you need it.
● Create lightweight, intuitive dashboards for
simple sharing of insights across your
organization.

5. Integration
● Using CARTO’s APIs and SDKs, connect your
analysis into the places that matter most for
you and your team.
● Bring CARTO to other data destinations, such as
desktop GIS and BI tools.
● Embed CARTO inside other tools, such as
Salesforce Einstein Analytics or Qlik Sense.
● Work with our Professional Services team for custom configurations or developments.

Let’s apply the journey to
a real world business
question!

How can we analyze and
understand real estate sales
in Los Angeles?

Pains
1. “Disconnected experiences to consume data: it is broken into separate tools, teams, DBs and Excel files.”
2. “Limited developer time in our team.”
3. “Current data science workflow doesn’t have a geo focus, and spatial modeling is cumbersome because I have to export results to XYZ tool in order to visualize and test my model effectively.”
4. “Having trouble handling and visualizing big datasets.”

Outline the Process
1. Integrate spatial data of past home sales and property locations in Los Angeles County
2. Enrich the data with spatial context using a variety of relevant resources (demographics, Mastercard transactions, OSM)
3. Clean and analyze the data, and create a predictive model for homes that have not sold
4. Present the results in a Location Intelligence solution for users
5. Integrate and deploy the model into current workflows for day-to-day use

Integrate LA Housing Data
The Los Angeles County Assessor's office provides two different datasets which we can use for this analysis:
● All Property Parcels in Los Angeles County for Record Year 2018
● All Property Sales from 2017 to Present

Clean and join the data on a unique identifier using SQL:

-- Join every sale to its 2018 parcel record on ain, the parcel
-- identifier shared by both tables. Parcel columns that would clash
-- with columns already in s.* are suffixed with _p.
CREATE TABLE la_join AS
SELECT s.*,
       p.zipcode AS zipcode_p,
       p.taxratearea_city,
       p.ain AS ain_p,
       p.rollyear,
       p.taxratearea,
       p.assessorid,
       p.propertylocation,
       p.propertytype,
       p.propertyusecode,
       p.generalusetype,
       p.specificusetype,
       p.specificusedetail1,
       p.specificusedetail2,
       p.totbuildingdatalines,
       p.yearbuilt AS yearbuilt_p,
       p.effectiveyearbuilt,
       p.sqftmain,
       p.bedrooms AS bedrooms_p,
       p.bathrooms AS bathrooms_p,
       p.units,
       p.recordingdate,
       p.landvalue,
       p.landbaseyear,
       p.improvementvalue,
       p.impbaseyear,
       p.the_geom AS centroid
FROM sales_parcels s
-- LEFT JOIN keeps sales with no matching parcel; the cast aligns the
-- type of s.ain with the numeric p.ain.
LEFT JOIN assessor_parcels_data_2018 p ON s.ain::numeric = p.ain;
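The behaviour of the join above can be tried out on a tiny scale. The following is a minimal sketch using Python's built-in sqlite3 module rather than PostGIS; the three sales rows, two parcel rows and the reduced column set are hypothetical stand-ins for the Assessor tables, and SQLite's CAST replaces the PostgreSQL `::numeric` cast.

```python
import sqlite3

# Hypothetical miniature versions of the two tables (made-up values).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_parcels (ain TEXT, sale_price REAL);
    CREATE TABLE assessor_parcels_data_2018 (ain INTEGER, yearbuilt INTEGER);
    INSERT INTO sales_parcels VALUES
        ('1001', 750000), ('1002', 520000), ('9999', 310000);
    INSERT INTO assessor_parcels_data_2018 VALUES
        (1001, 1956), (1002, 1988);
""")

# LEFT JOIN keeps every sale, even when no parcel record matches.
rows = conn.execute("""
    SELECT s.ain, s.sale_price, p.yearbuilt AS yearbuilt_p
    FROM sales_parcels s
    LEFT JOIN assessor_parcels_data_2018 p
        ON CAST(s.ain AS INTEGER) = p.ain
    ORDER BY s.ain
""").fetchall()

for row in rows:
    print(row)
```

The sale with ain '9999' has no parcel record, so its parcel columns come back empty instead of the row being dropped. That is worth checking before modeling, since unmatched sales carry no property attributes.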

Enrich LA Housing Data
Next we want to add spatial context to our housing data to understand more about the surrounding areas:
● Demographics
● Mastercard (Scores and Merchants) (Nearest 5 Areas)
● Nearby Grocery Stores and Restaurants
● Proximity to Roads

Demographics
Add total population and median income from the US Census

Mastercard
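Find the merchants and sales/growth scores in the five nearest block groups to the home via Mastercard Retail Location Insights data

-- Two columns from a larger SELECT over la_eval_clean (the outer
-- query is not shown on the slide). Each subquery averages a score
-- over the 5 block groups nearest to the home; <-> is the PostGIS
-- distance operator used for nearest-neighbor ordering.
(
  SELECT AVG(sales_metro_score)
  FROM (
    SELECT sales_metro_score
    FROM mc_blocks
    ORDER BY la_eval_clean.the_geom <-> mc_blocks.the_geom
    LIMIT 5
  ) a
) AS sale_metro_score_knn,
(
  SELECT AVG(growth_metro_score)
  FROM (
    SELECT growth_metro_score
    FROM mc_blocks
    ORDER BY la_eval_clean.the_geom <-> mc_blocks.the_geom
    LIMIT 5
  ) a
) AS growth_metro_score_knn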

Grocery Stores/Restaurants
Find the number of grocery stores and restaurants using
OpenStreetMap Data and the SQL API.

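-- Two columns from a larger SELECT over la_eval_clean. Each counts
-- the points within one mile (1609 m) of the home. Web Mercator
-- stretches distances by 1 / cos(latitude), so the radius is scaled
-- by the same factor.
(
  SELECT count(restaurants_la.*)
  FROM restaurants_la
  WHERE ST_DWithin(
    ST_Centroid(la_eval_clean.the_geom_webmercator),
    restaurants_la.the_geom_webmercator,
    1609 / cos(radians(ST_Y(ST_Centroid(la_eval_clean.the_geom)))))
) AS restaurants,
(
  SELECT count(grocery_la.*)
  FROM grocery_la
  WHERE ST_DWithin(
    ST_Centroid(la_eval_clean.the_geom_webmercator),
    grocery_la.the_geom_webmercator,
    -- The radius argument is cut off on the slide; assumed to match
    -- the restaurants subquery.
    1609 / cos(radians(ST_Y(ST_Centroid(la_eval_clean.the_geom)))))
) AS grocery_stores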

Roads
See if a home is within one mile of a major highway or trunk highway
using the SQL API and major roads from OpenStreetMap.

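-- Column from a larger SELECT over la_eval_clean: 1 if any motorway
-- or trunk road lies within one mile of the home, else 0.
(
  SELECT CASE WHEN COUNT(la_roads.*) > 0 THEN 1 ELSE 0 END
  FROM la_roads
  WHERE ST_DWithin(
    la_eval_clean.the_geom_webmercator,
    la_roads.the_geom_webmercator,
    -- The radius argument is cut off on the slide; assumed to be the
    -- same one-mile expression used for restaurants.
    1609 / cos(radians(ST_Y(ST_Centroid(la_eval_clean.the_geom)))))
  AND highway IN ('motorway', 'trunk')
) AS highways_in_1mile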
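The `1609 / cos(radians(latitude))` expression used in the restaurant query deserves a short illustration. Web Mercator units are stretched relative to ground meters by roughly 1 / cos(latitude), so a one-mile ground radius has to be enlarged by that factor before it is compared against Web Mercator coordinates. A minimal sketch in Python follows; the function name is hypothetical, and the latitude of about 34.05 degrees for Los Angeles is an assumption used only for the example.

```python
import math

def web_mercator_radius(ground_meters, latitude_deg):
    """Convert a ground distance in meters into Web Mercator units.

    Web Mercator exaggerates distances by 1 / cos(latitude), so the
    search radius is divided by cos(latitude) to compensate.
    """
    return ground_meters / math.cos(math.radians(latitude_deg))

# One mile (1609 m, as on the slide) at an assumed latitude for Los Angeles.
radius_la = web_mercator_radius(1609, 34.05)
print(round(radius_la))
```

At the equator the correction is nil, and it grows with latitude. Around Los Angeles the radius passed to ST_DWithin ends up roughly 20% larger than the nominal 1609.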

Analysis
The analysis for this project followed these steps:
● Moran’s I Clusters & Outliers (Exploratory Data Analysis)
● Neighbor Homes Analysis (Spatial Feature Engineering)
● Predictive Modeling & Hyperparameter Tuning (using XGBoost)

Moran’s I
Using Moran’s I to evaluate spatial clusters and outliers via the PySAL
package, we can see these groupings and visualize them in
CARTOframes.
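To make the statistic concrete, here is a minimal pure-Python sketch of the global Moran's I. It is not the PySAL API used in the talk, and it does not compute the local cluster and outlier labels; the chain-shaped weights matrix and the values are made up for illustration.

```python
def morans_i(values, weights):
    """Global Moran's I for `values` under a spatial weights matrix.

    values  -- list of n numbers (e.g. sale prices)
    weights -- n x n list of lists; weights[i][j] > 0 when i and j are
               neighbors, with zeros on the diagonal
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]                      # deviations from the mean
    s0 = sum(sum(row) for row in weights)               # total weight
    cross = sum(weights[i][j] * z[i] * z[j]
                for i in range(n) for j in range(n))    # neighbor co-variation
    variance_sum = sum(d * d for d in z)
    return (n / s0) * (cross / variance_sum)

# Four locations in a row; each is a neighbor of the adjacent ones.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
print(morans_i([1, 2, 3, 4], chain))    # smooth trend: positive
print(morans_i([1, -1, 1, -1], chain))  # alternating values: negative
```

Positive values indicate that similar values sit next to each other (clusters), while negative values indicate a checkerboard-like pattern in which neighbors differ (outliers).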

Neighbor Analysis
Evaluate the attributes of
neighbor properties using
k-nearest neighbor spatial
weights in PySAL to perform
spatial feature engineering.

This captures how the attributes of your neighbors, and the wider spatial context, influence the price of your home.
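One simple form of this kind of spatial feature engineering is a neighbor average, sometimes called a spatial lag. The sketch below is plain Python rather than PySAL; the function name, coordinates and prices are hypothetical. For each home it computes the mean value of its k nearest other homes, which can then be added as a model feature.

```python
import math

def knn_lag(points, values, k=3):
    """Mean value of the k nearest other points, for every point.

    points -- list of (x, y) tuples in a projected coordinate system
    values -- list of numbers aligned with `points` (e.g. sale prices)
    """
    if len(points) < 2:
        raise ValueError("need at least two points to have neighbors")
    lags = []
    for i, (xi, yi) in enumerate(points):
        # Distance from point i to every other point, paired with its value.
        others = [(math.hypot(xi - xj, yi - yj), values[j])
                  for j, (xj, yj) in enumerate(points) if j != i]
        others.sort(key=lambda pair: pair[0])
        nearest = others[:k]
        lags.append(sum(v for _, v in nearest) / len(nearest))
    return lags

# Hypothetical homes along a street, with made-up prices.
homes = [(0, 0), (1, 0), (2, 0), (10, 0)]
prices = [100, 200, 300, 900]
print(knn_lag(homes, prices, k=2))
```

Excluding the home itself from its own neighborhood matters: otherwise the target value would leak into the feature used to predict it.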

Predictive Modeling
Using XGBoost we can use this data to create a regression model to
predict housing prices and push that data back to CARTO using
CARTOframes, never leaving the notebook environment.

Modeling workflow (diagram): Past Sales (Sale Price); Spatial Data Enrichment; Spatial Feature Engineering (analyze the values of nearest neighbor sales, clusters of high Mastercard areas, proximity to features); Spatial Modeling; Train & Test Model; Predictions

Predictive Modeling
After hyperparameter tuning the model, we can reduce the Mean Absolute Error to $58,179.78.
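For reference, the error metric quoted here is normally defined as the mean absolute error (MAE): the average size of the gap between predicted and actual prices, in dollars. A minimal sketch follows; the three prices are made up and unrelated to the model in the talk.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical sale prices and model predictions, in dollars.
actual_prices = [500_000, 750_000, 620_000]
predicted_prices = [480_000, 800_000, 620_000]
print(mean_absolute_error(actual_prices, predicted_prices))
```

Because MAE is expressed in the same unit as the target, a value of about $58,000 reads directly as the typical miss per home.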

Feature Importance

Solutions
To present the data and predictive analysis, both for homes that have a sales price and for homes that have not sold, we can develop a location intelligence application to showcase these results.

Los-Angeles Prediction Explorer

Application Development
Deploy the model via a Python-based API and sync it with the data to perform on-the-fly predictions for specific properties.

Other Use Cases
● Predict revenue from different physical retail locations
● Identify clusters and groups of specific patterns to optimize activities such as sales outreach or site selection
● Classify property types or buying patterns in a city
● Review spatial feature importance for site performance, and modify models using different spatial components
● Find areas with similar behavioral patterns

Similarity Analysis
We built a model to identify areas with similar behavior patterns based on footfall, socioeconomic and financial data, and more. The similarity score is modeled as follows:
● Distance between cells is calculated with an L2 norm on a Principal Component space.
● Uncertainty due to missing values and the dimension of the PC space is tackled following an ensemble probabilistic approach.
● Similarity Score = Continuous Ranked Probability Skill Score.
By enriching the data with other sources, this model can be used for Site Planning, Investment Analysis, etc.

POPULATION, HOUSEHOLD INCOME, VISITORS, TRANSACTIONS
109 S 5th Street Brooklyn NY 11249

DEMOGRAPHICS
● Population
● Household spending
● Household income
HUMAN MOBILITY
● Number of visitors
ROAD TRAFFIC
● Number of vehicles
FINANCIAL
● Ticket size
● Number of transactions
POIs
● Offices
● Shops
● Transport

Let’s see the Python notebook….

The Role of Data Science in Real Estate

INTRODUCTION
Network strategy, location planning, omnichannel analysis, spatial modelling
Our whole business is about location planning. As trusted advisors we help our customers decide how many stores, who to acquire, where to open, which format and how to optimise home delivery and click & collect operations.
Team of 36 location specialists to work collaboratively with your business
Led in-house location planning for major global retailers. Experts in spatial modelling, forecasting, web development and systems. Create innovative new datasets for local markets.
Growing to a global company
Offices in London, Leeds, Warsaw, Dortmund, Shanghai, Tokyo and Melbourne

OUR OFFER
1. DATA
2. MODEL
3. TOOL

HISTORY
Year | Clients | Key Events | Team
2012 | Sainsbury’s, Whole Foods | Foundation | 1
2013 | ASDA, Boots, Waitrose | ASDA project transformative, enables growth. Build key datasets. | 4
2014 | Post Office, Camelot, Barclays | New multi-year deals giving confidence. Take office space. Evolve data offer. | 6
2015 | Amazon, Swinton, Savills | Growth in ‘adjacent’ spaces. Invest in capacity and recurring revenue growth. | 10
2016 | M&S, TRG, EE, On the market | Growing & diversifying the client list. Exploring innovative global DAAS solutions. | 15
2017 | Adidas, Rightmove, Dominos | Growth in international markets, Shanghai & Tokyo offices open. Leeds office opens in the UK. | 18
2018 | Costa, Dr Martens | Warsaw office opens. Launched MAPP, our online map-based analytical tool. | 24
2019 | Lego, Starbucks | Melbourne office opens. Large multi-country, multi-brand advice. | 35

REAL ESTATE
FOUNDATIONAL FEATURES
• Decisions are complex and outcomes only become
clear over years
• Choices are multi-faceted and driven by dynamic
competing interests
• Key information is tightly held
• The amounts of money involved are vast
• Decisions are hard to undo
• “Retailers make few decisions that are as
permanent and unforgiving as selecting store
locations.”

SOME HISTORY
SPATIAL DATA SCIENCE
• William Playfair – 1780s
• Charles Minard – 1830s
• John Snow – 1850s
• Charles Booth – 1890s
• Roger Tomlinson – 1960s
• Arthur Samuel – 1950s
• David Huff – 1970s

SOME ISSUES
UNDERPINNING STATISTICAL ISSUES
Samples are not randomly drawn from the variable space
Items from within the sample influence each other
Hardly any variables are normally distributed
These three features fatally wound pretty much every standard statistical approach
WHAT DATA SCIENCE CAN DO
Describe things
Classify stuff
Predict responses
What we really want to do is to predict the future

THE GP ANALOGUE
Diagnose the business problem
Bring in the specialists if you need them
(e.g. algorithm/model creation)
Communicate to business stakeholders
Support decision making

WHAT IS MACHINE LEARNING?
● “Machine Learning” refers to the field of study that gives computers the ability to learn without
being explicitly programmed (Samuel, 1959)
● In practical terms, a series of different algorithms can be applied to detect patterns in data
(including big data), which can lead to actionable insights
● Common machine learning applications include (not an exhaustive list):
● Regression Forecasting (e.g. sales forecasts)
● Classification or Clustering (e.g. segmentation or image classification)
● Association Rule Learning (interesting relations; e.g. which other products are you likely to
buy based on your other purchases?)
● Reinforcement Learning (e.g. chess AI)

IT’S NOT NEW
● Machine learning is not a new concept… In 1952
Arthur Samuel wrote the first computer program
which learned as it ran
● First neural network to solve a real world problem
was designed in 1959 (an adaptive filter to remove
echoes from phone lines)
● So if ML isn’t new, why is it becoming so popular
now?

COMMON TYPES OF MACHINE LEARNING ALGORITHMS
● Supervised Learning:
● The user (human) teaches the algorithm by providing it with input data and a sample of
result data (e.g. x = input features, y = actual sales)
● The algorithm then attempts to learn from the input data how best to predict a result (e.g.
predict sales)
● Unsupervised Learning:
● The computer is trained with unlabelled data; there is no teacher
● This family of machine learning algorithms is useful for pattern recognition and rule detection
● Semi-supervised Learning:
● A combination of supervised and unsupervised methods
● Reinforcement Learning:
● Maximises reward and minimises risk, iteratively learning from the environment
● Determines ideal behaviour within specific contexts
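As a small illustration of the supervised case described above, the sketch below fits a straight line to labelled examples (x = an input feature, y = actual sales) by ordinary least squares and then predicts sales for an unseen input. The function names and the numbers are hypothetical, and real forecasting models would use many features and a dedicated library.

```python
def fit_line(xs, ys):
    """Learn slope and intercept from labelled examples by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply the learned line to a new input."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical training data: store size (x) and actual sales (y).
store_size = [1, 2, 3, 4]
actual_sales = [3, 5, 7, 9]
model = fit_line(store_size, actual_sales)
print(predict(model, 5))
```

The "teaching" is the pairing of each input with its known result; the algorithm only generalises from those pairs to inputs it has not seen.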

USE CASES WITHIN LOCATION PLANNING
● So what’s the catch!?
● How can we use this in location planning/real estate!?

EXAMPLES OF USE CASES
● Using K-nearest neighbour to create demographic segmentations, based on known customer
data
● Learning about key drivers of success by examining feature importance
● Building forecasting models to predict sales based on property location
● Using NLP for categorising customer comments
● Etc.
One of the more interesting solutions we’ve used recently combines traditional methods with
machine learning…

GROCERY GRAVITY MODEL
● Gravity models are common practice within the grocery retail location planning/real estate space
● It is important for grocers to understand which locations would be ideal for a new supermarket,
but also to understand the impact this might have on existing locations and competitors…
Gravity Model in a nutshell:
● Based on the theory of gravity
● More attractive destinations have a greater ‘pull’
● Attraction is linked to distance
● Using customer data we know how far people
actually travel to their chosen stores
● Fundamental concept is logical, and simple to
understand
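The gravity idea summarised above can be written down in a few lines. The sketch below uses a Huff-style formulation, in which the probability that a customer chooses a store is its attractiveness divided by distance raised to a power, normalised over all competing stores. This is a textbook form rather than the specific model described in this talk; the function name, store sizes, drive times and the exponent of 2 are assumptions for illustration.

```python
def huff_probabilities(attractiveness, distances, beta=2.0):
    """Probability of a customer choosing each store.

    attractiveness -- one positive number per store (e.g. floor space)
    distances      -- distance or drive time from the customer to each
                      store; must be greater than zero
    beta           -- distance decay exponent (assumed value)
    """
    pull = [a / (d ** beta) for a, d in zip(attractiveness, distances)]
    total = sum(pull)
    return [p / total for p in pull]

# Hypothetical customer with two supermarkets to choose from:
# a small store 1 unit away and a store twice the size 2 units away.
shares = huff_probabilities([1000, 2000], [1, 2])
print(shares)
```

Adding a proposed store to the lists and recomputing the shares shows both what the new site would capture and how much it would take from existing stores, which is the impact question raised above.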

● Gravity models are often very accurate at estimating
customer patterns and interactions at close range…
● However, this accuracy usually wanes as you try to
model sales from further afield:
● Consumers’ decisions are much harder to understand
● Consumers have more choice
● Are they workers or residents?
● Decision is not as simple as “I’ll just pop
into my nearest, most attractive
supermarket”…

● The solution… To use machine learning to create an
estimate for ‘Sales beyond 30mins’
● Created a datamart for each property in the portfolio (see
opposite) and tested various machine learning algorithms
to see if we could more accurately predict sales than
previously
● Eventually settled on a neural network
It’s not as easily interpretable, but gives better results
on interactions which are inherently difficult to
understand anyway!
√ +20% of store sales beyond 30 minutes drivetime were
more accurately predicted
√ R² increased by 0.25 for beyond sales
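For readers less familiar with the metric, R² measures the share of the variation in actual sales that the model's predictions account for, so an increase of 0.25 is a large step. A minimal sketch of the usual definition follows, with made-up numbers that are unrelated to the results above.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical sales figures.
sales = [1, 2, 3]
print(r_squared(sales, [1, 2, 3]))  # perfect predictions
print(r_squared(sales, [2, 2, 2]))  # always predicting the mean
```

A model that always predicts the average scores 0 and a perfect model scores 1, which gives a scale for reading the improvement.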

EXAMPLE FASHION CLIENT
OBJECTIVE
With six stores operating in Hong
Kong, Dr. Martens wanted to
understand how high the achievable
turnover is at each location.
Additionally, an understanding of the
best locations for new stores was
required as part of a future store
investment roadmap.
RESULT
We created new datasets and a
bespoke model to calculate sales
potentials for the existing store
network. The model was then used in
a future opportunity scan to identify
the best locations for new stores in
Hong Kong.
STRATEGY
MODEL DEMAND: Calculate how
much people spend on footwear at the
lowest possible geography
MAP THE RETAIL LANDSCAPE:
Understand the locations where
retailers cluster in Hong Kong
SALES POTENTIAL: Calculate how
much turnover is achievable at a retail
venue (e.g. Mall) and individual store
level
OPPORTUNITY SCAN: Use the
developed model and data sets to find
ideal locations for the next Dr. Martens
stores in Hong Kong

EXAMPLE F&B CLIENT
OBJECTIVE
Support the Dominos strategy to be the number one pizza company in each neighbourhood with a focus on franchisee profitability.
RESULT
“Our work with GEOLYTIX has enabled us to form a consistent approach to new site forecasting, step changing our understanding of customers catchments and improving our ability to understand regional and store performance. The collaborative approach has resulted in us being able to make decisions around our future location strategy and form ideal network blueprints with significantly increased confidence.”
Craig Donnellan, Head of Location Planning, Dominos Pizza.
STRATEGY
• Understand the true drivers of store performance & the impact on nearby stores of opening new sites.
• Predict new store sales and cannibalisation using a consistent, transparent fact base and model.
• Improve the efficiency of the store forecasting process to allow more time for the value-add.
• Deliver the ideal network blueprint and optimum network strategy.

EXAMPLE RETAIL CLIENT: FOOD, FASHION & HOME
OBJECTIVE
Support a step change in the roll-out of the Food estate,
understand the drivers of performance for the Clothing &
Home estate and recommend the optimum network
blueprint.
RESULT
“GEOLYTIX have worked with us to create a bespoke
toolset enabling us to proactively set our strategy and
quickly answer any What if scenarios. Their analysis and
recommendations have provided us with a consistent
evidence base from which to make our network
decisions."
STRATEGY
• Create an efficient selection & sales forecasting
process, based on a rigorous, objective fact base and
a consistent approach.
• Understand the drivers and catchments of the
Clothing & Home estate, in order to build optimal
networks.
• Integrate custom models with existing data and
software to create the M&S modelling toolkit.
• Bulk run multiple national and regional scenarios to
guide network strategy and create future blueprints.

EXAMPLE REAL ESTATE ADVISOR PROJECT
OBJECTIVE
Data / analytical support in evaluating potential acquisition opportunities and ongoing asset
management of retail assets.
STRATEGY
• Creation of town centre & grocery gravity models to assess:
• Catchment profiles and fit to various potential
new occupiers
• Impacts of new greenfield developments and
centre remodels
• Existing retailer chain performance and potential
‘best next’ opportunities
• Ad-hoc consultancy support
• Assisting with major M&A and liquidity event
support
• Detailed asset reports including site visits to
support redevelopment
RESULT
• We provide access to our data and models through
a desktop GIS reporting tool which allows for:
• Ad hoc area demographic reporting
• Retail presence and chain list reports
• Analogue tool to find similar locations
• Drive time reporting
• Bespoke ‘client ready’ site and area reports

WHY GEOLYTIX
World Class Modelling. We have delivered optimisation models for many of the most successful organisations
in the world, across multiple sectors.
Innovation. The Queen’s Award for Innovation reflects our passion for being on the leading edge of new data,
technology, and ideas.
Practical Senior-Level Experience. We are practical operators, with Director-level experience in property teams
of some of the UK’s largest companies.
Technical Expertise. We are experienced data scientists, sales forecasting modellers and spatial web
application developers, and will build a bespoke solution. Every element of our solution, from the analysis to the
platform, will be specifically designed to meet your specific requirements.
We are global. Accounting for the often vast differences in structure, maturity and data availability and quality we
are able to apply a consistent approach across territories in order to support fact-based decisions.
Proven Track Record. We have delivered similar solutions many times before. We will deliver to spec, to time,
and to a fixed budget.
Genuine Partnership. Our commitment is to work closely with you through to deployment, and maintain the
support and relationship beyond.

Thank you!
lrutherford@cartodb.com
CEO & Data Scientist at Geolytix
jsanchez@cartodb.com
