Welcome!
Real Estate Meetup
Introductions
Account Executive at CARTO
CEO & Data Scientist at Geolytix
Solutions Engineer at CARTO
Which community are you here from?
Which use cases are you typically focused on?
● Investment Analysis
● Indoor Mapping
● Site Planning
● Market Analysis
● Trade Area Analysis
● Pricing Optimization
Industry Participants
Seniority
● 43% individual contributors
● 45% mid management
● 12% senior management
Region
● North America 47%
● EMEA 38.6%
● APAC 9.8%
● Latin America 4.5%
Data Science and GIS teams at organizations
How many Data Scientists actually know about spatial?
Which technologies and languages are preferred in the industry?
Where do SDS spend time?
● 30%: Discovering data useful for their analysis
● 30%: Evaluating and purchasing data
● 20%: ETLing the data into common structures
● 20%: Analyzing, doing feature extraction and modeling
An example:
A Data Scientist needed demographics and zip code data for Portugal to perform a
particular market analysis.
80%
of participants believe that it is
difficult or very difficult to hire Data
Scientists with expertise in spatial
analysis
1. Strong background in statistics
2. Extensive coding experience relevant to Data
Science (Spark, SQL, Python, R, TensorFlow, PyTorch)
3. Experience developing production-quality data products
using the results of quantitative research
4. Extensive experience in data visualization (in Python and R
or other applications)
5. Effective application of Data Science workflows to
business problems, and the ability to storytell around
results
6. Familiarity with data pipelines and ETL practices (Airflow,
scheduled notebooks, Google Dataflow, etc.)
7. Familiarity with neural networks and deep learning (e.g.
TensorFlow, PyTorch)
8. Experience working with distributed computing systems
like Spark or Google BigQuery
9. Experience working with GIS software such as CARTO,
QGIS, or ArcGIS
How difficult is it to find the right software and data?
47% of participants do not find it challenging to
identify the right software & data to support
Spatial Data Science projects
How will investment in Spatial Data Science initiatives expand?
68% of organizations are likely to increase their investment in
Spatial Data Science in the next 2 years
New data & new use cases.
Let’s discuss!
Use code: #SPATIAL15
Real Estate Meetup
Jaime Sánchez, Solutions & Customer Success at CARTO
The Sum of Our Parts
The Complete Journey
As an organization, we have defined 5 steps that, together, create a
holistic Location Intelligence approach.
Our goal is to empower organizations as they traverse each of these 5
steps.
Spatial analysis in 5 key steps:
1. Data Ingestion & Management: Clean, geocode and visualize your data.
2. Data Enrichment: Using 3rd party datasets, ideally on standardized spatial
aggregations, to reduce your time to insight.
3. Analysis: Clustering, outlier analysis, time series predictions,
geographically weighted regression, and change of spatial support.
4. Solutions & Visualization: WebGL for big datasets, dashboards, widgets, apps.
5. Integration: Productize the model into a web service API. Feed back into
Data Warehouse, LoB systems, consumers, others.
1. Data Ingestion & Management
● Spatial database with multiple ways to connect
and manipulate your data
● Dynamic data in the cloud and multiple data
sources: local and remote files, cloud storages,
other databases, and more
● Fully managed database with automatic backups
and regular upgrades
● Enterprise data sharing and access across CARTO
Wide support for geospatial formats (incl. Shapefiles, KML, KMZ, GeoJSON,
GPX, OSM, GeoPackage, GDB, CSV, Excel or OpenDocument).
Ready-to-use connectors: ArcGIS Server and database connectors via APIs
(MySQL, PostgreSQL, Microsoft SQL Server; Hive on request).
2. Data Enrichment
● Save time in gathering spatial data, augmenting
your existing data with new location data
streams from across the globe
● Create locations from addresses and
understand travel time all from within CARTO
● Develop robust ETL processes and update
mechanisms so your data is always enriched
● Premium data to understand and analyze
deeper trends and behavior
3. Analysis
● Bring maps and data into your Data Science
workflows and the Python data science
ecosystem with CARTOframes
● Machine learning embedded in CARTO as
simple SQL calls for clustering, outlier analysis,
time series predictions, and geographically
weighted regression
● Use the power of PostGIS and our APIs to
productionize analysis workflows in your
CARTO platform
4. Solutions & Visualization
● Develop and build custom applications with a
full suite of frontend libraries.
● Work with CARTO’s Professional Services and
Support team as and when you need it.
● Create lightweight, intuitive dashboards for
simple sharing of insights across your
organization.
5. Integration
● Using CARTO’s APIs and SDKs, connect your
analysis into the places that matter most for
you and your team.
● Bring CARTO to other data destinations, such as
desktop GIS and BI tools.
● Embed CARTO inside other tools, such as
Salesforce Einstein Analytics or Qlik Sense.
● Work with our Professional Services team for
custom configurations or developments.
Let’s apply the journey to
a real world business
question!
How can we analyze and
understand real estate sales
in Los Angeles?
Pains
1. “Disconnected experiences to consume data: it is broken into
separate tools, teams, DBs, Excel files.”
2. “Limited developer time in our team.”
3. “Current data science workflow doesn’t have a geo focus, and spatial
modeling is cumbersome because I have to export results to XYZ
tool in order to visualize and test my model effectively.”
4. “Having trouble handling and visualizing big datasets.”
Outline the Process
1. Integrate spatial data of past home sales and property locations
in Los Angeles county
2. Enrich the data with a spatial context using a variety of relevant
resources (demographics, Mastercard transactions, OSM)
3. Clean and analyze the data, and create a predictive model for
homes that have not sold
4. Present the results in a Location Intelligence solution for users
5. Integrate and deploy the model into current workflows for day
to day use
1. Data
Integrate LA Housing Data
The Los Angeles County Assessor's office provides two different datasets
which we can use for this analysis:
● All Property Parcels in Los Angeles County for Record Year 2018
● All Property Sales from 2017 to Present
2018 Parcel Data
Past Sales Data
CREATE TABLE la_join AS
SELECT s.*,
p.zipcode as zipcode_p,
p.taxratearea_city,
p.ain as ain_p,
p.rollyear,
p.taxratearea,
p.assessorid,
p.propertylocation,
p.propertytype,
p.propertyusecode,
p.generalusetype,
p.specificusetype,
p.specificusedetail1,
p.specificusedetail2,
p.totbuildingdatalines,
p.yearbuilt as yearbuilt_p,
p.effectiveyearbuilt,
p.sqftmain,
p.bedrooms as bedrooms_p,
p.bathrooms as bathrooms_p,
p.units,
p.recordingdate,
p.landvalue,
p.landbaseyear,
p.improvementvalue,
p.impbaseyear,
p.the_geom as centroid
FROM sales_parcels s
LEFT JOIN assessor_parcels_data_2018 p ON s.ain::numeric = p.ain;
Clean and join the data on the unique identifier (ain) using SQL
2. Enrichment
Integrate LA Housing Data
Next we want to add spatial context to our housing data to
understand more about the area around each home:
● Demographics
● Mastercard (Scores and Merchants) (Nearest 5 Areas)
● Nearby Grocery Stores and Restaurants
● Proximity to Roads
Demographics
Add total population and median income from the US Census
Mastercard
Find the merchants and sales/growth scores in the five nearest block
groups to the home via Mastercard Retail Location Insights data
(
SELECT AVG(sales_metro_score)
FROM (
SELECT sales_metro_score
FROM mc_blocks
ORDER BY la_eval_clean.the_geom <-> mc_blocks.the_geom
LIMIT 5
) a
) as sale_metro_score_knn,
(
SELECT AVG(growth_metro_score)
FROM (
SELECT growth_metro_score
FROM mc_blocks
ORDER BY la_eval_clean.the_geom <-> mc_blocks.the_geom
LIMIT 5
) a
) as growth_metro_score_knn
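The two subqueries above average a score over the five block groups nearest to each home. As a rough illustration of that idea outside the database, here is a minimal Python sketch. It assumes each block group is reduced to a single planar point with a score (the function name, the tuple layout and the sample data are hypothetical), whereas the SQL orders by the PostGIS `<->` distance operator on the geometries themselves.

```python
import math

def knn_mean_score(home, blocks, k=5):
    """Average the score of the k blocks closest to `home`.

    home   -- (x, y) coordinates of the property
    blocks -- list of ((x, y), score) tuples, one per block group (hypothetical layout)
    """
    # Sort blocks by straight-line distance to the home and keep the k nearest
    nearest = sorted(blocks, key=lambda b: math.dist(home, b[0]))[:k]
    return sum(score for _, score in nearest) / len(nearest)

# Made-up sample: five nearby blocks and one far-away block that is ignored
sample_blocks = [((0, 1), 10), ((0, 2), 20), ((0, 3), 30),
                 ((0, 4), 40), ((0, 5), 50), ((0, 100), 1000)]
sample_mean = knn_mean_score((0, 0), sample_blocks)
```

With this sample the far-away block does not contribute, so the result is the mean of the five nearby scores.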
Grocery Stores/Restaurants
Find the number of grocery stores and restaurants using
OpenStreetMap Data and the SQL API.
(
SELECT count(restaurants_la.*)
FROM restaurants_la
WHERE ST_DWithin(
ST_Centroid(la_eval_clean.the_geom_webmercator),
restaurants_la.the_geom_webmercator,
1609 / cos(radians(ST_y(ST_Centroid(la_eval_clean.the_geom)))))
) as restaurants,
(
SELECT count(grocery_la.*)
FROM grocery_la
WHERE ST_DWithin(
ST_Centroid(la_eval_clean.the_geom_webmercator),
grocery_la.the_geom_webmercator,
1609 / cos(radians(ST_y(ST_Centroid(la_eval_clean.the_geom)))))
) as grocery_stores
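The search radius in these queries, `1609 / cos(radians(latitude))`, converts roughly one mile (1609 m) of ground distance into Web Mercator units, which are stretched by a factor of 1 / cos(latitude) away from the equator. A minimal Python sketch of the same arithmetic follows; the function name is hypothetical and the latitude of 34.05 for Los Angeles is an approximate, assumed value.

```python
import math

METERS_PER_MILE = 1609  # rounded, same constant as in the SQL above

def webmercator_radius(meters, latitude_deg):
    """Scale a ground distance in meters to Web Mercator units at a latitude."""
    return meters / math.cos(math.radians(latitude_deg))

# Approximate latitude of Los Angeles (assumption for illustration)
radius_la = webmercator_radius(METERS_PER_MILE, 34.05)
```

At the equator the radius stays at 1609 units; at higher latitudes it grows, which is why the SQL computes it per property rather than using a constant.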
Roads
See if a home is within one mile of a major highway or trunk highway
using the SQL API and major roads from OpenStreetMap.
(
SELECT CASE WHEN COUNT(la_roads.*) > 0 THEN 1 ELSE 0 END
FROM la_roads
WHERE ST_DWithin(
la_eval_clean.the_geom_webmercator,
la_roads.the_geom_webmercator,
1609 / cos(radians(ST_y(ST_Centroid(la_eval_clean.the_geom)))))
AND highway in ('motorway', 'trunk')
) as highways_in_1mile
3. Analysis
Analysis
The analysis for this project followed these steps:
● Moran’s I Clusters & Outliers (Exploratory Data Analysis)
● Neighbor Homes Analysis (Spatial Feature Engineering)
● Predictive Modeling & Hyperparameter Tuning (using XGBoost)
Moran’s I
Using Moran’s I to evaluate spatial clusters and outliers via the PySAL
package, we can see these groupings and visualize them in
CARTOframes.
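For intuition about what the statistic measures, here is a minimal pure-Python sketch of the global Moran's I. It is not the PySAL code used in the notebook: the cluster and outlier maps would come from the local variant of the statistic, and the function name and the tiny four-cell example are assumptions made for illustration.

```python
def morans_i(values, weights):
    """Global Moran's I for `values` given a square spatial weights matrix.

    weights[i][j] > 0 means observation j is a neighbor of observation i.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]                 # deviations from the mean
    s0 = sum(sum(row) for row in weights)          # sum of all weights
    cross = sum(weights[i][j] * z[i] * z[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in z)

# Four locations in a row, each a neighbor of the adjacent ones (made-up data)
chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
clustered = morans_i([1, 2, 3, 4], chain)      # similar values sit together
alternating = morans_i([1, -1, 1, -1], chain)  # neighbors always differ
```

Positive values indicate that similar values sit next to each other, negative values that neighbors tend to differ.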
Neighbor Analysis
Evaluate the attributes of
neighbor properties using
k-nearest neighbor spatial
weights in PySAL to perform
spatial feature engineering.
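A minimal sketch of the idea, assuming planar coordinates and equal weight for each of the k nearest neighbors (a row-standardized k-nearest-neighbor spatial lag). The notebook uses PySAL for this; the function name and the sample data below are hypothetical.

```python
import math

def knn_spatial_lag(coords, values, k=2):
    """For each property, average `values` over its k nearest other properties."""
    lags = []
    for i, point in enumerate(coords):
        others = [j for j in range(len(coords)) if j != i]
        # k nearest neighbors by straight-line distance, excluding the property itself
        nearest = sorted(others, key=lambda j: math.dist(point, coords[j]))[:k]
        lags.append(sum(values[j] for j in nearest) / len(nearest))
    return lags

# Made-up example: four homes along a street with their sale prices (in thousands)
homes = [(0, 0), (1, 0), (2, 0), (10, 0)]
prices = [100, 200, 300, 900]
neighbor_price = knn_spatial_lag(homes, prices, k=2)
```

The resulting column (here, the average price of the two nearest homes) can be added to the feature table alongside each home's own attributes.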
How the attributes of your neighbors, and the spatial context, influence the
price of your home…
Predictive Modeling
Using XGBoost we can use this data to create a regression model to
predict housing prices and push that data back to CARTO using
CARTOframes, never leaving the notebook environment.
[Diagram: modeling workflow. Labels: Past Sales (Sale Price), Spatial Data
Enrichment, Spatial Feature Engineering, Spatial Modeling, Train & Test Model,
Predictions. Note: analyze the values of nearest neighbor sales, clusters of
high Mastercard areas, proximity to features.]
Predictive Modeling
After hyperparameter tuning, the model's Mean Absolute Error (MAE)
comes down to $58,179.78.
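Assuming the error metric reported here is the mean absolute error (the average size of the gap between predicted and actual sale price), it can be computed as in the sketch below. The function name and the prices are made up for illustration and are not taken from the model.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical sale prices and model predictions, in dollars
actual_prices = [500_000, 750_000, 300_000]
predicted_prices = [450_000, 800_000, 310_000]
mae = mean_absolute_error(actual_prices, predicted_prices)
```

An MAE in dollars reads directly as "on average, the prediction is off by this amount", which makes it easy to communicate to business users.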
Feature Importance
4. Solutions
Solutions
To present the data and the model's predictions, both for homes that
have a sale price and for homes that have not sold, we can develop
a Location Intelligence application to showcase these results.
Los-Angeles Prediction Explorer
5. Integration
Application Development
Deploy the model via a Python-based
API and sync it with the data to
perform on-the-fly predictions
for specific properties.
Other Use Cases
● Predict revenue from different physical retail locations
● Identify clusters and groups of specific patterns to optimize
activities such as sales outreach or site selection
● Classify property types or buying patterns in a city
● Review spatial feature importance for site performance, and
modify models using different spatial components
● Find areas with similar behavioral patterns
Similarity Analysis
We built a model to identify areas with similar
behavior patterns based on footfall, socioeconomic
and financial data, and more. The similarity score is
modeled as follows:
● Distance between cells is calculated with an L2
norm in a Principal Component (PC) space.
● Uncertainty due to missing values and the
dimension of the PC space is tackled following an
ensemble probabilistic approach.
● Similarity Score = Continuous Ranked
Probability Skill Score.
By enriching the data with other sources this model
can be used for Site Planning, Investment Analysis,
etc.
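To make the distance step concrete, here is a minimal sketch that ranks candidate cells by their L2 (Euclidean) distance to a reference cell. It assumes every cell has already been projected onto principal components. It leaves out the ensemble treatment of uncertainty and the skill score, and the function name, cell names and coordinates are all hypothetical.

```python
import math

def rank_by_similarity(reference, candidates):
    """Return (cell, distance) pairs sorted from most to least similar.

    reference  -- principal-component coordinates of the reference cell
    candidates -- dict mapping cell name to its principal-component coordinates
    """
    distances = {name: math.dist(reference, pcs) for name, pcs in candidates.items()}
    return sorted(distances.items(), key=lambda item: item[1])

# Made-up cells described by two principal components
cells = {"cell_a": (3.0, 4.0), "cell_b": (1.0, 0.0), "cell_c": (0.0, 2.0)}
ranking = rank_by_similarity((0.0, 0.0), cells)
```

A smaller distance means the cell behaves more like the reference location across the input variables.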
109 S 5th Street Brooklyn NY 11249
DEMOGRAPHICS
● Population
● Household spending
● Household income
HUMAN MOBILITY
● Number of visitors
ROAD TRAFFIC
● Number of vehicles
FINANCIAL
● Ticket size
● Number of transactions
POIs
● Offices
● Shops
● Transport
Let’s see the Python notebook….
The Role of Data Science in Real Estate
Network strategy
Location planning
Omnichannel analysis
Spatial modelling
Our whole business is about location planning. As trusted
advisors we help our customers decide how many stores,
who to acquire, where to open, which format and how to
optimise home delivery and click & collect operations.
Team of 36 location
specialists to work
collaboratively with your
business
Led in-house location planning for major global retailers.
Experts in spatial modelling, forecasting, web development
and systems.
Create innovative new datasets for local markets.
Growing to a global
company
Offices in London, Leeds, Warsaw, Dortmund, Shanghai,
Tokyo and Melbourne
INTRODUCTION
OUR OFFER
1. DATA
2. MODEL
3. TOOL
HISTORY
Year | Clients | Key Events | Team
2012 | Sainsbury’s, Whole Foods | Foundation | 1
2013 | ASDA, Boots, Waitrose | ASDA project transformative, enables growth. Build key datasets. | 4
2014 | Post Office, Camelot, Barclays | New multi-year deals giving confidence. Take office space. Evolve data offer. | 6
2015 | Amazon, Swinton, Savills | Growth in ‘adjacent’ spaces. Invest in capacity and recurring revenue growth. | 10
2016 | M&S, TRG, EE, On the market | Growing & diversifying the client list. Exploring innovative global DaaS solutions. | 15
2017 | Adidas, Rightmove, Dominos | Growth in international markets, Shanghai & Tokyo offices open. Leeds office opens in the UK. | 18
2018 | Costa, Dr Martens | Warsaw office opens. Launched MAPP, our online map-based analytical tool. | 24
2019 | Lego, Starbucks | Melbourne office opens. Large multi-country, multi-brand advice. | 35
OUR COVERAGE
PARTNERSHIPS
2013
WHERE PEOPLE ARE
HOW RICH ARE THEY
WHERE CAN THEY SHOP
DATA SCIENCE
REAL ESTATE
FOUNDATIONAL FEATURES
• Decisions are complex and outcomes only become
clear over years
• Choices are multi-faceted and driven by dynamic
competing interests
• Key information is tightly held
• The amounts of money involved are vast
• Decisions are hard to undo
• “Retailers make few decisions that are as
permanent and unforgiving as selecting store
locations.”
SOME HISTORY
SPATIAL DATA SCIENCE
• William Playfair – 1780s
• Charles Minard – 1830s
• John Snow – 1850s
• Charles Booth – 1890s
• Roger Tomlinson – 1960s
• Arthur Samuel – 1950s
• David Huff – 1970s
SOME ISSUES
UNDERPINNING STATISTICAL ISSUES
Samples are not randomly drawn from the variable space
Items from within the sample influence each other
Hardly any variables are normally distributed
These three features fatally wound pretty much every standard statistical approach
WHAT DATA SCIENCE CAN DO
Describe things
Classify stuff
Predict responses
What we really want to do is to predict the future
THE GP ANALOGUE
Diagnose the
business problem
Bring in the specialists if you need them
(e.g. algorithm/model creation)
Communicate to business stakeholders
Support decision making
Machine Learning
WHAT IS MACHINE LEARNING?
● “Machine Learning” refers to the field of study that gives computers the ability to learn without
being explicitly programmed (Samuel, 1959)
● In practical terms, a series of different algorithms can be applied to detect patterns in data
(including big data), which can lead to actionable insights
● Common machine learning applications include (non-exhaustive list):
● Regression Forecasting (e.g. sales forecasts)
● Classification or Clustering (e.g. segmentation or image classification)
● Association Rule Learning (interesting relations; e.g. which other products are you likely to
buy based on your other purchases?)
● Reinforcement Learning (e.g. chess AI)
IT’S NOT NEW
● Machine learning is not a new concept… In 1952
Arthur Samuel wrote the first computer program
which learned as it ran
● First neural network to solve a real world problem
was designed in 1959 (an adaptive filter to remove
echoes from phone lines)
● So if ML isn’t new, why is it becoming so popular
now?
WHY NOW? - MOORE’S LAW
COMMON TYPES OF MACHINE LEARNING ALGORITHMS
● Supervised Learning:
● The user (human) teaches the algorithm by providing it with input data and a sample of
result data (e.g. x = input features, y = actual sales)
● The algorithm then attempts to learn from the input data how best to predict a result (e.g.
predict sales)
● Unsupervised Learning:
● The computer is trained with unlabelled data; there is no teacher
● This family of machine learning algorithms is useful for pattern recognition and rule detection
● Semi-supervised Learning:
● A combination of supervised and unsupervised methods
● Reinforcement Learning:
● Maximises reward and minimises risk, iteratively learning from the environment
● Determines ideal behaviour within specific contexts
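As a small illustration of the supervised case, the sketch below fits a straight line to labelled examples (x = one input feature, y = actual sales) with ordinary least squares and then predicts an unseen case. The data and function name are made up, and a real sales model would use many features and a library such as scikit-learn or XGBoost.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: store size (x) and actual sales (y)
sizes = [1, 2, 3, 4]
sales = [3, 5, 7, 9]
slope, intercept = fit_line(sizes, sales)

# Predict sales for a store size the algorithm has not seen
predicted_sales = slope * 5 + intercept
```

The "teaching" is the pairing of each input with its known result; the fitted slope and intercept are what the algorithm has learned.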
USE CASES WITHIN LOCATION PLANNING
● So what’s the catch!?
● How can we use this in location planning/real estate!?
Case Studies
EXAMPLES OF USE CASES
● Using K-nearest neighbour to create demographic segmentations, based on known customer
data
● Learning about key drivers of success by examining feature importance
● Building forecasting models to predict sales based on property location
● Using NLP for categorising customer comments
● Etc.
One of the more interesting solutions we’ve used recently combines traditional methods with
machine learning…
GROCERY GRAVITY MODEL
● Gravity models are common practice within the grocery retail location planning/real estate space
● It is important for grocers to understand which locations would be ideal for a new supermarket,
but also to understand the impact this might have on existing locations and competitors…
Gravity Model in a nutshell:
● Based on theory of gravity
● More attractive destinations have a greater ‘pull’
● Attraction is linked to distance
● Using customer data we know how far people
actually travel to their chosen stores
● Fundamental concept is logical, and simple to
understand
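One common way to express these ideas is a Huff-style formulation, in which the probability of a customer choosing a store grows with the store's attractiveness and falls with distance. The sketch below assumes that form; the deck does not say which formulation is actually used, and the function name, the distance-decay exponent `beta` and the sample numbers are assumptions for illustration.

```python
def huff_probabilities(attractiveness, distances, beta=2.0):
    """Probability that a customer picks each store (Huff-style gravity model).

    attractiveness -- one value per store, e.g. floor space
    distances      -- distance or drive time from the customer to each store
    beta           -- distance-decay exponent (assumed value)
    """
    pulls = [a / d ** beta for a, d in zip(attractiveness, distances)]
    total = sum(pulls)
    return [p / total for p in pulls]

# Made-up example: a small store 1 km away and a store four times larger 2 km away
shares = huff_probabilities([1000, 4000], [1, 2])
```

In this example the larger store's extra pull is exactly offset by its greater distance, so the customer is equally likely to choose either one. In practice the exponent is calibrated against customer data on how far people actually travel.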
GROCERY GRAVITY MODEL
● Gravity models are often very accurate at estimating
customer patterns and interactions at close range…
● However, this accuracy usually wanes as you try to
model sales from further afield:
● Consumers’ decisions are much harder to
understand
● Consumers have more choice
● Are they workers or residents?
● Decision is not as simple as “I’ll just pop
into my nearest, most attractive
supermarket”…
GROCERY GRAVITY MODEL
● The solution: use machine learning to create an
estimate for ‘Sales beyond 30mins’
● Created a datamart for each property in the portfolio (see
opposite) and tested various machine learning algorithms
to see if we could more accurately predict sales than
previously
● Eventually settled on a neural network
● It’s not as easily interpretable, but gives better results
on interactions which are inherently difficult to
understand anyway!
√ +20% of store sales beyond 30 minutes drivetime were
more accurately predicted
√ R² increased by 0.25 for beyond sales
CASE STUDIES
SOME OF OUR CUSTOMERS
EXAMPLE FASHION CLIENT
OBJECTIVE
With six stores operating in Hong
Kong, Dr. Martens wanted to
understand how high the achievable
turnover is at each location.
Additionally, an understanding of the
best locations for new stores was
required as part of a future store
investment roadmap.
RESULT
We created new datasets and a
bespoke model to calculate sales
potentials for the existing store
network. The model was then used in
a future opportunity scan to identify
the best locations for new stores in
Hong Kong.
STRATEGY
MODEL DEMAND: Calculate how
much people spend on footwear at the
lowest possible geography
MAP THE RETAIL LANDSCAPE:
Understand the locations where
retailers cluster in Hong Kong
SALES POTENTIAL: Calculate how
much turnover is achievable at a retail
venue (e.g. Mall) and individual store
level
OPPORTUNITY SCAN: Use the
developed model and data sets to find
ideal locations for the next Dr. Martens
stores in Hong Kong
EXAMPLE F&B CLIENT
OBJECTIVE
Support the Dominos strategy to be the number one
pizza company in each neighbourhood with a focus on
franchisee profitability.
RESULT
“Our work with GEOLYTIX has enabled us to form a
consistent approach to new site forecasting, step
changing our understanding of customers catchments
and improving our ability to understand regional and
store performance. The collaborative approach has
resulted in us being able to make decisions around our
future location strategy and form ideal network
blueprints with significantly increased confidence.”
Craig Donnellan, Head of Location Planning
Dominos Pizza.
STRATEGY
• Understand the true drivers of store performance &
the impact on nearby stores of opening new sites.
• Predict new store sales and cannibalisation using a
consistent, transparent fact base and model.
• Improve the efficiency of the store forecasting
process to allow more time for the value-add.
• Deliver the ideal network blueprint and optimum
network strategy.
EXAMPLE RETAIL CLIENT: FOOD, FASHION & HOME
OBJECTIVE
Support a step change in the roll-out of the Food estate,
understand the drivers of performance for the Clothing &
Home estate and recommend the optimum network
blueprint.
RESULT
“GEOLYTIX have worked with us to create a bespoke
toolset enabling us to proactively set our strategy and
quickly answer any What if scenarios. Their analysis and
recommendations have provided us with a consistent
evidence base from which to make our network
decisions."
STRATEGY
• Create an efficient selection & sales forecasting
process, based on a rigorous, objective fact base and
a consistent approach.
• Understand the drivers and catchments of the
Clothing & Home estate, in order to build optimal
networks.
• Integrate custom models with existing data and
software to create the M&S modelling toolkit.
• Bulk run multiple national and regional scenarios to
guide network strategy and create future blueprints.
EXAMPLE REAL ESTATE ADVISOR PROJECT
OBJECTIVE
Data / analytical support in evaluating potential acquisition opportunities and ongoing asset
management of retail assets.
STRATEGY
• Creation of town centre & grocery gravity models to
assess:
• Catchment profiles and fit to various potential
new occupiers
• Impacts of new greenfield developments and
centre remodels
• Existing retailer chain performance and potential
‘best next’ opportunities
• Ad-hoc consultancy support
• Assisting with major M&A and liquidity event
support
• Detailed asset reports including site visits to
support redevelopment
RESULT
• We provide access to our data and models through
a desktop GIS reporting tool which allows for:
• Ad hoc area demographic reporting
• Retail presence and chain list reports
• Analogue tool to find similar locations
• Drive time reporting
• Bespoke ‘client ready’ site and area reports
WHY GEOLYTIX
World Class Modelling. We have delivered optimisation models for many of the most successful organisations
in the world, across multiple sectors.
Innovation. The Queen’s Award for Innovation reflects our passion for being on the leading edge of new data,
technology, and ideas.
Practical Senior-Level Experience. We are practical operators, with Director-level experience in property teams
of some of the UK’s largest companies.
Technical Expertise. We are experienced data scientists, sales forecasting modellers and spatial web
application developers, and will build a bespoke solution. Every element of our solution, from the analysis to the
platform, will be specifically designed to meet your specific requirements.
We are global. Accounting for the often vast differences in structure, maturity and data availability and quality we
are able to apply a consistent approach across territories in order to support fact-based decisions.
Proven Track Record. We have delivered similar solutions many times before. We will deliver to spec, to time,
and to a fixed budget.
Genuine Partnership. Our commitment is to work closely with you through to deployment, and maintain the
support and relationship beyond.
Thank you!
lrutherford@cartodb.com
CEO & Data Scientist at Geolytix
jsanchez@cartodb.com

The Role of Data Science in Real Estate

  • 1.
  • 2.
    Introductions Account Executive atCARTO CEO & Data Scientist at Geolytix Solutions Engineer at CARTO
  • 3.
    Which community areyou here from?
  • 4.
    Which use cases are youtypically focused on? Investment Analysis Indoor Mapping Site Planning Market Analysis Trade Area Analysis Pricing Optimization
  • 6.
    APAC 9.8% Industry Participants Seniority Region 43%individual contributors 45% Mid management 12% Senior management North America 47% EMEA 38.6% Latin America 4.5%
  • 7.
    Data Science andGIS teams at organizations
  • 8.
    Data Science andGIS teams at organizations
  • 9.
  • 10.
  • 11.
    Discovering data useful fortheir analysis Evaluating and purchasing data ETLing the data into common structures Analyzing, doing feature extraction and modeling 30% 30% 20% 20% Where do SDS spend time?
  • 12.
    A Data Scientistneeded demographics and zip code data for Portugal to perform a particular market analysis: An example:
  • 13.
    80% of participants believethat it is difficult or very difficult to hire Data Scientists with expertise in spatial analysis 1. Strong background in statistics 2. Extensive experience in coding skills relating to Data Science (Spark, SQL, Python, R, Tensorflow, Pytorch) 3. Experience developing production-quality data products using the results of quantitative research 4. Extensive experience in data visualization (in Python and R or other applications) 5. Effective application of Data Science workflows to business problems, and the ability to storytell around results 6. Familiarity with data pipelines and ETL practices (Airflow, scheduled notebooks, Google DataFlow, etc.) 7. Familiarity with neural networks and deep learning (e.g. Tensorflow, PyTorch) 8. Experience working with distributed computing systems like Spark or Google BigQuery 9. Experience working with GIS software such as CARTO, QGIS, or ArcGIS
  • 14.
    47% of participantsdo not find it challenging to identify the right software & data to support Spatial Data Science projects How difficult Is it to find the right software and data?
  • 15.
    How will investment in SpatialData Science initiatives expand? 68% of organizations are likely to increase their investment in Spatial Data Science in the next 2 years
  • 17.
    New data &new use cases. Let’s discuss!
  • 18.
  • 20.
    Real Estate Meetup JaimeSánchez, Solutions & Customer Success at CARTO
  • 21.
    The Sum ofOur Parts The Complete Journey As an organization, we have defined 5 steps that, together, create a holistic Location Intelligence approach. Our goal is to empower organizations as they traverse each of these 5 steps.
  • 22.
    Spatial analysis in5 key steps: Data Ingestion & Management Data Enrichment Analysis Solutions & Visualization Integration Clean, geocode and visualize your data. Clustering, outliers analysis, time series predictions, and geospatial weighted regression, change spatial support Using 3rd party datasets — ideally on standardized spatial aggregations to reduce your time to insight. WebGL for big datasets, dashboards, widgets, apps. Productize model into a web service API. Feed back into Data Warehouse, LoB systems, consumers, others.
  • 23.
    1. Data Ingestion& Management ● Spatial database with multiple ways to connect and manipulate your data ● Dynamic data in the cloud and multiple data sources: local and remote files, cloud storages, other databases, and more ● Fully managed database with automatic backups and regular upgrades ● Enterprise data sharing and access across CARTO Wide support for geospatial formats (inc. Shapefiles, KML, KMZ, GeoJSON, GPX, OSM, GeoPackage, GDB, CSV, Excel or OpenDocument). Plug ready database connectors (ArcGIS Server, DB Connectors via APIs (MySQL, PostgresSQL, Microsoft SQL Server, Hive on request)).
  • 24.
    2. Data Enrichment ●Save time in gathering spatial data, augmenting your existing data with new location data streams from across the globe ● Create locations from addresses and understand travel time all from within CARTO ● Develop robust ETL processes and update mechanisms so your data is always enriched ● Premium data to understand and analyze deeper trends and behavior
  • 25.
    3. Analysis ● Bringmaps and data into your Data Science workflows and the Python data science ecosystem with CARTOframes ● Machine learning embedded in CARTO as simple SQL calls for clustering, outliers analysis, time series predictions, and geospatial weighted regression ● Use the power of PostGIS and our APIs to productionalize analysis workflows in your CARTO platform
  • 26.
    4. Solutions &Visualization ● Develop and build custom applications with a full suite of frontend libraries. ● Work with CARTO’s Professional Services and Support team as and when you need it. ● Create lightweight, intuitive dashboards for simple sharing of insights across your organization.
  • 27.
    5. Integration ● UsingCARTO’s APIs and SDKs, connect your analysis into the places that matter most for you and your team. ● Bring CARTO to other data destinations, such as desktop GIS and BI tools. ● Embed CARTO inside other tools, such as Salesforce Einstein Analytics or Qlik Sense. ● Work with our Professional Services team for custom configurations or developments.
  • 28.
    Let’s apply thejourney to a real world business question!
  • 29.
    How can weanalyze and understand real estate sales in Los Angeles?
  • 30.
    Pains 1. “Disconnected experiencesto consume data - it is broken into separate tools, teams DBs, excels.” 2. “Limited developer time in our team.” 3. “Current data science workflow doesn’t have a geo focus. and Spatial modeling is cumbersome because I have to export results to XYZ tool in order to visualize and test my model effectively.” 4. “Having trouble handling and visualizing big datasets.“
  • 31.
    Outline the Process 1.Integrate spatial data of past home sales and property locations in Los Angeles county 2. Enrich the data with a spatial context using a variety of relevant resources (demographics, mastercard transactions, OSM) 3. Clean and analyze the data, and create a predictive model for homes that have not sold 4. Present the results in a Location Intelligence solution for users 5. Integrate and deploy the model into current workflows for day to day use
  • 32.
  • 33.
    Integrate LA HousingData The Los Angeles County Assessor's office provides two different datasets which we can use for this analysis: ● All Property Parcels in Los Angeles County for Record Year 2018 ● All Property Sales from 2017 to Present
  • 34.
  • 35.
  • 36.
  • 37.
  • 38.
    CREATE TABLE la_joinAS SELECT s.*, p.zipcode as zipcode_p, p.taxratearea_city, p.ain as ain_p, p.rollyear, p.taxratearea, p.assessorid, p.propertylocation, p.propertytype, p.propertyusecode, p.generalusetype, p.specificusetype, p.specificusedetail1, p.specificusedetail2, p.totbuildingdatalines, p.yearbuilt as yearbuilt_p, p.effectiveyearbuilt, p.sqftmain, p.bedrooms as bedrooms_p, p.bathrooms as bathrooms_p, p.units, p.recordingdate, p.landvalue, p.landbaseyear, p.improvementvalue, p.impbaseyear, p.the_geom as centroid FROM sales_parcels s LEFT JOIN assessor_parcels_data_2018 p ON s.ain::numeric = p.ain Clean and join the data on unique identifier using SQL
  • 39.
  • 40.
    Integrate LA HousingData Next we want to add spatial context to our housing data to understand more about the areas around: ● Demographics ● Mastercard (Scores and Merchants) (Nearest 5 Areas) ● Nearby Grocery Stores and Restaurants ● Proximity to Roads
  • 41.
    Demographics Add total populationand median income from the US Census
  • 42.
    Mastercard Find the merchants and sales/growth scores in the five nearest block groups to the home via Mastercard Retail Location Insights data
  • 43.
    (
      SELECT AVG(sales_metro_score)
      FROM (
        SELECT sales_metro_score
        FROM mc_blocks
        ORDER BY la_eval_clean.the_geom <-> mc_blocks.the_geom
        LIMIT 5
      ) a
    ) as sale_metro_score_knn,
    (
      SELECT AVG(growth_metro_score)
      FROM (
        SELECT growth_metro_score
        FROM mc_blocks
        ORDER BY la_eval_clean.the_geom <-> mc_blocks.the_geom
        LIMIT 5
      ) a
    ) as growth_metro_score_knn
  • 44.
    Grocery Stores/Restaurants Find the number of grocery stores and restaurants using OpenStreetMap data and the SQL API.
  • 45.
    (
      SELECT count(restaurants_la.*)
      FROM restaurants_la
      WHERE ST_DWithin(
        ST_Centroid(la_eval_clean.the_geom_webmercator),
        restaurants_la.the_geom_webmercator,
        1609 / cos(radians(ST_y(ST_Centroid(la_eval_clean.the_geom)))))
    ) as restaurants,
    (
      SELECT count(grocery_la.*)
      FROM grocery_la
      WHERE ST_DWithin(
        ST_Centroid(la_eval_clean.the_geom_webmercator),
        grocery_la.the_geom_webmercator,
        1609 / cos(radians(ST_y(ST_Centroid(la_eval_clean.the_geom)))))
    ) as grocery_stores
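The `1609 / cos(radians(...))` term in these queries can be read as a one-mile search radius (about 1609 m) converted into Web Mercator units, which are stretched by a factor of 1 / cos(latitude). The Python sketch below illustrates that arithmetic only; the function name and the Los Angeles latitude are assumptions for illustration and do not come from the slides.

```python
import math

def mercator_radius(ground_meters: float, latitude_deg: float) -> float:
    """Return the Web Mercator distance that matches a ground distance
    at the given latitude (hypothetical helper, not part of any API)."""
    return ground_meters / math.cos(math.radians(latitude_deg))

# Illustrative latitude close to Los Angeles (assumed value).
radius = mercator_radius(1609, 34.05)

# At the equator there is no stretch, so the radius is unchanged.
equator_radius = mercator_radius(1609, 0.0)
```

The further a home is from the equator, the larger the radius has to be in projected units to cover the same mile on the ground.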
  • 46.
    Roads See if a home is within one mile of a major highway or trunk highway using the SQL API and major roads from OpenStreetMap.
  • 47.
    (
      SELECT CASE WHEN COUNT(la_roads.*) > 0 THEN 1 ELSE 0 END
      FROM la_roads
      WHERE ST_DWithin(
        la_eval_clean.the_geom_webmercator,
        la_roads.the_geom_webmercator,
        1609 / cos(radians(ST_y(ST_Centroid(la_eval_clean.the_geom)))))
      AND highway in ('motorway', 'trunk')
    ) as highways_in_1mile
  • 48.
  • 49.
    Analysis The analysis for this project followed these steps: ● Moran’s I Clusters & Outliers (Exploratory Data Analysis) ● Neighbor Homes Analysis (Spatial Feature Engineering) ● Predictive Modeling & Hyperparameter Tuning (using XGBoost)
  • 50.
    Moran’s I Using Moran’s I, computed with the PySAL package, we can identify spatial clusters and outliers and visualize these groupings in CARTOframes.
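As a rough illustration of the statistic behind this step, the sketch below computes the global Moran's I in plain Python. It is not the PySAL API, and the slide's clusters and outliers come from the local variant of the statistic, which is not shown here. The function name, the toy values and the neighbour matrix are all assumptions.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weights matrix."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]           # deviations from the mean
    s0 = sum(sum(row) for row in weights)    # sum of all spatial weights
    cross = sum(weights[i][j] * z[i] * z[j]
                for i in range(n) for j in range(n))
    squared = sum(zi * zi for zi in z)
    return (n / s0) * cross / squared

# Four hypothetical homes along a street; each neighbours the adjacent ones.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([1, 2, 3, 4], chain)    # similar values sit together
alternating = morans_i([1, 4, 1, 4], chain)  # high and low values alternate
```

A positive value indicates that similar values sit near each other, and a negative value indicates that neighbours tend to differ.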
  • 51.
    The Sum of Our Parts Moran’s I
  • 52.
    Neighbor Analysis Evaluate the attributes of neighbor properties using k-nearest neighbor spatial weights in PySAL to perform spatial feature engineering.
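One way to picture this kind of spatial feature engineering is a "spatial lag": for each home, the average of an attribute over its k nearest neighbours. The sketch below is a plain-Python illustration under that assumption and does not use PySAL. The function name, coordinates and prices are hypothetical.

```python
import math

def knn_lag(points, values, k):
    """For each point, average the values of its k nearest neighbours,
    excluding the point itself."""
    lags = []
    for i, (xi, yi) in enumerate(points):
        # Distance from point i to every other point, nearest first.
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points) if j != i
        )
        nearest = [j for _, j in dists[:k]]
        lags.append(sum(values[j] for j in nearest) / len(nearest))
    return lags

# Hypothetical home locations (planar coordinates) and sale prices.
homes = [(0, 0), (1, 0), (2, 0), (10, 0)]
prices = [100, 200, 300, 900]
neighbour_price = knn_lag(homes, prices, k=2)
```

The resulting column can be added to the model's inputs alongside each home's own attributes.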
  • 53.
    How the attributes of your neighbors influence the price of your home and spatial context…
  • 54.
  • 55.
    Predictive Modeling Using XGBoost, we can train a regression model on this data to predict housing prices and push the predictions back to CARTO using CARTOframes, without leaving the notebook environment.
  • 56.
    Sale Price | Past Sales | Spatial Data Enrichment | Spatial Modeling | Analyze the values of nearest neighbor sales, clusters of high Mastercard areas, proximity to features | Train & Test Model | Predictions | Spatial Feature Engineering
  • 57.
    Predictive Modeling After tuning the model's hyperparameters, we can reduce the mean absolute error (MAE) to $58,179.78.
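For reference, the mean absolute error is the average size of the gap between predicted and actual sale prices. The sketch below shows the calculation in plain Python. The prices are made-up illustrative values, not results from the deck.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical sale prices and model predictions, in dollars.
actual_prices = [500_000, 750_000, 620_000]
predicted_prices = [480_000, 800_000, 600_000]
mae = mean_absolute_error(actual_prices, predicted_prices)
```

Because the error is in dollars, an MAE figure can be read directly as the typical miss per home.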
  • 58.
    Feature Importance
  • 59.
  • 60.
    Solutions To present the data and predictions, both for homes with a recorded sale price and for homes that have not sold, we can develop a location intelligence application.
  • 61.
  • 62.
  • 63.
    Application Development Deploy the model via a Python-based API and sync it to the data to perform on-the-fly predictions for specific properties.
  • 64.
    Other Use Cases ● Predict revenue from different physical retail locations ● Identify clusters and groups of specific patterns to optimize activities such as sales outreach or site selection ● Classify property types or buying patterns in a city ● Review spatial feature importance for site performance, and modify models using different spatial components ● Find areas with similar behavioral patterns
  • 65.
    Similarity Analysis We built a model to identify areas with similar behavior patterns based on footfall, socioeconomic and financial data, and more. The similarity score is modeled as follows: ● Distance between cells is calculated with an L2 norm on a Principal Component space. ● Uncertainty due to missing values and the dimension of the PC space is tackled with an ensemble probabilistic approach. ● Similarity Score = Continuous Ranked Probability Skill Score. By enriching the data with other sources, this model can be used for Site Planning, Investment Analysis, etc.
  • 66.
    POPULATION HOUSEHOLD INCOME VISITORS TRANSACTIONS 109 S 5th Street Brooklyn NY 11249
  • 67.
    DEMOGRAPHICS ● Population ● Household spending ● Household income
  • 68.
    DEMOGRAPHICS HUMAN MOBILITY ● Number of visitors
  • 69.
    DEMOGRAPHICS HUMAN MOBILITY ROAD TRAFFIC ● Number of vehicles
  • 70.
    DEMOGRAPHICS HUMAN MOBILITY ROAD TRAFFIC FINANCIAL ● Ticket size ● Number of transactions
  • 71.
    DEMOGRAPHICS HUMAN MOBILITY ROAD TRAFFIC FINANCIAL POIs ● Offices ● Shops ● Transport
  • 72.
    Let’s see the Python notebook…
  • 73.
    The Role of Data Science in Real Estate
  • 74.
    INTRODUCTION Network strategy, location planning, omnichannel analysis, spatial modelling. Our whole business is about location planning. As trusted advisors we help our customers decide how many stores, who to acquire, where to open, which format and how to optimise home delivery and click & collect operations. Team of 36 location specialists to work collaboratively with your business. Led in-house location planning for major global retailers. Experts in spatial modelling, forecasting, web development and systems. Create innovative new datasets for local markets. Growing to a global company: offices in London, Leeds, Warsaw, Dortmund, Shanghai, Tokyo and Melbourne.
  • 75.
  • 76.
    HISTORY
    Year | Clients | Key Events | Team
    2012 | Sainsbury’s, Whole Foods | Foundation | 1
    2013 | ASDA, Boots, Waitrose | ASDA project transformative, enables growth. Build key datasets. | 4
    2014 | Post Office, Camelot, Barclays | New multi-year deals giving confidence. Take office space. Evolve data offer. | 6
    2015 | Amazon, Swinton, Savills | Growth in ‘adjacent’ spaces. Invest in capacity and recurring revenue growth. | 10
    2016 | M&S, TRG, EE, On the market | Growing & diversifying the client list. Exploring innovative global DAAS solutions. | 15
    2017 | Adidas, Rightmove, Dominos | Growth in international markets, Shanghai & Tokyo offices open. Leeds office opens in the UK. | 18
    2018 | Costa, Dr Martens | Warsaw office opens. Launched MAPP, our online map-based analytical tool. | 24
    2019 | Lego, Starbucks | Melbourne office opens. Large multi-country, multi-brand advice. | 35
  • 77.
  • 78.
  • 79.
  • 80.
  • 81.
  • 82.
  • 83.
    REAL ESTATE FOUNDATIONAL FEATURES • Decisions are complex and outcomes only become clear over years • Choices are multi-faceted and driven by dynamic competing interests • Key information is tightly held • The amounts of money involved are vast • Decisions are hard to undo • “Retailers make few decisions that are as permanent and unforgiving as selecting store locations.”
  • 84.
    SOME HISTORY SPATIAL DATA SCIENCE • William Playfair – 1780s • Charles Minard – 1830s • John Snow – 1850s • Charles Booth – 1890s • Roger Tomlinson – 1960s • Arthur Samuel – 1950s • David Huff – 1970s
  • 85.
    SOME ISSUES UNDERPINNING STATISTICAL ISSUES Samples are not randomly drawn from the variable space. Items from within the sample influence each other. Hardly any variables are normally distributed. These three features fatally wound pretty much every standard statistical approach. WHAT DATA SCIENCE CAN DO Describe things. Classify stuff. Predict responses. What we really want to do is to predict the future.
  • 86.
    THE GP ANALOGUE Diagnose the business problem. Bring in the specialists if you need them (e.g. algorithm/model creation). Communicate to business stakeholders. Support decision making.
  • 87.
  • 88.
    WHAT IS MACHINE LEARNING? ● “Machine Learning” refers to the field of study that gives computers the ability to learn without being explicitly programmed (Samuel, 1959) ● In practical terms, a series of different algorithms can be applied to detect patterns in data (including big data), which can lead to actionable insights ● Common machine learning applications include (not an exhaustive list): ● Regression Forecasting (e.g. sales forecasts) ● Classification or Clustering (e.g. segmentation or image classification) ● Association Rule Learning (interesting relations; e.g. which other products are you likely to buy based on your other purchases?) ● Reinforcement Learning (e.g. chess AI)
  • 89.
    IT’S NOT NEW ● Machine learning is not a new concept… In 1952 Arthur Samuel wrote the first computer program which learned as it ran ● The first neural network to solve a real-world problem was designed in 1959 (an adaptive filter to remove echoes from phone lines) ● So if ML isn’t new, why is it becoming so popular now?
  • 90.
    WHY NOW? - MOORE’S LAW
  • 91.
    COMMON TYPES OF MACHINE LEARNING ALGORITHMS ● Supervised Learning: ● The user (human) teaches the algorithm by providing it with input data and a sample of result data (e.g. x = input features, y = actual sales) ● The algorithm then attempts to learn from the input data how best to predict a result (e.g. predict sales) ● Unsupervised Learning: ● The computer is trained with unlabelled data; there is no teacher ● This family of machine learning algorithms is useful for pattern recognition and rule detection ● Semi-supervised Learning: ● A combination of supervised and unsupervised methods ● Reinforcement Learning: ● Maximises reward and minimises risk, iteratively learning from the environment ● Determines ideal behaviour within specific contexts
  • 92.
    USE CASE FOR WITHIN LOCATION PLANNING ● So what’s the catch!? ● How can we use this in location planning/real estate!?
  • 93.
  • 94.
    EXAMPLES OF USE CASES ● Using K-nearest neighbour to create demographic segmentations, based on known customer data ● Learning about key drivers of success by examining feature importance ● Building forecasting models to predict sales based on property location ● Using NLP for categorising customer comments ● Etc. One of the more interesting solutions we’ve used recently combines traditional methods with machine learning…
  • 95.
    GROCERY GRAVITY MODEL ● Gravity models are common practice in grocery retail location planning and real estate ● It is important for grocers to understand which locations would be ideal for a new supermarket, but also to understand the impact this might have on existing locations and competitors… Gravity Model in a nutshell: ● Based on the theory of gravity ● More attractive destinations have a greater ‘pull’ ● Attraction is linked to distance ● Using customer data we know how far people actually travel to their chosen stores ● Fundamental concept is logical, and simple to understand
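The slide does not give a formula, so the sketch below assumes a Huff-style formulation: each store's pull is its attractiveness divided by distance raised to a decay exponent, and a customer's probability of choosing a store is that pull as a share of the total. The function name, the decay exponent of 2, and the store sizes and distances are all illustrative assumptions.

```python
def huff_probabilities(attractiveness, distances, beta=2.0):
    """Probability that a customer chooses each store, Huff-style:
    pull = attractiveness / distance ** beta, normalised to sum to 1."""
    pulls = [a / (d ** beta) for a, d in zip(attractiveness, distances)]
    total = sum(pulls)
    return [p / total for p in pulls]

# Hypothetical example: a small store 1 km away and a store four times
# larger (e.g. by floor space) 2 km away, seen from one neighbourhood.
store_sizes = [1000, 4000]
store_distances = [1.0, 2.0]
shares = huff_probabilities(store_sizes, store_distances)
```

With a decay exponent of 2, doubling the distance cancels out a fourfold size advantage, so the two stores in this example split demand evenly. In practice the exponent would be calibrated from customer data on how far people actually travel.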
  • 96.
    GROCERY GRAVITY MODEL ● Gravity models are often very accurate at estimating customer patterns and interactions at close range… ● However, this accuracy usually wanes as you try to model sales from further afield: ● Consumers’ decisions are much harder to understand ● Consumers have more choice ● Are they workers or residents? ● Decision is not as simple as “I’ll just pop into my nearest, most attractive supermarket”…
  • 97.
    GROCERY GRAVITY MODEL ● The solution… To use machine learning to create an estimate for ‘Sales beyond 30 mins’ ● Created a datamart for each property in the portfolio (see opposite) and tested various machine learning algorithms to see if we could predict sales more accurately than before ● Eventually settled on a neural network. It’s not as easily interpretable, but gives better results on interactions which are inherently difficult to understand anyway! √ +20% of store sales beyond 30 minutes drivetime were more accurately predicted √ R² increased by 0.25 for beyond sales
  • 98.
  • 99.
    SOME OF OUR CUSTOMERS
  • 100.
    EXAMPLE FASHION CLIENT OBJECTIVE With six stores operating in Hong Kong, Dr. Martens wanted to understand the achievable turnover at each location. Additionally, an understanding of the best locations for new stores was required as part of a future store investment roadmap. RESULT We created new datasets and a bespoke model to calculate sales potentials for the existing store network. The model was then used in a future opportunity scan to identify the best locations for new stores in Hong Kong. STRATEGY MODEL DEMAND: Calculate how much people spend on footwear at the lowest possible geography. MAP THE RETAIL LANDSCAPE: Understand the locations where retailers cluster in Hong Kong. SALES POTENTIAL: Calculate how much turnover is achievable at a retail venue (e.g. Mall) and individual store level. OPPORTUNITY SCAN: Use the developed model and data sets to find ideal locations for the next Dr. Martens stores in Hong Kong.
  • 101.
    EXAMPLE F&B CLIENT OBJECTIVE Support the Dominos strategy to be the number one pizza company in each neighbourhood with a focus on franchisee profitability. STRATEGY • Understand the true drivers of store performance & the impact on nearby stores of opening new sites. • Predict new store sales and cannibalisation using a consistent, transparent fact base and model. • Improve the efficiency of the store forecasting process to allow more time for the value-add. • Deliver the ideal network blueprint and optimum network strategy. RESULT “Our work with GEOLYTIX has enabled us to form a consistent approach to new site forecasting, step changing our understanding of customer catchments and improving our ability to understand regional and store performance. The collaborative approach has resulted in us being able to make decisions around our future location strategy and form ideal network blueprints with significantly increased confidence.” Craig Donnellan, Head of Location Planning, Dominos Pizza.
  • 102.
    EXAMPLE RETAIL CLIENT: FOOD, FASHION & HOME OBJECTIVE Support a step change in the roll-out of the Food estate, understand the drivers of performance for the Clothing & Home estate and recommend the optimum network blueprint. RESULT “GEOLYTIX have worked with us to create a bespoke toolset enabling us to proactively set our strategy and quickly answer any what-if scenarios. Their analysis and recommendations have provided us with a consistent evidence base from which to make our network decisions.” STRATEGY • Create an efficient selection & sales forecasting process, based on a rigorous, objective fact base and a consistent approach. • Understand the drivers and catchments of the Clothing & Home estate, in order to build optimal networks. • Integrate custom models with existing data and software to create the M&S modelling toolkit. • Bulk run multiple national and regional scenarios to guide network strategy and create future blueprints.
  • 103.
    EXAMPLE REAL ESTATE ADVISOR PROJECT OBJECTIVE Data / analytical support in evaluating potential acquisition opportunities and ongoing asset management of retail assets. STRATEGY • Creation of town centre & grocery gravity models to assess: • Catchment profiles and fit to various potential new occupiers • Impacts of new greenfield developments and centre remodels • Existing retailer chain performance and potential ‘best next’ opportunities • Ad-hoc consultancy support • Assisting with major M&A and liquidity event support • Detailed asset reports including site visits to support redevelopment RESULT • We provide access to our data and models through a desktop GIS reporting tool which allows for: • Ad hoc area demographic reporting • Retail presence and chain list reports • Analogue tool to find similar locations • Drive time reporting • Bespoke ‘client ready’ site and area reports
  • 104.
    WHY GEOLYTIX World Class Modelling. We have delivered optimisation models for many of the most successful organisations in the world, across multiple sectors. Innovation. The Queen’s Award for Innovation reflects our passion for being on the leading edge of new data, technology, and ideas. Practical Senior-Level Experience. We are practical operators, with Director-level experience in property teams of some of the UK’s largest companies. Technical Expertise. We are experienced data scientists, sales forecasting modellers and spatial web application developers, and will build a bespoke solution. Every element of our solution, from the analysis to the platform, will be designed to meet your specific requirements. We are global. Accounting for the often vast differences in structure, maturity, and data availability and quality, we are able to apply a consistent approach across territories in order to support fact-based decisions. Proven Track Record. We have delivered similar solutions many times before. We will deliver to spec, to time, and to a fixed budget. Genuine Partnership. Our commitment is to work closely with you through to deployment, and maintain the support and relationship beyond.
  • 105.
    Thank you! lrutherford@cartodb.com CEO & Data Scientist at Geolytix jsanchez@cartodb.com