A New Venue for Toronto: IBM Data Science Capstone Project by F. F. Mulks

A New Venue for Toronto
Capstone Project for IBM Data Science Professional Certificate via Coursera

ffmulksin
This is a presentation of my capstone project for the IBM
Data Science Professional Certificate via Coursera.[1]
IBM DATA SCIENCE PROFESSIONAL
The code is available as a Jupyter notebook on my GitHub
account:
https://github.com/ffmulks/data...certification
CODE AVAILABILITY
Florian is a chemistry researcher who strives to use data-
driven decisions to fast-forward experimental and
computational chemical research processes.
DR. FLORIAN F. MULKS
Introduction

Data science is the study of large quantities of data, which
can reveal insights that help make strategic choices.
WHAT IS DATA SCIENCE?
Data scientists need to be curious, judgemental and
argumentative. Finding a good story in the data and
telling it well is as important as data treatment.
DATA SCIENTISTS
Data science is a multidisciplinary field influenced by
computer science, software engineering, mathematics,
statistics, economics and business management, and it
uses tools from all of these.
METHODS
Data Science

Types of Machine Learning
[Diagram: Machine Learning branches into Unsupervised
Learning (Dimensionality Reduction, Clustering), Supervised
Learning (Regression, Classification), and Reinforcement
Learning.]
This is an overview of the algorithm groups classified as
machine learning methods. Please refer to the course
material for further details.[1]
QUICK REMINDER
A broad set of machine learning methods was applied
within the scope of this project. Justification and further
explanation are given in the discussion of our results.
USED WITHIN THIS PROJECT

Methodology
Cross-Industry Standard Process for Data Mining (1996),
a project by the EU and five companies. It is the most
widely used analytics model.
CRISP-DM, 1996
A more refined 10-step version of CRISP-DM by an IBM
data scientist. It adds the steps analytic approach, data
requirements, data collection and feedback.
JOHN ROLLINS’ MODEL, 2015
Analytics Solutions Unified Method for Data
Mining/Predictive Analytics, by IBM. It especially refines
the model around iteration.
ASUM-DM, 2015
[Diagram: CRISP-DM cycle around the data: Business
Understanding, Data Understanding, Data Preparation,
Modeling, Evaluation, Deployment.]

Business Understanding
Investors or city planners are looking to open a venue in
Toronto that is profitable and contributes toward the local
cultural infrastructure.
LAUNCH ANY VENUE
We impose no prior limitations on the location or type of
venue. The decision will be based purely on data-driven
predictions.
DECISION CIRCUMSTANCES
Toronto is a cultural melting pot and among the 10 most
populous cities in North America. Its venues and citizens
are richly diverse.
TORONTO, CANADA
Ref. [2]

Literature Overview
There are plenty of analyses of Toronto’s venue and
population structure because the city serves as the
standard example in the course capstone.
DATA SCIENCE CAPSTONE
Available studies aim at opening specific venue types,[3] at
finding best places to live,[4] or at declaring an area the
“best” neighborhood.[5]
AIM OF PREVIOUS STUDIES
Most studies start by clustering postal code areas and
then derive decisions from bar graphs of additional data
sources chosen to fit their goal.
METHODS SUMMARY
Ref. [2]

Toronto Postal Codes and Venues
We will use the 103 postal codes starting with M in
Toronto, Canada, scraped from Wikipedia together with
assigned boroughs and neighborhoods.
DATA SET
Venue data was requested from the Foursquare API
within 500 m radii of postal code centroids. A limit of 100
venues per postal code request was applied.
VENUES
The collected dataset contains names, addresses,
geographic locations, and venue categories. We have
extracted 2130 venues from 271 categories spread over
103 areas defined by 500 m radii around the center of
postal codes.
DATA OVERVIEW

Geographic Venue Distribution
For each postal code area, we computed the total number
of venues and the mean distance between them, measured
as great-circle (Haversine) distance on the Earth's surface.
FEATURE CONSTRUCTION
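The two features described above can be sketched as follows. This is a minimal illustration, not the notebook's code: the function names, the mean Earth radius of 6371 km, and the three venue coordinates are assumptions made for the example.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def mean_pairwise_distance_km(points):
    """Mean Haversine distance over all unordered pairs of (lat, lon) points."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(haversine_km(*p, *q) for p, q in pairs) / len(pairs)

# Hypothetical venue coordinates inside one 500 m radius (illustration only).
venues = [(43.6655, -79.3810), (43.6660, -79.3805), (43.6649, -79.3822)]
venue_count = len(venues)
mean_distance = mean_pairwise_distance_km(venues)
print(venue_count, round(mean_distance, 3))
```

Averaging over all unordered pairs is one possible reading of "mean distance between venues"; a nearest-neighbor mean would be an equally plausible alternative.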
Most of the investigated 500 m radii contain fewer than 20
venues. Venue distances vary little among the different
postal code areas.
DISTRIBUTION
Because little data exists per postal code radius, models
are expected to be of limited representativeness.
FIRST UNDERSTANDING
[Figure panels: distribution plot, box plots]

Venue Density Distribution
High venue counts should lead to high venue densities,
which could be a good measure of both the competitiveness
and the profitability of a location.
VENUE DENSITY
Because there is little data on neighborhoods with high
venue counts, no good correlation can be found. Density
decreases slightly with higher counts.
DISTRIBUTION
Distances seem broadly spread in the low venue count
neighborhoods. The mean actually slightly increases with
increasing venue counts. Let us see if we can model the
structure of the venue density.
WEAK CORRELATION EXPECTED

Venue Density Regression
Linear regression was applied to the venue density and
we created polynomial features of the distances and
venue counts to investigate the data structure.
REGRESSION MODELS
The linear model yields a poor fit with R² = 0.01; the
cubic model delivers an R² score of 0.06, explaining only
6% of the variance in the data.
CORRELATION
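The comparison of a linear and a cubic fit can be sketched as below. This is a hedged illustration under stated assumptions: the data are synthetic, the choice to regress a mean-distance feature on venue counts is a guess at the setup, and the helper names are invented for the example. It will not reproduce the R² values of 0.01 and 0.06 reported above.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def polynomial_r2(x, y, degree):
    """Fit a least-squares polynomial of the given degree; return training R²."""
    coeffs = np.polyfit(x, y, degree)
    return r2_score(y, np.polyval(coeffs, x))

# Synthetic stand-in for 103 postal code areas (assumption, not project data).
rng = np.random.default_rng(0)
venue_counts = rng.integers(1, 100, size=103).astype(float)
mean_distances = rng.normal(0.35, 0.1, size=103)  # uncorrelated by construction

r2_linear = polynomial_r2(venue_counts, mean_distances, degree=1)
r2_cubic = polynomial_r2(venue_counts, mean_distances, degree=3)
print(f"linear R2 = {r2_linear:.3f}, cubic R2 = {r2_cubic:.3f}")
```

On training data, the cubic fit can never score below the linear one because the linear model is nested in it, so a small gain in R² alone says little about real structure.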
We cannot employ the venue density for a priori binning
of our data, but the distance between the venues seems a
valuable feature to keep for further analyses.
FEATURE TREATMENT
cubic regression linear regression 

Venue Categories Distribution
Foursquare defines categories to explain what exactly the
customer can expect when visiting a venue.
FOURSQUARE CATEGORIES
The categories are extremely detailed. The vast majority
of the 271 venue categories occur only once in the
whole observed region in Toronto.
DISTRIBUTION
Clustering and Principal Component Analysis (PCA) will be
needed to reduce the dimensionality. With 271 categories
but only 103 samples (postal codes), the data matrix has
more features than samples, which reduces the choice of
applicable algorithms.
HIGH DIMENSIONALITY
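To make the shape of the problem concrete, the sketch below builds a small count matrix with postal codes as rows and categories as columns. The records are hypothetical and chosen only for illustration; the project's matrix would have 103 rows and 271 columns.

```python
from collections import Counter

# Hypothetical (postal code, venue category) records; illustration only.
records = [
    ("M4Y", "Coffee Shop"), ("M4Y", "Café"), ("M4Y", "Coffee Shop"),
    ("M5V", "Pub"), ("M1B", "Fast Food Restaurant"),
]

postal_codes = sorted({code for code, _ in records})
categories = sorted({cat for _, cat in records})
counts = Counter(records)

# Rows are postal codes (samples), columns are categories (features).
matrix = [[counts[(code, cat)] for cat in categories] for code in postal_codes]
n_samples, n_features = len(matrix), len(categories)

print(categories)
for code, row in zip(postal_codes, matrix):
    print(code, row)
```

Even this toy matrix has more columns than rows and is mostly zeros, which mirrors the sparsity that motivates dimensionality reduction.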

Clustering Algorithms
Several algorithms and initialization types were employed
to find similarities in neighborhoods based on their venue
counts and categories.
CLUSTERING
Most algorithms assign only downtown areas to smaller
clusters. The vast majority of areas end up in one or
sometimes two “low venue count” clusters.
DISCOVERED PATTERNS
K-Means (k-means++ initialization) was chosen for further
investigation as it was capable of showing some
structures even outside of downtown Toronto.
ALGORITHM CHOICE
[Figure: cluster maps for K-Means (Random, PCA), K-Means
(Random), K-Means (k-means++), DBSCAN, and Agglomerative
Clustering]

K-Means Number of Clusters k
Adding clusters always captures more of the variation in
the data. The usual choice is the number of clusters at
which this gain levels off (the "elbow").
ELBOW METHOD
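The elbow method can be sketched with a basic k-means loop that records the within-cluster sum of squares (inertia) for each k. This is a minimal sketch under stated assumptions: it uses plain random initialization instead of the k-means++ variant chosen in this project, the data are three synthetic blobs, and the function name is invented for the example.

```python
import numpy as np

def kmeans_inertia(X, k, n_iter=50, seed=0):
    """Run a basic Lloyd's k-means and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each sample to its nearest center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members; keep it if the cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(dists.min(axis=1).sum())

# Synthetic data: three well-separated blobs (assumption, not project data).
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(0.0, 0.5, size=(30, 2)) for c in blob_centers])

inertias = {k: kmeans_inertia(X, k) for k in range(1, 8)}
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

With well-separated blobs the inertia curve bends sharply at the true number of clusters; with highly similar samples the curve flattens gradually and no clear elbow appears.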
While downtown neighborhoods are located correctly at
all k, the elbow plot shows no clear optimal k because
most areas are highly similar.
RESULTS
Due to the high similarity of our observed areas,
clustering does not deliver useful insights with the
employed data set.
HIGH AREA SIMILARITY
[Figure panels: elbow plot; cluster maps for k = 3, 5, 7, 9]

2D Principal Component Analysis (PCA)
The axes are found to correlate to venue counts from
certain categories, indicating similarities between venues
of certain categories.
NEIGHBORHOODS
The matrix was transposed to find structures in our
categories rather than areas. The first principal
component (x-axis) represents high-density and the second
(y-axis) low-density neighborhoods.
CATEGORIES
Certain venues are likely to appear in certain density
neighborhoods. PCA is valuable for grouping the
categories. We will reduce the dimensionality of our data
set to 30 (11% of 271 categories) to create more
meaningful variables.
DATA STRUCTURE
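A two-component PCA on the count matrix and on its transpose can be sketched with a singular value decomposition. This is an illustration under assumptions: the matrix is synthetic Poisson noise with the project's shape (103 areas, 271 categories), and the helper name is invented for the example.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = X_centered @ Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)  # explained variance ratio per component
    return scores, explained[:n_components]

# Synthetic stand-in for the venue count matrix (assumption, not project data).
rng = np.random.default_rng(1)
counts = rng.poisson(0.3, size=(103, 271)).astype(float)

# Neighborhoods as samples: each area becomes a point in 2D.
area_scores, area_var = pca_reduce(counts, n_components=2)
# Transposed matrix: each category becomes a point in 2D.
category_scores, category_var = pca_reduce(counts.T, n_components=2)

print(area_scores.shape, category_scores.shape)
```

Transposing before the decomposition swaps the roles of samples and features, which is why the second call groups categories rather than areas.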
categories neighborhoods 

Model Development
The differences between true venue counts and predictions
from multiple linear regression (MLR) were used to
evaluate the demand for venues in Toronto neighborhoods.
PREDICTIVE MODEL
Categories were clustered to identify similar venue types
catering to the same needs. These clusters were used to
correct for demand that similar venues already saturate.
REAL DEMAND CORRECTION
The resulting predictions show both the location and the
category for venues that are in demand in Toronto.
DEMAND PREDICTION
[Flowchart boxes: Venue Counts in Categories, Category
Clusters, Predicted Venue Counts, Predicted Cluster Venue
Counts, Venue Demand, Cluster Demand, Real Venue Demand.
Steps: K-Means; Multiple Linear Regression; Difference
(True vs. Predicted); Product (Venue x Cluster).]

Demand Prediction
MLR with 5-fold cross-validation was performed to predict
every single one of our 273 variables (271 categories +
venue count + mean distance).
MULTIPLE LINEAR REGRESSION (MLR)
The resulting prediction matrix emulates average healthy
Toronto neighborhoods based on the number of
neighboring venues of certain categories.
VENUE COUNT PREDICTION
Differences between true and predicted values can be
employed to predict market oversaturation and, more
importantly, market demand for venue types.
VENUE DEMAND PREDICTION
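The prediction step can be sketched as below. Several points are assumptions rather than confirmed details of the notebook: each variable is predicted from all remaining variables, ordinary least squares with an intercept stands in for the MLR, demand is taken as predicted minus true, and the data are synthetic with 8 variables instead of 273.

```python
import numpy as np

def cross_val_predict_linear(X, y, n_splits=5, seed=0):
    """Out-of-fold predictions of y from X using ordinary least squares."""
    n = len(y)
    order = np.random.default_rng(seed).permutation(n)
    preds = np.empty(n)
    for test_idx in np.array_split(order, n_splits):
        train_idx = np.setdiff1d(order, test_idx)
        # Prepend a column of ones so the model includes an intercept.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        preds[test_idx] = A_test @ coef
    return preds

def predict_all_columns(data):
    """Predict every column from all remaining columns (assumed setup)."""
    predicted = np.empty_like(data, dtype=float)
    for j in range(data.shape[1]):
        others = np.delete(data, j, axis=1)
        predicted[:, j] = cross_val_predict_linear(others, data[:, j])
    return predicted

# Synthetic stand-in: 103 areas, 8 variables (assumption, not project data).
rng = np.random.default_rng(2)
data = rng.poisson(1.0, size=(103, 8)).astype(float)

predicted = predict_all_columns(data)
demand = predicted - data  # positive: fewer venues than the model expects
print(demand.shape)
```

Using out-of-fold predictions keeps each area's own counts out of the model that predicts it, so a large positive difference is not simply an artifact of fitting that area.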

Category Clustering
Some categories are very similar. Predicted coffee shop
demand might be fully saturated by existing cafés.
DEMAND SATURATION
We used K-Means clustering with our transposed data to
find 30 category clusters to reduce the dimensionality of
our data.
DIMENSIONALITY REDUCTION
The clusters capture other venues that might satisfy the
need for, e.g., coffee, such as hotels and restaurants,
along with some surprising categories.
CLUSTERS CAPTURE INTERACTIONS
General Travel, Modern European Restaurant,
Steakhouse, Restaurant, Plaza, Cuban Restaurant, Gift
Shop, New American Restaurant, Brazilian Restaurant,
Japanese Restaurant, Smoke Shop, French Restaurant,
Pub, Art Gallery, Shopping Mall, Speakeasy, Tea Room,
Italian Restaurant, Salon / Barbershop, Food Court,
Vegetarian / Vegan Restaurant, Seafood Restaurant,
Concert Hall, Nightclub, Gluten-free Restaurant, Soup
Place, American Restaurant, Bakery, Department Store,
Gastropub, Hotel, Coffee Shop, Opera House, Food Truck,
Lounge, Asian Restaurant, Art Museum, Cupcake Shop,
Train Station, Beer Bar, Colombian Restaurant, Café,
Record Shop, Bookstore, Deli / Bodega, Building, Men’s
Store, Fast Food Restaurant, Wine Bar, Dog Run,
Monument / Landmark, Museum, Thai Restaurant, Salad
Place
COFFEE SHOP CLUSTER

Demand in Venue Categories
We analyzed the demand as the difference between
predicted and true venue counts over the 30 category
clusters.
CLUSTER DEMAND
The highest demand for venues is found in the cluster
containing coffee shops. The map shows neighborhood
demands in the coffee cluster.
COFFEE CLUSTER HIGHLY DEMANDED
In most suburban areas, demand in the coffee cluster is
met. Many downtown areas would support several more
venues in the cluster.
STRONG DEMAND IN DOWNTOWN AREA
[Figure panels: bar chart, zoomable map of the coffee cluster]

Coffee Cluster Demand in Neighborhoods
The coffee cluster showed the highest demand of all
clusters. We now look at the distribution of this
demand over the different locations.
DEMAND LOCATION
Toronto seems to have a healthy number of venues in the
category, but their locations are suboptimal, with both
over- and undersaturated neighborhoods.
DEMAND DISTRIBUTION
There are several neighborhoods that strongly demand
more venues in the coffee cluster. Let us look into the
details to find out which venue category would be the
most lucrative to launch.
AREAS WITH HIGH DEMAND FOUND

Coffee and Café Demand
Coffee shops and cafés are the categories in highest
demand within their cluster. Their combined demand is
shown in the maps.
CATEGORY DEMAND
Toronto needs coffee! The two almost identical categories,
coffee shops and cafés, are in extreme demand, especially
in the Church and Wellesley area.
COFFEE DEMAND
We now need to make sure that this neighborhood is not
saturated with e.g. tea rooms and other venues that serve
coffee as well.
INTERACTION WITH OTHER VENUES
[Figure panels: bar chart, zoomable map]

Real Demand
We multiplied the market demand (negative market
saturation) of the whole cluster by the demand for
coffee shops and cafés alone.
DEMAND PRODUCT
In some cases, the coffee demand is largely saturated by
other venues (larger yellow than red circle). One extreme
demand product is found.
SATURATION OF COFFEE DEMAND
The Church and Wellesley neighborhood lacks 4.5 coffee
shops/cafés and 6.0 venues from the whole cluster, which
yields an outstanding demand product of 26.7.
OUTSTANDING DEMAND FOUND
[Map labels: demand per highlighted neighborhood]
Cafés   Cluster   Product
2.1     5.3       10.9
4.5     6.0       26.7
1.4     6.2       8.8
1.8     4.2       7.5
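The demand product can be rechecked from the values printed on the map labels. Multiplying the rounded café and cluster demands gives results slightly different from the reported products (for example 27.0 instead of 26.7). A plausible explanation, which is an assumption here, is that the reported products were computed from unrounded values.

```python
# (café demand, cluster demand, reported product) as printed on the map labels.
labels = [
    (2.1, 5.3, 10.9),
    (4.5, 6.0, 26.7),
    (1.4, 6.2, 8.8),
    (1.8, 4.2, 7.5),
]

for cafes, cluster, reported in labels:
    # Product of the rounded label values; may differ from the reported product.
    recomputed = round(cafes * cluster, 1)
    print(f"cafés={cafes}, cluster={cluster}: "
          f"recomputed {recomputed}, reported {reported}")

# Neighborhood label with the largest reported demand product.
best = max(labels, key=lambda row: row[2])
print(best)
```

Either way, the ranking is unaffected: the label with 4.5 and 6.0 has by far the largest product.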

Among all venue categories, coffee shops/cafés were
identified by linear regression to be in extreme demand
throughout many central Toronto areas.
HIGH COFFEE DEMAND
Venue clustering showed that there is an extreme coffee
demand in the Church and Wellesley area, which also lacks
further venues of similar categories.
BEST LOCATION FOR A STORE
Launch a coffee shop in Church and Wellesley. The area
lacks 4.5 coffee shops and a grand total of 6 venues
catering to the need for coffee. This should be very
lucrative!
ACTION RECOMMENDATION
Conclusions
Ref. [6]

Thanks for Reading!
Thank you for your attention. If you have any comments,
please get in touch. Constructive criticism is always
welcome.
This project was done to learn the ropes of data science
for application in my research aiming to enable simple
and semi-automatic computational chemical modelling
for experimental researchers. If you are interested in my
work, feel free to contact me or check out my homepage
and social media.
GET IN TOUCH
inff@mulks.ac
mulks.ac ffmulks
ffmulks
Ref. [6]

References
[1] Course material:
https://www.coursera.org/professional-certificates/ibm-data-science,
07/08/2020.
[2] Toronto skyline photography:
Wladyslaw Sojka via http://www.sojka.photo, 07/08/2020.
[3] Best place to open specific venue:
https://towardsdatascience.com/exploring-toronto-neighborhoods-to-open-an-indian-restaurant-ff4dd6bf8c8a,
https://capstoneprojectcoursera.wordpress.com/, 07/08/2020.
[4] Rental or personal flat:
https://medium.com/...-ibm-capstone-project-52b4292ef410,
https://www.linkedin.com/pulse/capstone-project-battle-neighborhoods-rohitaksh-gs/,
07/08/2020.
[5] Best neighbourhood:
https://github.com/gnavia007/Coursera_Capstone/,
http://roshangrewal.com/...finding-a-better-place-in-scarborough-toronto/,
07/08/2020.
[6] Coffee photography:
Mike Kenneally via https://unsplash.com, 07/08/2020.
