Data Con LA 2020

Description

Analysts often find themselves lacking the data they need to train a specific model, either because they are short of records or because the records they have are not rich enough. Exposing some of the common challenges that arise when there is not enough data, we will walk through the creative process of training a real-life demand-forecasting model where we had to deal with both cases. Using as input only the weekly sales reports of 72 restaurants of a renowned fast-food franchise, we successfully trained a model for choosing the best coordinates at which to open new restaurants. Along the way we faced many common challenges that we addressed in creative and unorthodox ways, which we will share with the audience.

Starting from the small amount of data available for training the model, we complemented the coordinates of the actual points of sale with information about the businesses and points of interest around them, leveraging Google Places. We then translated all of that sparse information into more than 300 numeric variables that together describe the environment where each restaurant is located, based on the kinds of establishments around it. We also defined our very own formula, which we named the economic concentration index. Finally, we applied techniques including Ridge and Lasso regression, backward and forward selection, PCA, and cross-validation to reduce dimensionality and train a surprisingly good linear regression model, with a residual error fluctuating between 10 and 300 transactions against an average of 2,800 transactions per restaurant per week. With this model we can correctly choose, 9 times out of 10, the better place to open a new restaurant given only two candidate coordinates A and B.
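The feature-engineering and modeling steps described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the talk's actual code: the establishment categories, the Poisson-generated counts, the toy target, and the choice of scikit-learn's LassoCV/RidgeCV estimators are all assumptions standing in for the real Google Places enrichment and the 300+ variables.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy stand-in for the Google Places enrichment: for each restaurant,
# count the nearby establishments of each category. The talk derived
# 300+ such variables; a handful suffices to show the shape of the data.
categories = ["school", "office", "gym", "bank", "pharmacy", "bar"]
n_restaurants = 72
X = rng.poisson(lam=4.0, size=(n_restaurants, len(categories))).astype(float)

# Toy weekly-transactions target, loosely driven by the environment
# features around an average of 2,800 transactions per week.
true_w = rng.normal(size=len(categories))
y = 2800.0 + (X @ true_w) * 50.0 + rng.normal(scale=100.0, size=n_restaurants)

# Lasso with built-in cross-validation shrinks uninformative coefficients
# to exactly zero, performing variable selection automatically (one of
# several dimensionality-reduction routes mentioned in the abstract).
lasso_model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
lasso_model.fit(X, y)

# Alternative route: project the features onto a few principal components
# first, then fit a cross-validated Ridge regression on top.
pca_model = make_pipeline(StandardScaler(), PCA(n_components=3), RidgeCV())
pca_model.fit(X, y)

# Either fitted pipeline predicts weekly transactions for a candidate
# site from its environment features, so two candidate locations A and B
# can be compared directly by their predicted demand.
predicted = lasso_model.predict(X)
```

In practice the comparison between candidate coordinates A and B reduces to calling `predict` on the feature vector of each site and picking the one with the higher forecast.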
* Training without big data
* Enriching data with other sources
* Creating numeric variables from descriptive, sparse information
* Modeling with linear regression
* Forecasting demand

Speaker: Luis Valdeavellano, Martinexsa, Data Scientist