8. Transformation
Transform relational data into vectors
All algos need: matrices of numbers
Some need:
• 0.0 ≤ x ≤ 1.0
• mean = 0, σ = 1
Look out for algos requiring "normalized" or
"standardized" values → feature scaling
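Both scalings fit in a few lines of NumPy; the data here is made up for illustration:

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-max scaling: every feature ends up in [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: every feature ends up with mean = 0 and σ = 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```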
9. Categories
• Features with no numerical relation
• Category 5 doesn’t have 5x the y of category 1
• Fix: Dummy variables
• cat_1, cat_2, … cat_5 with values 0 or 1
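One way to build the dummy variables is pandas' `get_dummies` (the category values here are invented):

```python
import pandas as pd

df = pd.DataFrame({"category": [1, 5, 2, 5]})

# One 0/1 column per category — no fake numerical ordering
dummies = pd.get_dummies(df["category"], prefix="cat", dtype=int)
# columns: cat_1, cat_2, cat_5
```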
10. Missing Values
• days_since_last_purchase = null
How to deal with this?
0 or 999?
• Often intuitively clear from the data domain
One solution:
max(days_since_last_purchase of other users)
• HAS to be addressed
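That fill strategy, sketched with pandas (the values are invented):

```python
import pandas as pd

df = pd.DataFrame({"days_since_last_purchase": [3, 17, None, 42]})

# Fill missing values with the maximum observed among other users
fill = df["days_since_last_purchase"].max()
df["days_since_last_purchase"] = df["days_since_last_purchase"].fillna(fill)
```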
11. Outliers
• days_since_last_purchase = 2837
for a legacy customer
• If it’s irrelevant, get rid of the whole example
(legacy customer)
• Or cap at a max/min value
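Capping is a one-liner with pandas; the 365-day cap is an arbitrary choice for illustration:

```python
import pandas as pd

df = pd.DataFrame({"days_since_last_purchase": [3, 17, 42, 2837]})

# Cap the outlier at a chosen maximum instead of dropping the example
df["days_since_last_purchase"] = df["days_since_last_purchase"].clip(upper=365)
```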
12. Reduce Features
• Check for correlation between features;
drop one of each highly correlated pair
• Drop intuitively useless features
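A quick correlation check with pandas (the features are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [160, 170, 180, 190],
    "height_in": [63, 67, 71, 75],   # essentially a copy of height_cm
    "age":       [25, 40, 31, 58],
})

# Pairwise correlation matrix; values near ±1 flag redundant features
corr = df.corr()
redundant = bool(corr.loc["height_cm", "height_in"] > 0.95)
```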
13. A Better Model
• Uses fewer features, i.e. is simpler
• Trained on more training examples
16. Online vs Offline
OFFLINE
Retrain the whole model from time to time
and upload it
ONLINE
Algorithm runs each time a new example is added
and adapts the model a bit
Examples should be fed in randomized order
18. Build Model
• Collect data
Traffic source, categories looked at prior to signup,
etc. and y = category of purchase after signup
• Analyze
Try to make predictions using e.g. logistic
regression
• Train final model
• Save weights to DB or JSON or file
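The train-and-save steps, sketched with scikit-learn and toy stand-in data (real features would be traffic source, categories viewed, etc.):

```python
import json
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the collected features and purchase labels
X = np.array([[0, 1], [1, 0], [1, 1], [0, 0]], dtype=float)
y = np.array([1, 0, 1, 0])  # y = category purchased after signup

model = LogisticRegression().fit(X, y)

# Save the learned weights to JSON
weights = {"coef": model.coef_.tolist(), "intercept": model.intercept_.tolist()}
with open("weights.json", "w") as f:
    json.dump(weights, f)
```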
19. Predict
• User signs up
• Load weights and predict probabilities of
categories.
• If P(category X) > threshold
classify user as "interested in category X"
• Send out newsletters
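The prediction side can be sketched by applying the logistic function to the saved weights; the weight values, feature vector, and threshold below are all invented for illustration:

```python
import numpy as np

# Assume these weights were saved at training time (illustrative values)
weights = {"coef": [[1.2, -0.4]], "intercept": [0.3]}

coef = np.array(weights["coef"][0])
intercept = weights["intercept"][0]

def predict_proba(x):
    # Logistic function over the saved weights
    return 1.0 / (1.0 + np.exp(-(x @ coef + intercept)))

THRESHOLD = 0.7
user = np.array([1.0, 1.0])  # feature vector of the fresh signup (made up)
interested = predict_proba(user) > THRESHOLD  # → send the newsletter if True
```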
20. Tips
• Use R or Python/Jupyter/Pandas to analyze data
• Test if you need a separate system for predictions
or just for training
• Try not to implement algos yourself
If you do, use numerical computation libraries
(probably wrappers for C or Fortran code)
• Be sure the past predicts the future
21. Ethics
• Your model might turn into a racist, sexist profiler
• Be aware of what your input features mean &
what you actually base your predictions on
• Relatively harmless when predicting product
categories - questionable for credit ratings