Machine Learning Fundamentals
Alexandra Johnson - Software Engineer, SigOpt
alexandra@sigopt.com
Twitter: @alexandraj777
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

What Is Machine Learning?
(Don't say "machine" or "learning")
A solution to a problem that improves with data
Data: emails, articles, images, list of homes
Problem: label an email as spam (classification), predict a home's price (regression), and others
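
The classification/regression distinction shows up directly in model APIs: classifiers predict discrete labels, regressors predict continuous values. A minimal sketch with scikit-learn (an assumption for illustration; the deck's own examples use a generic RandomForest pseudocode class, and the home features below are made up):

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: discrete labels (1 = spam, 0 = not spam)
spam_model = RandomForestClassifier()
spam_model.fit([[0.1, 1], [0.7, 20], [0.3, 92]], [0, 1, 1])

# Regression: continuous targets (home price in dollars),
# with illustrative [square_feet, bedrooms] features
price_model = RandomForestRegressor()
price_model.fit([[1200, 2], [2400, 4], [1800, 3]], [250000, 520000, 340000])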

Example: Classify Spam Emails
● Problem: quickly identify whether an email is spam or not spam
● Data: a list of emails and a list of "labels": spam or not spam
● Goal: a function that will correctly label never-before-seen emails as spam or not spam

Build - Train - Tune - Deploy
● Pick a model: XGBoost, random forest, MXNet CNN, etc.
● Transform your data so it is readable by the model
● Feature engineering: explore your data to pick out the information you think is important

Build - Train - Tune - Deploy Example
● Model: random forest
● Features: percentage of misspelled words, number of words from a blacklist, domain name of email sender

def extract_features(email):
    return [
        email.misspelled_words,
        email.words_on_blacklist,
        email.sender.domain,
    ]

Build - Train - Tune - Deploy
● Expose the model to your data so it can better solve your problem
● Think of the model as a class whose train() method has already been implemented for you
● Compute-intensive, best done on a server

Build - Train - Tune - Deploy Example
● Model: random forest
● Features: percentage of misspelled words, number of words from a blacklist, domain name of email sender

email_features = [
    [0.1, 1, 'hotmail.com'],
    [0.7, 20, 'gmail.com'],
    [0.3, 92, 'yahoo.com'],
]
labels = [0, 1, 1]
model = RandomForest()
model.train(email_features, labels)
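
One caveat the pseudocode glosses over: 'hotmail.com' is a string, and most real libraries expect numeric features, so the categorical domain column would be encoded first. A minimal sketch using scikit-learn's OneHotEncoder (an assumption, not the deck's code):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

numeric = np.array([[0.1, 1], [0.7, 20], [0.3, 92]])       # misspelled %, blacklist count
domains = [['hotmail.com'], ['gmail.com'], ['yahoo.com']]   # categorical sender domain
labels = [0, 1, 1]

# One-hot encode the domain column, then stitch it back onto the numeric columns
encoder = OneHotEncoder(handle_unknown='ignore')
domain_columns = encoder.fit_transform(domains).toarray()
features = np.hstack([numeric, domain_columns])

model = RandomForestClassifier()
model.fit(features, labels)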

Build - Train - Tune - Deploy
● Models have tunable knobs, aka "hyperparameters"
● Different hyperparameters = different performance
● Use a training data set for training and a separate validation data set for measuring performance
● Overfitting: your model does really well on your old data, but really badly on never-before-seen data

Build - Train - Tune - Deploy Example

def evaluate(num_leaves, max_depth):
    train_data, train_labels, validation_data, validation_labels = split(email_features, labels)
    model = RandomForest(num_leaves=num_leaves, max_depth=max_depth)
    model.train(train_data, train_labels)
    validation_score = model.score(validation_data, validation_labels)
    return validation_score
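
The slide assumes a split helper and does not show how evaluate gets called. A minimal sketch of both, assuming a plain shuffled hold-out split and an exhaustive grid over a few illustrative hyperparameter values (smarter search strategies such as random search or Bayesian optimization exist, but a grid is the simplest picture):

import random

def split(features, labels, validation_fraction=0.2):
    # Shuffle, then hold out a fraction of the data for validation
    indices = list(range(len(features)))
    random.shuffle(indices)
    cutoff = int(len(indices) * (1 - validation_fraction))
    train_idx, validation_idx = indices[:cutoff], indices[cutoff:]
    return (
        [features[i] for i in train_idx],
        [labels[i] for i in train_idx],
        [features[i] for i in validation_idx],
        [labels[i] for i in validation_idx],
    )

# Try every combination, keep the hyperparameters with the best validation score
best_score, best_hyperparameters = None, None
for num_leaves in [16, 32, 64]:
    for max_depth in [4, 8, 16]:
        score = evaluate(num_leaves, max_depth)
        if best_score is None or score > best_score:
            best_score = score
            best_hyperparameters = {'num_leaves': num_leaves, 'max_depth': max_depth}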

Build - Train - Tune - Deploy
● We train our model to solve our problem on old data, but we really want to solve our problem on new data
● Create a REST endpoint for accessing the model
● A/B test different versions

Build - Train - Tune - Deploy Example

model = RandomForest(**best_hyperparameters)  # best values found during tuning
model.train([extract_features(e) for e in emails], labels)  # train on extracted features

def is_spam(email):
    email_features = extract_features(email)
    return model.predict(email_features)
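
The REST endpoint from the deploy step is not shown on the slides. A minimal sketch using Flask (one possible choice; a managed option such as Amazon SageMaker hosting would also work), where parse_email is a hypothetical helper that turns the request payload into the email object extract_features expects:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/is_spam', methods=['POST'])
def is_spam_endpoint():
    # parse_email is a hypothetical helper: JSON payload -> email object
    email = parse_email(request.get_json())
    return jsonify({'spam': bool(is_spam(email))})

if __name__ == '__main__':
    app.run(port=8080)

A/B testing different model versions could then be done by routing a fraction of requests to a second endpoint serving the alternate model.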

Thanks! Questions?
alexandra@sigopt.com
Twitter: @alexandraj777
