Machine learning involves using data to allow computers to learn without being explicitly programmed. There are three main types of machine learning problems: supervised learning, unsupervised learning, and reinforcement learning. The typical machine learning process involves five steps: 1) data gathering, 2) data preprocessing, 3) feature engineering, 4) algorithm selection and training, and 5) making predictions. Generalization is an important concept that relates to how well a model trained on one dataset can predict outcomes on an unseen dataset. Both underfitting and overfitting can lead to poor generalization by introducing bias or variance errors.
2. What is MC Learning
www.skillslash.com
The subfield of computer science that “gives computers the ability to learn
without being explicitly programmed”.
(Arthur Samuel, 1959)
A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P if its performance at tasks in T, as measured by P,
improves with experience E.”
(Tom Mitchell, 1997)
Using data for answering
questions
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
28.
29.
30.
31.
32.
33.
34.
35.
36.
37.
38.
39.
40. High Bias and Low Variance
(Low Flexibility)
Low Bias and High Variance
(Too Flexibility)
Low Bias and High Variance
(Balanced Flexibility)
41. Bias Error:
The bias is known as the difference between the prediction of the values by the ML model and the correct
value. Being high in biasing gives a large error in training as well as testing data.
Variance Error:
Variance is the amount that the estimate of the target function will change if different training data was
used.
50. Types of Supervised ML
Supervise
d
Unsupervise
d
Reinforceme
nt
Output is a discrete variable
(e.g.,
Defaulter and Non Defaulter
Spam and non spam
Purchaser Non Purchaser)
Classificatio
n
Regressio
n
Output is continuous (e.g.,
price of house,
temperature)
www.skillslash.com
55. Types of Machine Learning
Problems
Supervise
d
Unsupervise
d
Reinforceme
nt
Supervise
d
Is this a cat or a dog?
Are these emails spam or not?
Unsupervised
Predict the market value of houses, given the
square meters, number of rooms,
neighborhood, etc.
Reinforcement
Learn through examples of which we know
the desired output (what we want to
predict).
56. Types of Machine Learning
Problems
Unsupervise
d
Supervised
There is no desired output. Learn something
about the data. Latent relationships.
I want to find anomalies in the credit card
usage patterns of my customers.
Reinforcement
I have photos and want to put them in
20 groups.
www.skillslash.com
57. Types of Machine Learning
Problems
Unsupervise
d
Supervise
d
Reinforceme
nt
Useful for learning structure in the data
(clustering), hidden correlations, reduce
dimensionality, etc.
www.skillslash.com
58. Environment gives feedback via a
positive or negative reward signal.
Unsupervised
Reinforceme
nt
Supervise
d
An agent interacts with an environment and
watches the result of the interaction.
Types of Machine Learning
Problems
www.skillslash.com
59.
60. Data Gathering
60
Might depend on human work
• Manual labeling for supervised learning.
• Domain knowledge. Maybe even experts.
May come for free, or “sort of”
• E.g., Machine Translation.
The more the better: Some algorithms need large amounts of data
to be useful (e.g., neural networks).
The quantity and quality of data dictate the model accuracy
www.skillslash.com
61. Data Preprocessing
61
Is there anything wrong with the data?
• Missing values
• Outliers
• Bad encoding (for text)
• Wrongly-labeled examples
• Biased data
• Do I have many more samples of one
class than the rest?
Need to fix/remove data?
www.skillslash.com
62. Feature Engineering
62
What is a feature?
A feature is an individual measurable
property of a phenomenon being
observed
Our inputs are represented by a set of
features.
To classify spam email, features could be:
• Number of words that have been
ch4ng3d
like this.
• Language of the email
Buy ch34p drugs
from the
ph4rm4cy now :)
:) :)
(2, 0, 3)
Feature
engineerin
g
www.skillslash.com
63. Feature Engineering
63
Extract more information from existing data, not adding “new” data
per-se
• Making it more useful
• With good features, most algorithms can learn
faster It can be an art
• Requires thought and knowledge of the
data Two steps:
• Variable transformation (e.g., dates into weekdays,
normalizing)
www.skillslash.com
64. Algorithm Selection & Training
64
Supervise
d
• Linear classifier
• Naive Bayes
• Support Vector Machines
(SVM)
• Decision Tree
• Random Forests
• k-Nearest Neighbors
• Neural Networks (Deep
learning)
Unsupervise
d
• PCA
• t-SNE
• k-mean
s
• DBSCAN
Reinforcemen
t
• SARSA–λ
• Q-Learnin
g
www.skillslash.com
65. 65
THE MACHINE LEARNING FRAMEWORK
y = f(x)
● Training: given a training set of labeled examples {(x1
,y1
), …,
(xN
,yN
)}, estimate the prediction function f by minimizing the
prediction error on the training set
● Testing: apply f to a never before seen test example x and
output the predicted value y = f(x)
output prediction
function
Image
feature
www.skillslash.com
66. Goal of training: making the correct prediction as often as
possible
• Incremental improvement:
• Use of metrics for evaluating performance and comparing
solutions
• Hyperparameter tuning: more an art than a science
Algorithm Selection & Training
66
Predic
t
Adjus
t
www.skillslash.com
67. Summary
67
• Machine Learning is intelligent use of data to answer questions
• Enabled by an exponential increase in computing power and
data availability
• Three big types of problems: supervised, unsupervised,
reinforcement
• 5 steps to every machine learning solution:
1. Data Gathering
2. Data Preprocessing
3. Feature Engineering
4. Algorithm Selection & Training
5. Making Predictions www.skillslash.com
68. Generalization
● How well does a learned model generalize from the data it
was trained on to a new test set?
Training set (labels known) Test set (labels
unknown)
69. Generalization
● Components of generalization error
○ Bias: how much the average model over all training sets differ from the true
model?
■ Error due to inaccurate assumptions/simplifications made by the model
■ Using very less features
○ Variance: how much models estimated from different training sets differ from
each other
● Underfitting: model is too “simple” to represent all the relevant class
characteristics
○ High bias and low variance
○ High training error and high test error
● Overfitting: model is too “complex” and fits irrelevant characteristics
(noise) in the data
○ Low bias and high variance
○ Low training error and high test error
70.
71.
72. Bias-Variance Trade-off
• Models with too few parameters are
inaccurate because of a large bias (not
enough flexibility).
• Bias can also come due to wrong
assumption.
• Lead to Train error
• Models with too many parameters are
inaccurate because of a large variance
(too much sensitivity to the sample).
• Lead to Test Error