What is Machine Learning - Introduction to Machine Learning
What Is Machine Learning?
Machine learning is the science (and art) of programming computers so they can learn from data.
Here is a slightly more general definition:
[Machine learning is the] field of study that gives
computers the ability to learn without being
explicitly programmed.
—Arthur Samuel, 1959
And a more engineering-oriented one:
A computer program is said to learn from
experience E with respect to some task T and
some performance measure P, if its performance
on T, as measured by P, improves with experience
E.
—Tom Mitchell, 1997
Types of Machine Learning Systems
There are so many different types of machine learning systems that it is useful to
classify them in broad categories, based on the following criteria:
How they are supervised during training (supervised, unsupervised, semi-supervised, self-supervised, and others)
Whether or not they can learn incrementally on the fly (online versus batch learning)
Whether they work by simply comparing new data points to known data points, or instead by detecting patterns in the training data and building a predictive model, much like scientists do (instance-based versus model-based learning)
Training Supervision
• ML systems can be classified according to the amount and type of supervision they get during training.
• There are many categories, but we’ll discuss the main ones:
• supervised learning
• unsupervised learning
• self-supervised learning
• semi-supervised learning, and
• reinforcement learning.
Supervised learning
• In supervised learning, the training set you feed to the algorithm includes the desired solutions, called labels.
Supervised learning
A typical supervised learning task is classification.
The spam filter is a good example of this: it is trained with many
example emails along with their class (spam or ham), and it
must learn how to classify new emails.
Another typical task is to predict a target numeric value, such as
the price of a car, given a set of features (mileage, age, brand,
etc.). This sort of task is called regression (Figure 1-6).
To train the system, you need to give it many examples of cars,
including both their features and their targets (i.e., their prices).
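A regression task like this can be sketched in a few lines of scikit-learn. The car features, values, and prices below are invented purely for illustration, not taken from any real dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented training examples: each row is [mileage, age in years]
X = np.array([[30_000, 2], [60_000, 5], [90_000, 8], [120_000, 10]])
y = np.array([22_000, 16_000, 11_000, 8_000])  # target prices

model = LinearRegression()
model.fit(X, y)

# Predict the price of an unseen car from its features
predicted_price = model.predict([[75_000, 6]])
```

Real tasks would of course use far more examples and features, but the shape of the API is the same: fit on features plus targets, then predict on new features.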
Supervised learning
Note that some regression models can be used for classification as well, and vice versa. For example, logistic regression is
as well, and vice versa. For example, logistic regression is
commonly used for classification, as it can output a value that
corresponds to the probability of belonging to a given class
(e.g., 20% chance of being spam).
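A minimal sketch of this idea, using an invented one-dimensional toy dataset (the single feature here is a made-up "spamminess" score, not a real email representation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented toy data: one feature per "email"; label 1 = spam, 0 = ham
X = np.array([[0.1], [0.2], [0.4], [0.6], [0.8], [0.9]])
y = np.array([0, 0, 0, 1, 1, 1])

log_reg = LogisticRegression()
log_reg.fit(X, y)

# predict_proba returns [P(ham), P(spam)] for each instance
proba = log_reg.predict_proba([[0.7]])
```

The two probabilities sum to 1, and thresholding P(spam) at 0.5 turns the regression output into a classification decision.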
NOTE
The words target and label are generally
treated as synonyms in supervised learning,
but target is more common in regression
tasks and label is more common in
classification tasks. Moreover, features are
sometimes called predictors or attributes.
These terms may refer to individual samples
(e.g., “this car’s mileage feature is equal to
15,000”) or to all samples (e.g., “the mileage
feature is strongly correlated with price”).
Unsupervised learning
In unsupervised learning, as you might guess, the
training data is unlabeled (Figure 1-7). The system tries
to learn without a teacher.
For example, say you have a lot of data about your blog’s
visitors. You may want to run a clustering algorithm to
try to detect groups of similar visitors (Figure 1-8).
At no point do you tell the algorithm which group a visitor
belongs to: it finds those connections without your help.
For example, it might notice that 40% of your visitors are
teenagers who love comic books and generally read your
blog after school, while 20% are adults who enjoy
sci-fi and who visit during the weekends. If you use a
hierarchical clustering algorithm, it may also subdivide
each group into smaller groups. This may help you target
your posts for each group.
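The visitor-clustering idea can be sketched with k-means. The visitor features and group sizes below are synthetic, generated only for this example:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic visitors: [age, comic-book interest, typical visit hour]
rng = np.random.default_rng(42)
teens = rng.normal([15, 0.9, 16], [1.0, 0.05, 1.0], size=(40, 3))
adults = rng.normal([35, 0.2, 20], [3.0, 0.05, 1.0], size=(20, 3))
X = np.vstack([teens, adults])

# We never tell the algorithm who is who; it finds the groups itself
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
```

Because the data is unlabeled, the cluster ids (0 and 1) are arbitrary; what matters is that similar visitors end up in the same group.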
Binary Classifier
Now let’s pick a classifier and train it.
A good place to start is with a stochastic gradient descent (SGD, or stochastic GD) classifier,
using Scikit-Learn’s SGDClassifier class.
This classifier is capable of handling very large datasets efficiently. This is in part because
SGD deals with training instances independently, one at a time, which also makes SGD well
suited for online learning, as you will see later.
Let’s create an SGDClassifier and train it on the whole training set:

```python
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
```

Now we can use it to detect images of the number 5.
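The detection step can be sketched as follows. Since the full MNIST download is not shown here, this self-contained sketch uses scikit-learn’s small bundled digits dataset as a stand-in, with variable names mirroring the text:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

# Stand-in for MNIST: scikit-learn's small bundled digits dataset
X_train, y_train = load_digits(return_X_y=True)
y_train_5 = (y_train == 5)  # True for all 5s, False for everything else

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

some_digit = X_train[y_train == 5][0]  # grab an image of a 5
prediction = sgd_clf.predict([some_digit])
```

The classifier returns a boolean array: True means "this looks like a 5."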
Multiclass Classification
Whereas binary classifiers distinguish between two classes, multiclass
classifiers (also called multinomial classifiers) can distinguish between more
than two classes.
Some Scikit-Learn classifiers (e.g., LogisticRegression,
RandomForestClassifier, and GaussianNB) are capable of handling multiple
classes natively.
Others are strictly binary classifiers (e.g., SGDClassifier and SVC).
However, there are various strategies that you can use to perform multiclass
classification with multiple binary classifiers.
Multiclass Classification
One way to create a system that can classify the digit images into
10 classes (from 0 to 9) is to train 10 binary classifiers, one for each
digit (a 0-detector, a 1-detector, a 2-detector, and so on).
Then when you want to classify an image, you get the decision
score from each classifier for that image and you select the class
whose classifier outputs the highest score.
This is called the one-versus-the-rest (OvR) strategy, or sometimes
one-versus-all (OvA).
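Scikit-Learn lets you force the OvR strategy explicitly with the OneVsRestClassifier wrapper. A sketch, again with the small bundled digits dataset standing in for MNIST:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

X, y = load_digits(return_X_y=True)

# Wrap a binary classifier so one detector is trained per digit
ovr_clf = OneVsRestClassifier(SGDClassifier(random_state=42))
ovr_clf.fit(X, y)

n_detectors = len(ovr_clf.estimators_)  # one binary classifier per class
```

With 10 classes, the wrapper trains 10 binary detectors and, at prediction time, picks the class whose detector scores highest.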
Multiclass Classification
Another strategy is to train a binary classifier for every pair of digits: one to distinguish 0s and
1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on.
This is called the one-versus-one (OvO) strategy. If there are N classes, you need to train N × (N –
1) / 2 classifiers.
For the MNIST problem, this means training 45 binary classifiers! When you want to classify an
image, you have to run the image through all 45 classifiers and see which class wins the most
duels.
The main advantage of OvO is that each classifier only needs to be trained on the part of the
training set containing the two classes that it must distinguish.
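The OvO strategy can likewise be forced with scikit-learn’s OneVsOneClassifier wrapper; this sketch uses the bundled digits dataset as a stand-in for MNIST:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsOneClassifier

X, y = load_digits(return_X_y=True)

# One binary classifier per pair of classes: N * (N - 1) / 2 of them
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X, y)

n_pairs = len(ovo_clf.estimators_)  # 10 * 9 / 2 = 45 for 10 classes
```

Each of the 45 classifiers only ever sees the training instances of its own two digits, which is exactly why OvO suits algorithms that scale poorly with training-set size.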
Multiclass Classification
Some algorithms (such as support vector machine classifiers) scale
poorly with the size of the training set.
For these algorithms OvO is preferred because it is faster to train
many classifiers on small training sets than to train few classifiers
on large training sets.
For most binary classification algorithms, however, OvR is
preferred.
Multiclass Classification
Scikit-Learn detects when you try to use a binary classification algorithm for a multiclass classification task, and it automatically runs OvR or OvO, depending on the algorithm.
Let’s try this with a support vector machine classifier using the sklearn.svm.SVC class.
We’ll only train on the first 2,000 images, or else it will take a very long time:
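The training step might look like the following sketch. The small bundled digits dataset stands in for MNIST here, and since it holds fewer than 2,000 images, the slice is 1,000 rather than the 2,000 the text uses:

```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC

# Stand-in for MNIST; note the multiclass targets y, not y_train_5
X, y = load_digits(return_X_y=True)
X_train, y_train = X[:1000], y[:1000]

svm_clf = SVC(random_state=42)
svm_clf.fit(X_train, y_train)  # trains on all 10 digit classes at once
```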
Multiclass Classification
That was easy! We trained the SVC using the original target classes from 0 to 9 (y_train), instead of the 5-versus-the-rest target classes (y_train_5).
Since there are 10 classes (i.e., more than
2), Scikit-Learn used the OvO strategy
and trained 45 binary classifiers.
Now let’s make a prediction on an image:
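A self-contained sketch of the prediction step (again with the bundled digits dataset standing in for MNIST):

```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
svm_clf = SVC(random_state=42).fit(X[:1000], y[:1000])

some_digit = X[1000]  # an image the classifier did not train on
prediction = svm_clf.predict([some_digit])
```

Under the hood, this single predict() call runs the image through all the pairwise classifiers and returns the class that wins the most duels.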
Multiclass Classification
That’s correct! This code actually made 45 predictions (one per pair of classes) and selected the class that won the most duels.
If you call the decision_function() method, you
will see that it returns 10 scores per instance:
one per class.
Each class gets a score equal to the number of
won duels plus or minus a small tweak (max
±0.33) to break ties, based on the classifier
scores:
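Those per-class scores can be inspected like this (same stand-in setup as before, with the bundled digits dataset replacing MNIST):

```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
svm_clf = SVC(random_state=42).fit(X[:1000], y[:1000])

# One row per instance, one score per class; the predicted class
# is the one with the highest score
scores = svm_clf.decision_function([X[1000]])
```

With SVC’s default decision_function_shape="ovr", the pairwise duel results are aggregated into one score per class, so you get an array of shape (1, 10) for a single image.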