What Is Machine Learning?
Machine learning is the science (and art) of
programming computers so they can learn from data.
Here is a slightly more general definition:
[Machine learning is the] field of study that gives
computers the ability to learn without being
explicitly programmed.
—Arthur Samuel, 1959
And a more engineering-oriented one:
A computer program is said to learn from
experience E with respect to some task T and
some performance measure P, if its performance
on T, as measured by P, improves with experience
E.
—Tom Mitchell, 1997
Types of Machine Learning Systems
There are so many different types of machine learning systems that it is useful to
classify them in broad categories, based on the following criteria:
How they are supervised during training (supervised, unsupervised, semi-
supervised, self-supervised, and others)
Whether or not they can learn incrementally on the fly (online versus batch
learning)
Whether they work by simply comparing new data points to known data points, or
instead by detecting patterns in the training data and building a predictive model,
much like scientists do (instance-based versus model-based learning)
Training Supervision
• ML systems can be classified according to the amount and type of supervision they get
during training.
• There are many categories, but we’ll discuss the main ones:
• supervised learning
• unsupervised learning
• self-supervised learning
• semi-supervised learning, and
• reinforcement learning.
Supervised learning
• In supervised learning, the training set you feed to the algorithm includes the desired solutions, called labels.
Supervised learning
A typical supervised learning task is classification.
The spam filter is a good example of this: it is trained with many
example emails along with their class (spam or ham), and it
must learn how to classify new emails.
Another typical task is to predict a target numeric value, such as
the price of a car, given a set of features (mileage, age, brand,
etc.). This sort of task is called regression (Figure 1-6).
To train the system, you need to give it many examples of cars,
including both their features and their targets (i.e., their prices).
Supervised learning
Note that some regression models can be used for classification
as well, and vice versa. For example, logistic regression is
commonly used for classification, as it can output a value that
corresponds to the probability of belonging to a given class
(e.g., 20% chance of being spam).
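As a minimal sketch of this point (on tiny made-up one-feature data, not an actual spam dataset), logistic regression's predict_proba() returns the estimated probability of each class:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, two well-separated classes.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

log_reg = LogisticRegression().fit(X, y)

# One probability per class; an instance near the class-0 cluster
# gets a high probability of belonging to class 0.
proba = log_reg.predict_proba([[2.5]])
print(proba)
```

A classifier like this can be used for classification simply by thresholding the probability (e.g., predict class 1 when it exceeds 50%).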
NOTE
The words target and label are generally
treated as synonyms in supervised learning,
but target is more common in regression
tasks and label is more common in
classification tasks. Moreover, features are
sometimes called predictors or attributes.
These terms may refer to individual samples
(e.g., “this car’s mileage feature is equal to
15,000”) or to all samples (e.g., “the mileage
feature is strongly correlated with price”).
Unsupervised learning
In unsupervised learning, as you might guess, the
training data is unlabeled (Figure 1-7). The system tries
to learn without a teacher.
For example, say you have a lot of data about your blog’s
visitors. You may want to run a clustering algorithm to
try to detect groups of similar visitors (Figure 1-8).
At no point do you tell the algorithm which group a visitor
belongs to: it finds those connections without your help.
For example, it might notice that 40% of your visitors are
teenagers who love comic books and generally read your
blog after school, while 20% are adults who enjoy
sci-fi and who visit during the weekends. If you use a
hierarchical clustering algorithm, it may also subdivide
each group into smaller groups. This may help you target
your posts for each group.
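A minimal sketch of this idea, using made-up visitor features (age and a comic-interest score are illustrative stand-ins, not real blog data): a clustering algorithm such as KMeans recovers the groups without ever seeing group labels.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical visitor features: [age, comic_interest_score].
rng = np.random.default_rng(42)
teens = rng.normal(loc=[15.0, 8.0], scale=1.0, size=(40, 2))
adults = rng.normal(loc=[35.0, 2.0], scale=1.0, size=(20, 2))
X = np.vstack([teens, adults])

# No labels are passed to fit(): the algorithm finds the groups itself.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_
```

With well-separated groups like these, the two clusters found by KMeans line up with the teen and adult populations, even though the algorithm was never told they exist.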
MNIST Dataset
Training a Binary Classifier
Binary Classifier
Now let’s pick a classifier and train it.
A good place to start is with a stochastic gradient descent (SGD, or stochastic GD) classifier,
using Scikit-Learn's SGDClassifier class.
This classifier is capable of handling very large datasets efficiently. This is in part because
SGD deals with training instances independently, one at a time, which also makes SGD well
suited for online learning, as you will see later.
Let's create an SGDClassifier and train it on the whole training set. Then we can use it to detect images of the number 5:

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
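A self-contained sketch of this training-and-detection step (small random stand-in arrays replace the MNIST X_train and y_train_5, which are defined elsewhere in the book):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Stand-in data: in the book, X_train holds flattened MNIST images and
# y_train_5 is True for images of the digit 5.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 64))
y_train_5 = rng.integers(0, 2, size=200).astype(bool)

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

# A True prediction would mean "this looks like a 5".
pred = sgd_clf.predict(X_train[:1])
```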
Multiclass Classification
Whereas binary classifiers distinguish between two classes, multiclass
classifiers (also called multinomial classifiers) can distinguish between more
than two classes.
Some Scikit-Learn classifiers (e.g., LogisticRegression,
RandomForestClassifier, and GaussianNB) are capable of handling multiple
classes natively.
Others are strictly binary classifiers (e.g., SGDClassifier and SVC).
However, there are various strategies that you can use to perform multiclass
classification with multiple binary classifiers.
Multiclass Classification
One way to create a system that can classify the digit images into
10 classes (from 0 to 9) is to train 10 binary classifiers, one for each
digit (a 0-detector, a 1-detector, a 2-detector, and so on).
Then when you want to classify an image, you get the decision
score from each classifier for that image and you select the class
whose classifier outputs the highest score.
This is called the one-versus-the-rest (OvR) strategy, or sometimes
one-versus-all (OvA).
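Scikit-Learn exposes this strategy explicitly through the OneVsRestClassifier wrapper. A minimal sketch on a toy 3-class problem (standing in for the 10 MNIST digit classes):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

# Toy 3-class dataset standing in for the 10-digit MNIST task.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

# One binary SGDClassifier is trained per class; prediction picks the
# class whose detector outputs the highest decision score.
ovr_clf = OneVsRestClassifier(SGDClassifier(random_state=42)).fit(X, y)
print(len(ovr_clf.estimators_))  # one binary classifier per class
```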
Multiclass Classification
Another strategy is to train a binary classifier for every pair of digits: one to distinguish 0s and
1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on.
This is called the one-versus-one (OvO) strategy. If there are N classes, you need to train N × (N –
1) / 2 classifiers.
For the MNIST problem, this means training 45 binary classifiers! When you want to classify an
image, you have to run the image through all 45 classifiers and see which class wins the most
duels.
The main advantage of OvO is that each classifier only needs to be trained on the part of the
training set containing the two classes that it must distinguish.
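The OvO strategy is likewise available as Scikit-Learn's OneVsOneClassifier wrapper. A minimal sketch with 4 classes, so the N × (N – 1) / 2 count is easy to verify (4 × 3 / 2 = 6 classifiers; with N = 10 it would be 45):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsOneClassifier

# Toy 4-class dataset; for MNIST's 10 classes this would build 45 classifiers.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=4, random_state=42)

# One binary classifier per pair of classes: 4 * 3 / 2 = 6.
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42)).fit(X, y)
print(len(ovo_clf.estimators_))
```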
Multiclass Classification
Some algorithms (such as support vector machine classifiers) scale
poorly with the size of the training set.
For these algorithms OvO is preferred because it is faster to train
many classifiers on small training sets than to train few classifiers
on large training sets.
For most binary classification algorithms, however, OvR is
preferred.
Multiclass Classification
Scikit-Learn detects when you try to use a binary classification algorithm for a multiclass classification task, and it automatically runs OvR or OvO, depending on the algorithm.
Let's try this with a support vector machine classifier, using the sklearn.svm.SVC class.
We'll only train on the first 2,000 images, or else it will take a very long time:
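The training code the slide refers to looks roughly like this. As a runnable stand-in, the sketch below uses Scikit-Learn's small built-in 8×8 digits dataset rather than the full MNIST set (which must be fetched separately); the pattern is the same:

```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC

# Stand-in for MNIST: small 8x8 digit images, classes 0 to 9.
digits = load_digits()
X_train, y_train = digits.data, digits.target

# Train on the first slice only, to keep training time short.
svm_clf = SVC(random_state=42)
svm_clf.fit(X_train[:2000], y_train[:2000])
pred = svm_clf.predict(X_train[:1])
```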
Multiclass Classification
That was easy! We trained the SVC using the original target classes from 0 to 9 (y_train), instead of the 5-versus-the-rest target classes (y_train_5).
Since there are 10 classes (i.e., more than 2), Scikit-Learn used the OvO strategy and trained 45 binary classifiers.
Now let's make a prediction on an image:
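A hedged, self-contained version of this prediction step (again using the small built-in digits dataset as a stand-in for MNIST, and training on a short slice so it runs quickly):

```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC

digits = load_digits()
svm_clf = SVC(random_state=42).fit(digits.data[:500], digits.target[:500])

# Predict the class of a single image (a 1-row array).
some_digit = digits.data[:1]
pred = svm_clf.predict(some_digit)
print(pred)  # the predicted digit class, 0 to 9
```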
Multiclass Classification
That's correct! This code actually made 45 predictions, one per pair of classes, and it selected the class that won the most duels.
If you call the decision_function() method, you will see that it returns 10 scores per instance: one per class.
Each class gets a score equal to the number of won duels, plus or minus a small tweak (max ±0.33) to break ties, based on the classifier scores:
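A minimal sketch of this call (on the small built-in digits dataset as a stand-in for MNIST): even though OvO trains 45 pairwise classifiers internally, decision_function() aggregates the duels into one score per class.

```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC

digits = load_digits()
svm_clf = SVC(random_state=42).fit(digits.data[:500], digits.target[:500])

# One aggregated score per class for each instance: shape (1, 10) here.
scores = svm_clf.decision_function(digits.data[:1])
print(scores.shape)
```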