Machine Learning - A Simplified view

© Gopinath Ramakrishnan, 2018
Dr. Gopinath Ramakrishnan
Independent Consultant
Contact Details: e-mail: gopi@rgopinath.com, LinkedIn: https://in.linkedin.com/in/gopinathr
Part I - The Building Blocks
Introduction
In the coming years it will be extremely difficult to find things that are not powered by Machine
Learning. This article is aimed mainly at lay readers who are curious to know more about
Machine Learning.
To achieve this objective, the treatment of certain concepts has been simplified and may
present only a partial picture of the complex reality underlying the area of Machine Learning.
In the first part of the article we will define Machine Learning and discuss the three basic
building blocks of any Machine Learning system.
What is Machine Learning?
Machine Learning is a process through which a software system gets better at
doing a certain task as it does more and more similar tasks. For example, the spam filtering
feature in your email account gets better at identifying spam messages as it processes
more and more emails. This is very much like a human learning to do a job better by
making use of the knowledge and experience gained from doing similar jobs earlier.
The Building Blocks of Machine Learning
The building blocks of a Machine Learning system are:
a) Data
b) Model
c) Algorithm.
Data
Data is the raw material for Machine Learning. So first of all we need data pertaining to the
problem to be solved.

Here are some examples of data and the problems to be solved:
a) For predicting the sales price of a newly constructed house, the data required would be
the sale prices of several houses that have already been sold, their built up area, number
of rooms in the house, their distance from the city center etc.
b) The data needed for developing an email spam filtering software would be previously
received emails that have already been correctly classified as either spam or not spam.
c) To identify the topics that are trending on social media sites like Twitter or
Facebook, we need as many existing tweets or postings as possible.
Model
A model is a simplified version of a real-world system. It describes or simulates the behavior of
this system. In the Machine Learning context, a model consists of mathematical equations or logical
rules built into a computing system as software programs. Such models are called computational
models.
A model takes the available data as input and produces a solution to a given problem in the form of
predictions or inferences.
The solution to a problem is influenced by many factors. For example, the predicted sale price of a
house may depend on its built up area, the number of rooms it has, its distance from the city
center, etc.
Similarly, a spam email may be characterized by the frequency of occurrence of certain words or
word combinations in its content.
Machine Learning models describe how such factors combine to influence the solution. When
we feed the model with the data that has values of these factors, it generates solutions like sale
price predictions, spam detection etc.
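To make the idea of a model as "mathematical equations" concrete, here is a minimal sketch of a house price model written in Python. The weights below are invented purely for illustration; in a real system they would be learned from data rather than typed in by hand.

```python
# Hypothetical weights (assumptions, not real market figures): how much
# each factor adds to or subtracts from the predicted price.
BASE_PRICE = 50_000.0
PRICE_PER_SQ_FT = 120.0
PRICE_PER_ROOM = 8_000.0
DISCOUNT_PER_KM = 1_500.0


def predict_sale_price(area_sq_ft: float, rooms: int, distance_km: float) -> float:
    """Combine the factors into one predicted sale price."""
    return (
        BASE_PRICE
        + PRICE_PER_SQ_FT * area_sq_ft
        + PRICE_PER_ROOM * rooms
        - DISCOUNT_PER_KM * distance_km
    )


# Feed the model the factor values for one house and read off the prediction.
price = predict_sale_price(area_sq_ft=1_000, rooms=3, distance_km=10)
print(price)
```

The equation is the model; the numbers passed to it are the data.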
Algorithm
How do we ensure that the solutions provided by the models are of high quality and generated
quickly in an efficient manner? This is where the crucial role of Algorithms comes in.
Algorithms are the work-horses that power the Machine Learning process. They reside in the
computing system as software programs that run when a problem needs to be solved.
An algorithm is a sequence of steps designed by experts to:
a) Recognize and interpret the underlying patterns and relationships in the data fed into the
system
b) Build the computational models
c) Train the model, i.e. enable it to find out from the input data to what extent
different factors influence the solution
d) Evaluate the solutions generated by the model
e) Refine the models to provide better solutions

How Machine Learning Happens
Now let us see how Data, Model and Algorithms work together to enable computing systems to
learn and develop artificial intelligence that can be used to solve problems.
We provide the following three examples to illustrate the process of machine learning.
Example 1: Predicting the sales price of a house - The sales price of a house depends on several
factors, for example its built up area, the number of rooms it has, its location and its age (how many
years have passed since it was built). These factors, or any combinations of them, are called
“features”.
We feed the computer with data consisting of the sale prices and the values of features of the
houses already sold. The algorithm first analyzes this data and figures out to what extent each
feature influences the sale prices. In other words it learns the nature of the relationship between
the features and the known sales prices of the sold houses. With this information at its disposal it
constructs a model that can reasonably predict the sales price of any unsold house when the
values of that house’s features are given as an input to the computer.
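The learning step in Example 1 can be sketched in a few lines of Python. This is a deliberately small illustration that uses a single feature (built up area) and the ordinary least squares formulas for a straight line; the four houses and their prices are made-up numbers, and real systems would use many more features and far more data.

```python
# Training data (invented for illustration): built up area in square feet
# and the known sale price of houses that were already sold.
areas = [1000, 1500, 2000, 2500]
prices = [120_000, 170_000, 220_000, 270_000]


def fit_line(xs, ys):
    """Learn a slope and an intercept using ordinary least squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Training": work out how strongly the area influences the price.
price_per_sq_ft, base_price = fit_line(areas, prices)


def predict_price(area):
    """The learned model: one equation relating area to price."""
    return base_price + price_per_sq_ft * area


print(price_per_sq_ft, base_price)
print(predict_price(1800))  # prediction for a house that has not been sold yet
```

Nobody told the program how much a square foot is worth; it worked that out from the houses already sold.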
Example 2: Filtering out spam e-mails - We provide the computer with input data in the form of
previously received e-mails which have already been correctly classified as Spam or Not Spam.
After reading these e-mails and their known classifications, the algorithm identifies those features
of an e-mail that typically characterize spam. Features in this case are generally the frequencies
of certain words or word combinations appearing in the e-mail. The algorithm then builds a
model that can classify a new incoming e-mail as Spam or Not Spam based on the occurrence of such
words or word combinations, and filters out the spam e-mails.
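A toy version of the spam example is sketched below. It simply counts how often each word appears in spam and in non-spam training e-mails and then scores a new e-mail by those counts. The four training e-mails are invented, and the scoring rule is a simplification chosen for readability, not the method used by any real mail provider.

```python
from collections import Counter

# Invented training data: e-mail text with its known classification.
training_emails = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("please review the project report", "not spam"),
]


def train(emails):
    """Count word frequencies separately for spam and non-spam e-mails."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, label in emails:
        target = spam_counts if label == "spam" else ham_counts
        target.update(text.lower().split())
    return spam_counts, ham_counts


def classify(text, spam_counts, ham_counts):
    """Label an e-mail as spam if its words were seen more often in spam."""
    score = 0
    for word in text.lower().split():
        score += spam_counts[word] - ham_counts[word]
    return "spam" if score > 0 else "not spam"


spam_counts, ham_counts = train(training_emails)
print(classify("claim your free prize", spam_counts, ham_counts))
print(classify("agenda for the project meeting", spam_counts, ham_counts))
```

Here the word counts play the role of the features, and the two counters together form the model.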
Example 3: Identifying trending topics in social media - The algorithm reads existing social
media data, such as Facebook postings and tweets from Twitter, and discovers the trending topics. It
then creates a model, which typically consists of decision rules, to classify any new postings or
tweets according to the trending topics.
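As a rough illustration of Example 3, the sketch below treats hashtags as topics, counts them across a handful of invented posts, and then uses a very simple decision rule to assign a new post to a trending topic. Using hashtags as a stand-in for topics is an assumption made to keep the example short; discovering topics from free text is considerably harder.

```python
from collections import Counter

# Invented posts for illustration.
posts = [
    "Great match tonight #cricket",
    "What a finish! #cricket #worldcup",
    "Budget announced today #economy",
    "#cricket fans are celebrating",
    "Markets react to the budget #economy",
    "Trying a new recipe #food",
]


def hashtags(post):
    """Extract lower-cased hashtags from a post."""
    return [w.lower().strip(".,!?") for w in post.split() if w.startswith("#")]


def trending_topics(all_posts, top_n=2):
    """Discover the most frequent hashtags in the existing posts."""
    counts = Counter(tag for post in all_posts for tag in hashtags(post))
    return [tag for tag, _ in counts.most_common(top_n)]


def classify_post(post, topics):
    """Decision rule: assign the post to the first trending hashtag it contains."""
    for tag in hashtags(post):
        if tag in topics:
            return tag
    return None


topics = trending_topics(posts)
print(topics)
print(classify_post("Rain delays play #Cricket", topics))
```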
In the next part of this article, we will discuss the different types of Machine Learning.

Part II - Types of Machine Learning
Introduction
In the first part of the article we mentioned that to enable the model of a system to solve a
problem we must first “train” it to “learn” from the known and available data related to the
problem. This data is called “training data”.
There are two ways a model can be made to learn from the training data:
1) Supervised Learning
2) Unsupervised Learning
Supervised Learning
The objective of Supervised Learning is to learn how to:
a) Predict an outcome from a given set of data. For example, predicting real estate
prices, the winner of a horse race, or the number of runs a cricket or baseball team will score.
or
b) Classify the given set of data into categories pre-determined by humans. For example,
classifying an email as spam or not spam, a tumor in the body as benign or malignant, or a
person entering a restricted area in a factory as authorized or
unauthorized.
The training data used for supervised learning is gathered from the previous instances of the
problems solved. It consists of two components:
a) Known or Labeled Responses
b) Features
Known or Labeled Responses are values of the previous outcomes or classifications that are
already known. For example if we want to predict the sale price of a newly constructed house,
we need the known response data i.e. the sale prices of houses which were already sold.
Another example of Labeled Response is related to the email spam filtering problem where we
need several emails which have already been correctly classified as spam or not spam.
Features are the factors or their combinations that may have impacted the previous outcomes or
classifications. In the case of sale price prediction we need the value of features like the built up
area in square feet, number of rooms, distance from city center etc. for the houses which have
been sold.
For the email spam filtering case, the features we require are the frequencies with which the
words or word combinations that characterize spam or non-spam occur in each email.

Knowing the values of the responses beforehand while the model is learning is similar to a
supervisor showing someone how to solve a problem by revealing the solutions of previously
solved problems of a similar nature.
Hence this type of learning is called Supervised Learning. It is the most commonly used type of
learning in industry.
Unsupervised Learning
The objective of Unsupervised Learning is to discover the hidden patterns and
interrelationships in the given data set and divide it into several clusters. Data points in the same
cluster are more similar to one another than to data points in other clusters.
Some common examples of clustering are: identifying several prevalent themes from a set of
newspaper articles; defining various segments of customers based on their buying patterns;
determining different types of potential hacking patterns on a website.
Unlike Supervised Learning, the training data used for Unsupervised Learning does not contain
known responses to the problem. It contains only the known features of the data points.
There are no classes or labels pre-defined by humans (for example, spam or non-spam)
to which a data point (e.g. an email) can be assigned. The clusters are
automatically created by the machine learning algorithm based on pre-specified similarity
criteria for the data points. In other words, there is no supervisor to guide the model by presenting
it with pre-classified samples of data.
Hence this type of learning is called Unsupervised Learning. This method is mainly used for
initial exploration of data.
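One well-known way of forming clusters automatically is the k-means procedure, sketched below in plain Python. The article does not name a specific algorithm, so k-means is used here only as an example. The six data points are invented, the similarity criterion is straight-line distance, and the starting centres are simply the first k points, which is a simplification.

```python
import math


def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly refining cluster centres."""
    # Assumption: use the first k points as the initial centres.
    centres = [points[i] for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Step 1: assign every point to its nearest centre.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        # Step 2: move each centre to the average of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centres[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignments, centres


# Invented data: each point is (visits per month, items bought per month).
customers = [(1, 2), (9, 8), (2, 1), (8, 9), (1, 1), (9, 9)]
assignments, centres = k_means(customers, k=2)
print(assignments)
print(centres)
```

No labels were supplied: the two customer segments emerge purely from how close the points are to one another.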
The next questions that arise are: How well have the models done their job of predicting the
results, or classifying or clustering the data? How do we measure the quality of the solutions
provided by the models?
In the next part of this article, we will discuss these aspects of Machine Learning.

Part III - Evaluating Machine Learning Models
Introduction
Part I of this article described the building blocks of Machine Learning. In Part II we
discussed the two different ways in which Machine Learning takes place.
The questions that arise now are - how good are the solutions provided by the Machine Learning
models? How do we know how well the models have done their job of predicting the results or
classifying or clustering the data? We need to have some means to measure these aspects.
This final part of the article describes some metrics that are commonly used to measure the
quality of the solutions provided by a model.
Evaluation of Unsupervised Learning Models
We have seen earlier that an Unsupervised Learning model divides the input data set presented to
it into several clusters. The clusters are created on the fly by the algorithm while it discovers the
hidden patterns and interrelationships in the data set.
The quality of clustering is generally measured by two metrics calculated by the algorithm –
Intra-cluster distance and Inter-cluster distance.
Intra-cluster distance quantifies how close (i.e. how similar) the characteristics of the data
points within the same cluster are to one another. Inter-cluster distance measures how distinct
each cluster is with respect to every other cluster created by the algorithm.
A model that produces clusters with relatively small Intra-cluster distances and relatively large
Inter-cluster distances is considered good. This is because, for such a group of clusters, the data
points within a given cluster are very similar to one another and have distinctly different
characteristics from the data points in every other cluster.
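The two distances can be computed in several ways. The sketch below uses one common choice, stated here as an assumption: the Intra-cluster distance is the average distance between every pair of points in a cluster, and the Inter-cluster distance is the distance between the centres of two clusters. The points are invented.

```python
import math
from itertools import combinations


def intra_cluster_distance(cluster):
    """Average distance between every pair of points in the same cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def centre(cluster):
    """The average position of the points in a cluster."""
    return tuple(sum(v) / len(cluster) for v in zip(*cluster))


def inter_cluster_distance(cluster_a, cluster_b):
    """Distance between the centres of two clusters."""
    return math.dist(centre(cluster_a), centre(cluster_b))


# Invented clusters, e.g. as produced by a clustering algorithm.
cluster_a = [(1, 1), (1, 2), (2, 1)]
cluster_b = [(8, 9), (9, 8), (9, 9)]

intra_a = intra_cluster_distance(cluster_a)
intra_b = intra_cluster_distance(cluster_b)
inter_ab = inter_cluster_distance(cluster_a, cluster_b)
print(intra_a, intra_b, inter_ab)
```

For these points the distances within each cluster are far smaller than the distance between the clusters, which is the pattern a good clustering should show.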
For example, consider a model that divides a set of news articles into clusters based on their
themes. In the case of Unsupervised Learning the themes are not pre-specified by humans. Instead
they have to be discovered by the Machine Learning algorithm while creating the model.
Suppose the model creates three clusters which, on inspection by an analyst, are found to
roughly match three themes: politics, sports and business.
If the model is good, the distances among articles on the same theme, say politics, will be small
(Intra-cluster distances), and the distances between articles on two different themes, say politics
and sports, which are in different clusters, will be large (Inter-cluster distances).

Evaluation of Supervised Learning Models
Supervised Learning models are of two types - Prediction Models & Classifiers. We will now
discuss the metrics used to evaluate them.
Prediction Models
One of the commonly used measures of the quality of a prediction model is the Mean Squared
Error (MSE). The algorithm compares the predicted values of the target (e.g. the sale prices of
houses) with the actual values already available from the data set and calculates the MSE.
MSE is the average of the squares of the differences between the predicted values and the actual
values for the data points fed into the model. Models with lower MSEs are considered to be
better.
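The MSE calculation described above can be written out directly. The actual and predicted prices in this sketch are invented for illustration.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Invented sale prices: what the houses really sold for vs. what a model predicted.
actual_prices = [200_000, 150_000, 300_000]
predicted_prices = [210_000, 140_000, 300_000]

mse = mean_squared_error(actual_prices, predicted_prices)
print(mse)
```

Because the differences are squared, a few large errors raise the MSE much more than many small ones.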
Classifiers
In the case of a Classifier, the algorithm compares the model’s classification of data points (e.g.
spam/non-spam email) with the actual classification results available from the data set and
typically calculates three performance metrics: Accuracy, Precision and Recall.
Let us understand these metrics through the example of a Spam Classifier model. Suppose we
have 100 emails in the data set which have already been classified correctly as spam or non-
spam. Say we have 50 spam emails and 50 non-spam emails.
We feed the text of these emails as an input to the Spam Classifier model and observe how it
classifies the emails.
If the model classifies 70 emails correctly as either spam or not spam, we say that its
Accuracy is 70% (i.e. 70/100*100).
Accuracy measures how closely the overall classifications made by the model correspond to the
correct classifications.
Let us assume that out of the 100 emails, the model classifies 40 as spam and 60 as non-spam.
If, out of the 40 emails classified as spam, only 30 are genuinely spam, then the
Precision of the classifier in detecting spam is 75% (i.e. 30/40*100).
Precision measures how many of the items that the model assigns to the relevant class actually
belong to it. The relevant class is the one that we are primarily interested in. In our example, we
are more interested in the classification of emails as spam than in the classification of
emails as non-spam.
As we mentioned earlier, there are actually 50 spam emails in the data set. The classifier in our
example has correctly identified only 30 of them. We say that in this case the
Recall of the classifier is 60% (i.e. 30/50*100).

Recall is the measure of the extent to which all the relevant items in the data set are correctly
classified. In our example all the spam mails actually present in the data set are the relevant
items.
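The three metrics for the spam example can be reproduced from four counts. Three of them follow directly from the numbers above; the fourth, the 40 non-spam emails correctly left alone, is worked out from them (50 non-spam emails minus the 10 wrongly marked as spam). The variable names are our own.

```python
# Counts taken or derived from the 100-email example in the text.
true_positives = 30   # spam correctly classified as spam
false_positives = 10  # non-spam wrongly classified as spam (40 - 30)
false_negatives = 20  # spam wrongly classified as non-spam (50 - 30)
true_negatives = 40   # non-spam correctly classified as non-spam (50 - 10)

total = true_positives + false_positives + false_negatives + true_negatives

accuracy = (true_positives + true_negatives) / total
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print(f"Accuracy:  {accuracy:.0%}")
print(f"Precision: {precision:.0%}")
print(f"Recall:    {recall:.0%}")
```

The results match the 70%, 75% and 60% figures worked out by hand above.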
Accuracy alone can be a misleading measure of a classifier’s performance, especially if
the class distribution in the data set is skewed, i.e. most of the data points belong to one particular
class (for example, if only 1% of the emails are spam). In that case a classifier that labels every
email as non-spam would be 99% accurate without detecting any spam at all.
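The effect of a skewed data set is easy to demonstrate. The sketch below assumes a hypothetical set of 1,000 emails of which 1% are spam, and a deliberately useless classifier that marks every email as non-spam.

```python
# Hypothetical skewed data set: 10 spam emails out of 1,000 (1%).
actual = ["spam"] * 10 + ["not spam"] * 990

# A useless classifier that never flags anything as spam.
predicted = ["not spam"] * len(actual)

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

spam_found = sum(a == "spam" and p == "spam" for a, p in zip(actual, predicted))
recall = spam_found / actual.count("spam")

print(f"Accuracy: {accuracy:.0%}")
print(f"Recall:   {recall:.0%}")
```

The Accuracy looks excellent while the Recall shows that not a single spam email was caught.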
Precision and Recall are generally used to evaluate the classifier’s performance.
The type of problem being solved determines which one is a better indicator of the classifier’s
effectiveness.
For mission-critical situations like tumor detection, it is extremely important to identify all the
positive cases of tumor, even at the cost of misclassifying non-tumors as tumors. In these cases
the classifier model must have very high Recall.
In cases like a product website where the objective of the classifier is to automatically select and
display mostly positive customer reviews, it is more important to avoid classifying
negative reviews as positive, even if that means some positive reviews are misclassified as
negative and are not displayed on the website. Here the classifier model should have
very high Precision.
Generally, high Recall is associated with low Precision and vice versa. In real-life
situations there is a trade-off between Precision and Recall, and the balance chosen depends on the
prime objective of the classifier.
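One way to see this trade-off is to assume that the classifier gives each email a spam score between 0 and 1 and flags it as spam when the score reaches a chosen threshold. This scoring setup and the eight scored emails below are assumptions made for illustration; they are not part of the example used earlier.

```python
# Invented data: (spam score given by a hypothetical classifier, is it really spam?)
scored_emails = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, True), (0.40, False), (0.30, True), (0.20, False),
]


def precision_and_recall(scored, threshold):
    """Flag emails scoring at or above the threshold, then measure the result."""
    hits = sum(1 for score, is_spam in scored if score >= threshold and is_spam)
    flagged = sum(1 for score, _ in scored if score >= threshold)
    actual_spam = sum(1 for _, is_spam in scored if is_spam)
    precision = hits / flagged if flagged else 0.0
    recall = hits / actual_spam if actual_spam else 0.0
    return precision, recall


for threshold in (0.85, 0.50, 0.25):
    p, r = precision_and_recall(scored_emails, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

With these numbers, lowering the threshold catches more of the real spam (higher Recall) but also flags more genuine emails by mistake (lower Precision).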
Closing Remarks
Through this 3-part article we have attempted to explain the basic concepts of
Machine Learning to lay readers.
We intend to take a deeper look at some of the finer aspects of Machine Learning and Data Science in
forthcoming articles.
