AN INTRODUCTORY REVIEW OF MACHINE LEARNING ALGORITHMS AND THEIR APPLICATION TO DATA MINING
IMPERIAL COLLEGE LONDON
AN INTRODUCTORY REVIEW OF
MACHINE LEARNING ALGORITHMS AND
THEIR APPLICATION TO DATA MINING
DEPARTMENT OF MECHANICAL ENGINEERING
GUY RIESE
19/12/2014
Abstract
This review aims to provide an introduction to machine learning by reviewing literature on the
subject of supervised and unsupervised machine learning algorithms, development of
applications and data mining. In supervised learning, the focus is on regression and
classification approaches. ID3, bagging, boosting and random forests are explored in detail. In
unsupervised learning, hierarchical and K-means clustering are studied. The development of a
machine learning application starts by collecting and preparing data, then choosing and training an algorithm, and finally using the resulting application. Large data sets are on the rise with
growing use of the World Wide Web, opening up opportunities in data mining where it is
possible to extract knowledge from raw data. It is found that machine learning has a vast range
of applications in everyday life and industry. The elementary introduction provided by this
review offers the reader a sound foundational basis with which to begin experimentation and
exploration of machine learning applications in more depth.
Contents
1 Introduction
1.1 Objectives
2 Supervised Machine Learning Algorithms
2.1 Regression
2.2 Classification Decision Tree Learning
2.2.1 ID3
2.2.2 Bagging and Boosting
2.2.3 Random Forests
3 Unsupervised Machine Learning Algorithms
3.1 Clustering
3.1.1 Hierarchical Clustering: Agglomerative and Divisive
3.1.2 K-means
4 Steps in developing a machine learning application
4.1 Collect Data
4.2 Choose Algorithm
4.3 Prepare Data
4.4 Train Algorithm
4.5 Verify Results
4.6 Use Application
5 Data Mining
6 Discussion
6.1 Literature
6.2 Future Developments
7 Conclusion
8 References
9 Acknowledgements
1 Introduction
Computers solve problems using algorithms: step-by-step instructions that the computer follows sequentially to process a set of inputs into a set of outputs. These algorithms are typically written line by line by computer programmers. But what if we don't have the expertise or fundamental understanding needed to write the algorithm for a program?
For example, consider filtering spam emails from genuine emails (Alpaydin, 2010). For this
problem, we know the input (an email) and the output (identifying it as spam or genuine) but
we don't know what actually classifies it as a spam email. This lack of understanding often arises
when there is some intellectual human involvement in the problem we are trying to solve. In
this example, the human involvement is that a human wrote the original spam email.
Similarly, humans are involved in handwriting recognition, natural language processing and
facial recognition. It is clear that these problems are something that our subconscious is able to
handle effortlessly yet we don't consciously understand the fundamentals of the process. For
sequential logical tasks, like sorting a list alphabetically, we consciously understand the
fundamental process and therefore can program a solution (algorithm). But this isn't possible for more complex tasks where the process is more of an unknown 'black box'.
Machine learning is what gives us the tools to solve these 'black box' problems. "What we lack in knowledge, we make up for in data" (Alpaydin, 2010). Using the spam example, we can use a data set of millions of emails, some of which are spam, in order to 'learn' what defines a spam email. The learning principles are derived from statistical approaches to data analysis. In this way, we do not need to understand the process but we can construct an accurate and functional model (a 'black box') to approximate the process. Whilst this doesn't explain the fundamental
process, it can identify some patterns and regularities that allow us to reach solutions.
Artificial intelligence was conceived in the mid-20th
century but it was not until the 1980s that
the more statistical branch, machine learning, began to separate off and become a field in its
own right (Russell, 2010). Machine learning developed a scientific approach to solving problems
of prediction and finding patterns in data. This quickly proved valuable in industry, which fuelled further academic exploration. Entering the 21st century, we have seen a rapid rise in the popularity of machine learning. This is largely due to the emergence of large data sets and the demand for
data mining processes to extract knowledge from them. Machine learning has since established
itself as a leading field of computer science with applications ranging from detecting credit card
fraud to medical diagnosis.
Data mining is the process of mining data in order to extract knowledge (Kamber, 2000). With
the rise of large data sets ('big data'), data mining has thrived. Data mining tasks can be categorised as either descriptive or predictive. A descriptive task involves extracting qualitative characteristics of data: for example, segmenting a database of customers into groups in order to find trends within those groups. A predictive
task involves using the existing data to be able to make predictions on future data inputs. For
example, how can we learn from our existing customers which products might be favoured by
a new customer?
Machine learning is a vast subject with masses of literature. One of the main challenges in
understanding machine learning is knowing where to start. This review will introduce the two
main approaches of machine learning: supervised and unsupervised learning. We consider some
of the more generalist and flexible machine learning algorithms in these categories relevant to
data mining and introduce some methods of optimising them. Additionally, this review will
indicate the steps to develop a machine learning application to solve a specific problem. Finally
we relate this theory and practical understanding to the application of data mining. With this
knowledge, the reader will have a strong machine learning foundation to enable them to
approach problems and interpret relevant research themselves.
1.1 Objectives
1. Understand the background of Machine Learning. What are some of the
key approaches and applications?
2. Understand some of the different mechanisms behind Machine Learning processes.
3. Explore machine learning algorithms and the decision making process of a machine
learning program.
4. How do you develop a machine learning application?
5. Case/Application Focus: Investigate machine learning in relation to data mining.
6. Briefly discuss key areas for future development of this technology.
2 Supervised Machine Learning Algorithms
The aim of a supervised machine learning algorithm is to learn how inputs relate to outputs in
a data set and thereby produce a model able to map new inputs to inferred outputs (Ayodele,
2010). Therefore, a complete set of training data is a prerequisite for any supervised learning task.
A general equation for this can be defined as follows (Alpaydin, 2010):
y = h(x | θ)    Eq. 2.1
Where the output y is given by the function h, which is a function of the inputs x and the parameters θ. The role of the supervised machine learning algorithm is to optimise the parameters (θ) by minimising the approximation error, thereby producing the most accurate outputs.
In layman's terms, this means that existing 'right answers' are used to predict new answers to the problem; the algorithm learns from examples (Russell, 2010). We are explicitly telling the algorithm what we want to know and actively training it to be able to solve our problem. Supervised learning consists of two fundamental stages: i) training and ii) prediction.
Building a bird classification system is a problem that can be solved with a supervised machine
learning algorithm (Harrington, 2012). Start by taking characteristics of the object you are trying
to classify, called features or attributes. For a bird classification system, these could be weight,
wingspan, whether its feet are webbed and the colour of its back. In reality, you can use far more than just four features (Ng, 2014). The features can be of different types. In this example, weight and wingspan are numeric (decimal), whether the feet are webbed is simply yes or no (binary), and if you choose a selection of, say, 7 different colours then each 'back colour' would just be an integer. According to Eq. 2.1, we want to find a function (h) which we can use
to determine the bird species (y) given inputs of particular features (x). To achieve this, we require training data (i.e. data on the weight, wingspan, etc. of a number of bird species). The training data is used (stage (i)) to determine the parameters (θ) which can be used to define a function h. It's unlikely this will be perfectly accurate, so we can compare the outputs from our function on a test set (where we secretly already know the true outputs) in order to measure the accuracy. Provided the function is accurate, we can use our model to predict bird species given new inputs of weight, wingspan etc., perhaps entered by users trying to identify a bird (stage (ii)).
This example is extremely simplistic and leaves many questions unanswered: how do we choose the features, how do we reach a definition for the model/function h, how do we optimise our algorithm for maximum accuracy, and how do we deal with imperfect training data (noise)? The sections which follow will seek to answer these questions. Regression and
classification are both supervised learning tasks where a model is defined with a set of
parameters. A regression solution is appropriate when the output is continuous, whereas a
classification solution is used for discrete outputs (Ng, 2014; Harrington, 2012).
2.1 Regression
In regression analysis the output is a random variable (y) and the input the independent variable (x). We seek to find the dependence of y on x. The mean dependence of y on x will give us the function and model (h) that we are seeking to define (Kreyszig, 2006). The most basic form of
regression using just one independent variable is called univariate linear regression. This can be
used to produce a straight line function:
โ„Ž(๐‘ฅ) = ๐œƒ0 + ๐œƒ1 ๐‘ฅ Eq. 2.2
By finding ๐œƒ0 and ๐œƒ1 it is therefore possible to fully define the model. In seeking to choose ๐œƒ0
and ๐œƒ1 so that โ„Ž is as close to our (๐‘ฅ,๐‘ฆ) values as possible, we must minimise the Gauss function
of squared errors (Stigler, 1981; Freitas, 2013; Beyad & Maeder, 2013):
๐ฝ(๐œƒ0, ๐œƒ1) = โˆ‘(โ„Ž(๐‘ฅ๐‘–) โˆ’ ๐‘ฆ๐‘–)2
๐‘›
๐‘–=1
Eq. 2.3
To minimise this function, we can apply the
gradient descent algorithm known as the method
of steepest descent (Ng, 2014; Bartholomew-
Biggs, 2008; Kreyszig, 2006; Snyman, 2005;
Akaike, 1974). Gradient descent is a numerical
method used to minimise a multivariable
function by iterating away from a point along the
direction which causes the largest decrease in the
function (the direction with the most negative
gradient or โ€˜downwards steepnessโ€™).
The equation for gradient descent is as follows:
๐œƒ๐‘— = ๐œƒ๐‘— โˆ’ ๐›ผ
๐œ•
๐œ•๐œƒ๐‘—
๐ฝ(๐œƒ0, ๐œƒ1) Eq. 2.4
Figure 2.1. Gradient descent. (Kreyszig, 2006)
Where j = 0, 1 for this case of two unknowns. α is the step size taken and is known as the learning rate. The value of the learning rate determines a) whether gradient descent converges to the minimum or not and b) how quickly it converges. If the learning rate is too small, gradient descent can be slow. On the other hand, if the learning rate is too large, the steps taken may be too large, resulting in overshoot and missing of the minimum. Figure 2.1 illustrates gradient descent from a starting point of x₀ = θ_j^(0), iterating to x₁ = θ_j^(1) and x₂ = θ_j^(2). Eventually this will reach the minimum, which lies at the centre of the innermost circle.
An analogy to gradient descent is the idea of walking down a hillside in a valley surrounded by thick fog. The aim is to get to the bottom of the valley. Even though you cannot see where the bottom of the valley is, as long as each step you take slopes downwards, you will eventually reach a low point (although, for functions with several dips, this may be a local rather than the global minimum).
Gradient descent is not the fastest minimisation method; however, it offers a distinct approach which is used repeatedly in many machine learning optimisation problems. Furthermore, it scales well with larger data sets (Ng, 2014), which is a significant factor in real-life applications. Sub-gradient projection is a possible alternative to the descent method; however, it is typically slower than gradient descent (Kiwiel, 2001). With an appropriate learning rate, gradient descent
serves as a reliable and effective tool for minimisation problems.
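To illustrate the procedure, here is a minimal sketch of batch gradient descent for univariate linear regression; the data and learning rate are invented for the example and are not taken from any of the cited sources:

import numpy as np

# Invented training data assumed to be roughly linear.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

theta0, theta1 = 0.0, 0.0   # initial parameters
alpha = 0.01                # learning rate (too large a value would overshoot)

for _ in range(5000):
    h = theta0 + theta1 * x              # current hypothesis h(x), Eq. 2.2
    error = h - y
    # Partial derivatives of the squared-error cost J (Eq. 2.3)
    grad0 = 2 * error.sum()
    grad1 = 2 * (error * x).sum()
    # Simultaneous update of both parameters (Eq. 2.4)
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)   # approaches the least-squares fit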
Hence by finding values for the parameters (θ_j) we are able to find an equation for the model (h). If this model can predict values of y for novel examples, we say that it 'generalises' well (Russell, 2010). In this example, we have applied only linear regression (a 1-degree polynomial). It is possible to increase the hypothesis (h) to a polynomial of a higher degree, whereby the fit is
more accurate (curved). However, as you increase the degree of the polynomial, you increase
the risk of over-fitting the data; there is a balance to be reached between fitting the training
data well and producing a model that generalises the data better (Sharma, Aiken & Nori, 2014).
The main approach for dealing with this problem is to use the principle of Ockham's razor: use
the simplest hypothesis consistent with the data (Allaby, 2010). For example, a 1-degree
polynomial is simpler than a 7-degree polynomial, so although the latter may fit training data
better, the former should be preferred. It is possible to further simplify models by reducing the
number of features being considered. This is achieved by discarding features which do not
appear relevant (Ng, 2014; Russell, 2010).
Regression is a simple yet powerful tool which can be used to teach a program to understand
data inputs and accurately predict data outputs through machine learning processes.
2.2 Classification Decision Tree Learning
Decision trees are a flowchart-like
method of classifying a set of data
inputs. The input is a vector of features
and the output is a single and unified
โ€˜decisionโ€™ (Russell, 2010). This means
that the output is binary; it can either
be true (1) or false (0). A decision tree
performs a number of tests on the data
by asking questions about the input in
order to filter and categorise it. This is
a natural way to model how the human
brain thinks through solving
problems; many troubleshooting tools
and โ€œHow-Toโ€ manuals are structured
like decision trees. It begins at the root
node, extends down branches through nodes of classification tests (decision nodes) and finally
ends at a node representing a โ€˜leafโ€™ (terminal nodes) (Criminisi & Shotton, 2013). The aim is to
develop a decision tree using training data which can then be used to interpret and classify novel
data for which the classification is unknown.
The first step in the decision tree learning process is to induce or 'grow' a decision tree from
initial training data. We take input features/attributes and transform these into a decision tree
based on provided example outputs in training data. In the example in Figure 2.2, the features are
Patrons (how many people are currently sitting in the restaurant), WaitEstimate (the wait
estimated by the front of house), Alternate (whether there is another restaurant option nearby),
Hungry (whether the customer is already hungry) and so on. The output is a decision on whether to wait for a table or not. The decision tree learning algorithm employs a 'greedy' strategy of
testing the most divisive attribute first (Russell, 2010). Each test divides the problem up further
into sub-problems which will eventually classify the data. It is important that the training data
set is as complete as possible in order to prevent decision trees being induced with mistakes. If
the algorithm does not have an example for a particular scenario (e.g. WaitTime of 0-10 minutes
when Patrons is full) then it could output a tree which consistently makes the wrong decision
for this scenario.
One of the mathematical ways in which decision tree divisions are quantifiably scored is with
the measure of Information Gain (InfoGain) (Myles et al., 2004; Mingers, 1989). InfoGain is a mathematical tool for measuring how effectively a decision node divides the example data. This is based on the concept of information (Info) defined by Eq. 2.5 (Myles et al., 2004):

Info = − Σ_j (N_j(t) / N(t)) log₂(N_j(t) / N(t))    Eq. 2.5
Where ๐‘๐‘—(๐‘ก) is number of examples in category ๐‘— at the node ๐‘ก and ๐‘(๐‘ก) is the number of
examples at the node ๐‘ก. The maximum change in information by being processed by a decision
node is defined by Eq. 2.6 (Myles et al., 2004):
๐ผ๐‘›๐‘“๐‘œ๐บ๐‘Ž๐‘–๐‘› = ๐ผ๐‘›๐‘“๐‘œ(๐‘ƒ๐‘Ž๐‘Ÿ๐‘’๐‘›๐‘ก) โˆ’ โˆ‘(๐‘ ๐‘˜)๐ผ๐‘›๐‘“๐‘œ(๐ถโ„Ž๐‘–๐‘™๐‘‘ ๐‘˜)
๐‘˜
Eq. 2.6
Figure 2.2. A decision tree for deciding whether to wait for a table.
(Russell, 2010)
Where ๐‘ ๐‘˜ is the proportion of examples that are filtered into the ๐‘˜th category. The optimal
decision node is therefore the node which maximises this โ€˜change in informationโ€™.
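As an illustration, a short sketch (with invented example counts) of how Eq. 2.5 and Eq. 2.6 could be computed for a candidate split:

import numpy as np

def info(counts):
    # Information (Eq. 2.5) of a node, from its per-category example counts.
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]                          # treat 0 * log2(0) as 0
    return -(p * np.log2(p)).sum()

def info_gain(parent_counts, children_counts):
    # Information gain (Eq. 2.6): parent information minus weighted child information.
    n = sum(sum(c) for c in children_counts)
    weighted = sum((sum(c) / n) * info(c) for c in children_counts)
    return info(parent_counts) - weighted

# Invented example: a parent node with 6 'wait' and 6 'leave' examples,
# divided by a candidate attribute into three child nodes.
print(info_gain([6, 6], [[4, 0], [2, 2], [0, 4]]))   # about 0.667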
Despite this quantification, there are usually several decision trees which are capable of
classifying the data. To choose the optimal decision tree, inductive bias is employed (Mitchell,
1997). The inductive bias depends on the particular type of decision tree algorithm and will be
explored in Section 2.2.1.
Once a decision tree has been grown, the decision tree algorithm may prune the tree (Russell,
2010; Myles et al., 2004). This combats overfitting whilst dealing with noisy data by removing
irrelevant decision nodes (Quinlan, 1986). The algorithm must also separately identify and remove features which do not aid the division of examples. The statistical method employed for this, supported by both Quinlan (1986) and Russell (2010), is the chi-squared significance test, known as chi-squared pruning. The data is analysed under the null hypothesis of 'no underlying pattern': the degree to which an observed split deviates from what would be expected by chance is calculated, and a cut-off of, say, 5% significance is applied. In this way, noise in the training data is handled and the tree design is optimised.
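A rough sketch of how such a significance test might be applied to one candidate split using SciPy; this is an illustration of the idea rather than the exact procedure used in the cited works:

from scipy.stats import chi2_contingency

# Invented contingency table: each row is a child of the candidate split,
# each column a class count ('wait', 'leave') landing in that child.
observed = [[8, 7],
            [6, 9],
            [7, 8]]

chi2, p_value, dof, expected = chi2_contingency(observed)

# Under the null hypothesis of 'no underlying pattern', a large p-value suggests the
# split's apparent gain could be due to chance, so the node is a pruning candidate.
print("prune" if p_value > 0.05 else "keep", round(p_value, 3))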
Multiple decision tree algorithms exist, exhibiting a variety of approaches. However, the most
effective use of them is to combine their methodology into an ensemble algorithm in order to
obtain better predictive performance than any of the individual algorithms alone. Section 2.2.1
will explore the ID3 decision tree learning algorithm which aims to induce the simplest possible
tree. Sections 2.2.2 and 2.2.3 explore some ensemble methods to machine learning.
2.2.1 ID3
The majority of classification decision tree learning algorithms are variations on an original
central methodology first proposed as the ID3 algorithm (Quinlan, 1986) and later refined to
the C4.5 algorithm (Quinlan, 1993). The characteristics of decision tree algorithms discussed
previously apply to ID3, but it has some subtleties and limitations too. One of these is that
pruning does not apply to ID3 as it does not re-evaluate decision tree solutions after it has
selected one.
Instead, the approach taken by the ID3 algorithm is to iterate with a top-down greedy search through the possible decision tree outputs, starting from the simplest possible solution and gradually increasing complexity until the first valid solution is reached. Each decision tree output is known as a hypothesis; the hypotheses are effectively different possible solutions for the model or function h. This unidirectional approach
works to reach a consistently satisfactory decision
tree without expensive computation (Quinlan,
1986). However, it implies the algorithm never
backtracks to reconsider earlier choices (Mitchell,
1997). The core decision making lies in deciding
which attribute makes the optimal decision node at
each point. This is solved using the statistical
property ๐ผ๐‘›๐‘“๐‘œ๐บ๐‘Ž๐‘–๐‘› discussed earlier. ID3โ€™s approach
is known as a hill-climbing search, starting with
Figure 2.3. Searching through decision tree
hypotheses from simplest to increasing complexity
as directed by information gain. (Mitchell, 1997)
AN INTRODUCTORY REVIEW OF MACHINE LEARNING ALGORITHMS AND THEIR APPLICATION TO DATA MINING
7
empty space and building a decision tree from the top down.
This approach has advantages and disadvantages (Mitchell, 1997; Quinlan, 1986). It can be
considered a positive capability that ID3 in theory considers all possible decision tree
permutations. Some other algorithms take a major risk of evaluating only a portion of the search
space in order to leverage greater speed, but this can lead to inaccuracy. On the other hand, a
problem in ID3 is the 'goldfish memory' approach of only considering the current decision tree hypothesis at any one time. This means that it does not actually consider how many viable alternative decision trees there are; it simply picks the first it reaches, which leaves no scope for post-selection pruning. We consider ID3 an important algorithm to understand because it serves as a core algorithm from which many extensions have developed. It can easily be modified to utilise pruning and handle noisy data, as well as being optimised for less common conditions.
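To make the top-down, information-gain-driven construction concrete, the following is a compact, simplified sketch of an ID3-style learner for discrete attributes (illustrative toy data; not Quinlan's original implementation):

import math
from collections import Counter

def entropy(labels):
    # Information of a node (Eq. 2.5), computed from its example labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    # Greedy step: choose the attribute with the highest information gain (Eq. 2.6).
    def gain(a):
        values = Counter(row[a] for row in rows)
        remainder = sum((count / len(rows)) *
                        entropy([l for row, l in zip(rows, labels) if row[a] == v])
                        for v, count in values.items())
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def id3(rows, labels, attributes):
    if len(set(labels)) == 1:                       # pure node: stop
        return labels[0]
    if not attributes:                              # no tests left: majority vote
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)
    tree = {a: {}}
    for v in set(row[a] for row in rows):
        subset = [(row, l) for row, l in zip(rows, labels) if row[a] == v]
        tree[a][v] = id3([r for r, _ in subset], [l for _, l in subset],
                         [x for x in attributes if x != a])
    return tree

# Invented toy data loosely echoing the restaurant example.
rows = [{'Patrons': 'Full', 'Hungry': 'Yes'}, {'Patrons': 'Some', 'Hungry': 'No'},
        {'Patrons': 'None', 'Hungry': 'No'}, {'Patrons': 'Full', 'Hungry': 'No'}]
labels = ['Wait', 'Wait', 'Leave', 'Leave']
print(id3(rows, labels, ['Patrons', 'Hungry']))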
It is important to consider why ID3's inductive bias towards simpler decision trees is optimal. Ockham's razor (Allaby, 2010) advises giving preference to the simplest hypothesis that fits the data. But stating this does not make it optimal. Why is the simplest solution the best choice? It can be argued that scientists tend to follow this bias, possibly because there are far fewer simple hypotheses than complex ones, so a simple hypothesis that fits the data is less likely to do so by coincidence and more likely to be an accurate generalisation (what we aim to reach in machine learning) (Mitchell, 1997).
Also, there is evidence that this approach will be consistently faster at reaching the solution due
to only considering a portion of the data set (Quinlan, 1986). On the other hand, there are
contradictions in this approach. It is entirely possible to obtain two different solutions with the
exact same data by taking this approach, simply if the iterations by ID3 take two different paths.
This is likely to be acceptable in most applications but may be a crucial complication for others
(Mitchell, 1997).
The C4.5 algorithm (Quinlan, 1993) extended the original ID3 algorithm with increased
computational efficiency, ability to handle training data with missing attributes, ability to
handle continuous attributes (rather than just discrete) and various other improvements. One
of the most significant modifications allowed a new approach to determining the optimal
decision tree solution. Choosing the first simple valid solution can be problematic if there is
noise in the data. This is solved by allowing production of trees which overfit the data and then
pruning them post-induction. Despite a longer sounding process, this new solution was found
to be more successful in practice (Mitchell, 1997).
The ID3 algorithm can be considered a basic but effective algorithm for building decision trees.
With refinement to the C4.5 algorithm, it is competent at producing an adequate solution
without requiring vast computing resources. For this reason, it is extremely well supported and
commonly implemented across numerous programming languages. It is considered a highly
credible algorithm used in engineering (Shao et al., 2001), aviation (Yan, Zhu & Qiang, 2007)
and wherever automated or optimal decision making processes are required.
2.2.2 Bagging and Boosting
Bagging and boosting are ensemble techniques which means they use multiple learning
algorithms to improve the overall performance of the machine learning system (Banfield et al.,
2007). In decision tree learning, this helps to produce the optimal decision tree (rather than just
a valid one). The optimal decision tree is one that has the lowest error rate in predicting outputs
y for data inputs x (Dietterich, 2000a). Bagging and boosting improve the performance by
manipulating the training data before it is fed into the algorithm.
Bagging is an abbreviation of 'Bootstrap AGGregatING' (Pino-Mejas et al., 2004) and was first developed by Leo Breiman in 1994. Bagging takes subset samples from the full training set to produce groups of training sets called "bags" (Breiman, 1996). The key methodology of bagging is to draw m examples, with replacement, from the original training set. Each bag ends up containing approximately 63.2% of the original training set (Dietterich, 2000a).
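A minimal sketch (with an invented data set size) of drawing one bootstrap 'bag' and checking what fraction of the original examples it contains:

import numpy as np

rng = np.random.default_rng(0)
n = 10000                                   # size of the (invented) training set

# One bag: n examples drawn with replacement from the original training set.
bag = rng.choice(np.arange(n), size=n, replace=True)

print(np.unique(bag).size / n)              # about 0.632, i.e. roughly 63.2% of the examples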
Boosting was first developed by Freund and Schapire in 1995 and similarly manipulates the
example training data in order to improve performance of the decision tree learning algorithm
(Freund & Schapire, 1996; Freund & Schapire, 1995; Freund, 1995). The key differentiator of
boosting is that it assigns a weight to each example, reflecting how poorly that example is currently being predicted (Banfield et al., 2007). Misclassified examples are given an incrementally greater weighting in each iteration of the algorithm. In subsequent iterations, the algorithm focuses on examples with a greater weighting (favouring examples which are harder to classify over those which are consistently classified correctly).
Breiman (1996) identified that bagging improves the performance of unstable learning
algorithms but tends to reduce the performance of more stable algorithms. Decision tree
learning algorithms, neural networks and rule learning algorithms are all unstable, whereas
linear regression and K-nearest neighbour (Larose, 2005) algorithms are very stable. The
improvements offered by bagging and boosting are therefore very relevant to decision tree
learning. But why do bagging and boosting improve the performance of unstable algorithms
whilst degrading stable ones? The main components of error in machine learning algorithms
can be summarised as noise, bias and variance. An unstable learning algorithm is one where
small changes in the training data cause significant fluctuation in the response of the algorithm
(i.e. high variance) (Dietterich, 2000a). In both bagging and boosting, the training data set is perturbed and the resulting classifiers are combined, which reduces the variance of the final model and hence makes the algorithm more stable (Skurichina & Duin, 1998). The effect of this is to shift the focus of the algorithm to the most relevant regions of the training data. On the other hand, combining perturbed versions of an already stable model makes little difference, except that fewer examples are considered to reach the same solution.
A machine learning algorithm is considered accurate if it produces a model h with an accuracy greater than 1/2 (i.e. the decision tree results in greater accuracy than if each decision made was a 50/50 split). Algorithms are tested to this limit by adding noise to the training data. Noisy data is training data which contains mislabelled examples. Noise is problematic for boosting and has been shown to considerably reduce its classification performance (Long & Servedio, 2009;
Dietterich, 2000b; Dietterich, 2000a; Freund & Schapire, 1996). This poor performance is
intuitive due to the fact that the boosting method converges to harder to classify data.
Mislabelled data is obviously the hardest to classify and fruitless to focus on, hence the fatal flaw
of boosting. Critically, Long & Servedio (2009) showed that the most common boosting algorithms, such as AdaBoost and LogitBoost, reduced to accuracies of less than 1/2 for high-noise data, rendering them meaningless. Conversely, when directly comparing the effectiveness of bagging and boosting methods, Dietterich (2000b) found that bagging was "clearly" the best method. Bagging actually uses the noise to generate a more diverse collection of decision tree hypotheses, and therefore introducing noise to the training data can even improve accuracy. However,
experimental results have shown that when there is no noise in the training data, boosting gives
the best results (Banfield et al., 2007; Lemmens & Croux, 2006; Dietterich, 2000b; Freund &
Schapire, 1996). In conclusion, when deciding between machine learning algorithms, an important factor to consider is confidence in the consistency of the training data being provided. Boosting is ideal when the training data is clean, but bagging is more consistent.
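As a rough illustration of this trade-off, the sketch below trains a bagged and a boosted ensemble on synthetic data whose training labels have been artificially flipped; the exact scores are illustrative only and depend on the invented noise level:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Add label noise: flip 10% of the training labels (mislabelled examples).
rng = np.random.default_rng(0)
noisy = y_train.copy()
flip = rng.random(len(noisy)) < 0.10
noisy[flip] = 1 - noisy[flip]

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    model.fit(X_train, noisy)
    print(name, model.score(X_test, y_test))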
A possible solution to this dilemma is explored by Alfaro, Gamez & Garcia (2013) where features
of both bagging and boosting are combined in the design of a new classification decision tree
learning algorithm: adabag. The common goal of both bagging and boosting is to improve
accuracy by modifying the training data. Based on the AdaBoost algorithm, adabag allows analysis of the error as the ensemble is grown, reducing the problem with noise.
Bagging and boosting are effective techniques for improving the predictive performance of
machine learning algorithms when applied to decision tree learning. By generating an ensemble
of decision trees and finding the optimal hypothesis analytically, accuracy is increased.
2.2.3 Random Forests
Random forests are another ensemble learning technique used to improve the performance of
algorithms in decision tree learning. The algorithm was originally developed by Breiman (2001), who trademarked the term. Random forests was an improvement on his previous technique, bagging (Breiman, 1996). Instead of choosing one optimal decision tree, random forests uses multiple trees and takes the modal (majority-vote) hypothesis as the result.
Although there is no single best algorithm for every situation (Wolpert & Macready, 1997),
random forests has proved to be a general top performer without requirements for tuning or
adjustment and notably outperforms both bagging and boosting on accuracy and speed
(Banfield et al., 2007; Svetnik et al., 2003; Breiman, 2001).
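A brief sketch of how easily a random forest can be applied with default settings (scikit-learn, with synthetic data standing in for a real problem):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# A forest of 100 trees with no tuning; each tree votes and the mode is taken.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())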
Breiman (2001) found that random forests favourably shares the noise-proof properties of
bagging. When compared against AdaBoost, random forests showed little deterioration with 5%
noise whereas AdaBoost's performance dropped markedly. This is because the Random Forest technique does not increase weights on specific subsets and so the increased noise has negligible effect, whilst AdaBoost's convergence to mislabelled examples causes its accuracy to spiral.
This being said, there is always room for improvement and the Random Forest technique is by
no means perfect. The mechanism of up-voting by decision trees in the random forests is one
possible area for improvement (Robnik-Sikonja, 2004). Margin is a measure of how much a
particular hypothesis is favoured over other hypotheses from the Random Forest decision trees.
By weighting each hypothesis vote with the margin, Robnik-Sikonja (2004) found the prediction
accuracy of random forests improves significantly.
Decision trees are a natural choice in the development of machine learning programs. Within
decision trees, there are a number of different algorithms and techniques, including the ones explored here plus others such as CART, CHAID and MARS. Decision trees are important because they perform well with large data sets and are intuitive to use. Furthermore, techniques such
as random forests can improve the robustness, accuracy and speed of the learning method.
3 Unsupervised Machine Learning Algorithms
In unsupervised learning, the onus of learning falls even more heavily on the computer program than on the developer. Whereas in supervised learning you have a full set of inputs and outputs in your data, in unsupervised learning you only have inputs. The machine learning algorithm must use this input data alone to extract knowledge. In statistics, the equivalent problem is known as density estimation: finding any underlying structure in the unlabelled data (Alpaydin, 2010).
3.1 Clustering
The main unsupervised learning method is clustering: finding groups within the input data set.
For example, a company may want to group their current customers in order to target groups
with relevant new products and services. To do this, the company could take their database of
customers and use an unsupervised clustering algorithm to divide it into customer segments.
The company can then use the results to build better relationships with its customers. In addition to identifying groups, the algorithm will identify outliers who sit outside of these groups. These outliers might reveal a niche that wouldn't otherwise have been noticed.
There are over 100 published clustering algorithms. This review will focus on the two most widely used approaches to clustering: hierarchical clustering and K-means clustering.
3.1.1 Hierarchical Clustering: Agglomerative and Divisive
As the name suggests, hierarchical clustering builds clusters in hierarchies. Each level of clusters in the hierarchy is a combination of the clusters below it, whereby the 'clusters' at the bottom of the hierarchy are single observations and the top cluster contains the entire data set (Hastie, 2009).
Hierarchical clustering is split into two sub-approaches:
agglomerative (bottom-up) and divisive (top-down) as in
Figure 3.1. In the agglomerative approach, clusters start out
as individual data inputs and are merged into larger
clusters until one cluster containing all the inputs is
reached. Divisive is the reverse, starting with the cluster containing all data inputs and subdividing into smaller clusters until individual inputs are reached or a termination condition is met, such as the distance between the two closest clusters exceeding a certain amount (Kamber, 2000). The most common form of hierarchical
clustering is agglomerative. Dendrograms
provide a highly comprehensible way of
interpreting the structure of a hierarchical
clustering algorithm in a graphical format as
illustrated in Figure 3.2.
Agglomerative hierarchical methods are
broken down into single-link methods,
complete-link methods, centroid methods and
more. The difference between these methods is
how the distance between clusters/groups is
measured.
The single-link method, also known as nearest neighbour clustering (Rohlf, 1982), can be
defined by the following distance linkage function D (Gan, 2007):

D(C, C′) = min_{x ∈ C, y ∈ C′} d(x, y)    Eq. 3.1

Where C and C′ are two nonempty and non-overlapping clusters. The Euclidean distance (Gan, 2007) for n dimensions is:

d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)² + ⋯ + (x_n − y_n)²)    Eq. 3.2

Figure 3.1. Agglomerative and divisive hierarchical clustering. (Gan, 2007)
Figure 3.2. Dendrogram from agglomerative (bottom-up) clustering technique based on data on human tumors. (Hastie, 2009)
This is used in the agglomerative approach to find clusters/groups with the minimum Euclidean
distance between them to join for the next level up in the hierarchy. This procedure repeats
until all clusters are encompassed by one cluster of the entire data set.
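A small sketch of agglomerative single-link clustering on invented two-dimensional inputs using SciPy; scipy.cluster.hierarchy.dendrogram(Z) would draw the corresponding dendrogram if matplotlib is available:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Invented 2-D data inputs.
X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.8], [9.0, 0.5]])

# Single-link (nearest neighbour) agglomeration using Euclidean distance (Eq. 3.1, 3.2).
Z = linkage(X, method='single', metric='euclidean')

# Cut the hierarchy into, for example, three flat clusters.
print(fcluster(Z, t=3, criterion='maxclust'))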
One of the main reasons hierarchical clustering is such a popular approach is the easily human-
interpretable dendrogram format with which it can be represented (Hastie, 2009). Additionally,
any reasonable method of measuring the distance between clusters can be used provided it can
be applied to matrices. However, hierarchical clustering occasionally encounters difficulty in choosing merge/split points, and once a merge or split is made it cannot be undone (Kamber, 2000). In a hierarchical structure this is critical, as every decision following a merge/split is derived from it. Therefore, if this decision is made poorly, the entire output will be low-quality. A number of hierarchical methods built from the
fundamentals of this approach have been designed to solve the typical issues it is prone to,
including BIRCH (Zhang, Ramakrishnan & Livny, 1997) and CURE (Yun-Tao Qian, Qing-Song
Shi & Qi Wang, 2002).
Hierarchical clustering is a simple but extremely flexible approach for applying unsupervised
learning to any data set. It can be used as an assistive tool to allow specialists to make best use
of their skill. For example, in medical applications such as analysis of EEG graphs, hierarchical
clustering is used to identify and group sections that are alike whilst the neurologist can
evaluate the medical meaning of these areas (Guess & Wilson, 2002). In this way, the work is
delegated to make best use of each individual/component: the computer does the systematic
analysis and the neurologist provides the medical insight.
3.1.2 K-means
K-means is one of the most common approaches to clustering. First demonstrated by
MacQueen (1966) it is designed for quantitative data and defines clusters by a centre point (the
mean). The algorithm begins with the initialisation phase where the number of clusters/centres
is fixed. Then the algorithm enters the iteration phase, iterating the positions of these centres
until they reach a final central rest position (Gan, 2007). The final rest position occurs when the
error function does not change significantly for further iterations. The algorithm is as follows
(Hastie, 2009):
1. For a given assignment C of the data inputs to k clusters, minimise the total cluster variance with respect to {m₁, …, m_k}, yielding the means of the current clusters.
2. Given the means of the current clusters {m₁, …, m_k}, assign each data input to the closest (current) cluster mean.
3. Repeat until assignments no longer change.
The function being minimised is as follows (Hastie, 2009):
C* = min_{C, {m_k}} Σ_{k=1}^{K} N_k Σ_{C(i)=k} ‖x_i − m_k‖²    Eq. 3.3
Where ๐‘ฅ represents the data inputs
and ๐‘๐‘˜ = โˆ‘ ๐ผ(๐ถ(๐‘–)) = ๐‘˜)๐‘
๐‘–=1 . Therefore
๐‘ data inputs are assigned to the ๐‘˜
clusters so that the distance between
the data inputs and the cluster mean is
minimised.
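A compact sketch of these alternating steps (a Lloyd-style iteration in NumPy on invented data, with the cluster means initialised simply by sampling k data inputs rather than by one of the schemes discussed below):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)) + rng.choice([-4.0, 0.0, 4.0], size=(300, 1))   # invented inputs
k = 3

means = X[rng.choice(len(X), size=k, replace=False)]   # initial cluster means

for _ in range(100):
    # Assignment step: each input goes to its closest current mean.
    distances = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    assignment = distances.argmin(axis=1)
    # Update step: each mean moves to the centre of its assigned inputs.
    new_means = np.array([X[assignment == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_means, means):                   # assignments have settled
        break
    means = new_means

print(means)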
A key advantage to using K-means is
that it is effective in terms of
computation even with large data sets.
The computational complexity is
linearly proportional to the size of the
data set, rather than exponentially
(Hastie, 2009). However, due to this
linear approach, it can be slow on high
dimensional data beyond a critical size
(Harrington, 2012; Hastie, 2009).
The performance of K-means is heavily dependent on the initialisation phase. Not only must
the number of clusters ๐‘˜ be defined but also the initiation positions of the centres. The number
of clusters ๐‘˜ depends on the goal you are trying to achieve in the analysis and is usually well
defined in the problem, for example, creating ๐‘˜ customer segments, employing ๐‘˜ sales people
etc. Alternatively, if this information is unavailable, a โ€œrule of thumbโ€ approach commonly taken
is to set ๐‘˜ proportionally to the number of inputs in the data set (Mardia, 1979):
๐‘˜ โ‰ˆ โˆš
๐‘
2
Eq. 3.4
For the algorithm to perform well, it is important to take a reliable approach to defining the initial cluster means. Fortunately, popular solutions to this problem have been proposed as the Forgy approach (Anderberg, 1973), the MacQueen approach (MacQueen, 1966) and the Kaufman approach (Kaufman, 1990). In comparing these, it has been found that the Kaufman approach generally produces the best clustering results (Peña, Lozano & Larrañaga, 1999). In the Kaufman approach, the initial cluster means are found iteratively. The starting point is the input data point closest to the centre of the data set. Following this, centres are chosen by selecting input data points with the highest number of other data points around them.
One of the earliest applications of K-means was in signal and data processing. For example, it is
used for image compression, where a 24-bit image with up to 16 million colours can be compressed to an 8-bit image with only 256 (Alpaydin, 2010). The problem is finding the
optimal 256 colours out of the 16 million in order to retain image quality in compression. This
is a problem of vector quantisation. K-means is still used for this application today.
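A hedged sketch of this vector quantisation idea with scikit-learn; the 'image' here is a random stand-in array, since the point is only the reshaping and palette-replacement steps:

import numpy as np
from sklearn.cluster import KMeans

image = np.random.default_rng(0).integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
pixels = image.reshape(-1, 3).astype(float)

# Learn a palette of 256 colours (the cluster means) from the 24-bit pixel data.
kmeans = KMeans(n_clusters=256, n_init=1, random_state=0).fit(pixels)

# Replace each pixel by its nearest palette colour; pixels now only need an 8-bit index.
palette = kmeans.cluster_centers_.astype(np.uint8)
compressed = palette[kmeans.labels_].reshape(image.shape)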
Figure 3.3. A demonstration of iterations by the K-means clustering
algorithm for simulated input data points. (Hastie, 2009)
The standard K-means algorithm serves its purpose well, but suffers from some limitations and
drawbacks. For this reason, it has been modified, extended and improved in numerous
publications (Chen, Ching & Lin, 2004; Wagstaff et al., 2001). The techniques employed include
a) finding better initial solutions (as discussed above), b) modifying the original algorithm and
c) incorporating techniques from other algorithms into K-means. Wagstaff et al. (2001)
recognised that the experimenter running the algorithm is likely to have some background
knowledge of the data set being analysed. By communicating this knowledge to the algorithm,
through adding additional constraints in the clustering process, Wagstaff et al. (2001) improved
the performance of K-means from 58% to 98.6%. In a separate experiment, Chen, Ching & Lin
(2004) found that incorporating techniques from hierarchical methods into K-means increased
clustering accuracy. This literature shows that K-means is a versatile approach to clustering
which can be tailored to specific problems in order to significantly improve its accuracy.
4 Steps in developing a machine learning application
So far this review has focused on the theoretical background of machine learning techniques.
This section considers practically applying this theoretical knowledge to data related problems
in any field of work, from collecting data through to use of the application (Harrington, 2012).
4.1 Collect Data
The first step is to collect the data you wish to analyse. Sources of data may include scraping a
website for data, extracting information from an RSS feed or API, existing databases, running
an experiment to collect data and other sources of publicly available data.
4.2 Choose Algorithm
There are a huge number of machine learning algorithms out there, so how do we choose the
right one? The first decision is between supervised learning and unsupervised learning. If you
are attempting to predict or forecast then you should use supervised learning. You will also need
training data with a set of inputs connected to outputs. Otherwise, you should consider
unsupervised learning. At the next level, choose between regression or classification (supervised
learning) and clustering or density estimation (unsupervised learning). Finally at the last level,
there are tens of different algorithms you could use under each of these categories. There is no
single best algorithm for all problems (Harrington, 2012; Wolpert & Macready, 1997).
Understanding the properties of the algorithms is helpful, but to find the best algorithm for your problem the strategy should be to test different algorithms and choose by trial and
error (Salter-Townshend et al., 2012).
4.3 Prepare Data
The next step is to prepare the data in a usable format. Certain algorithms require the
features/training data to be formatted in a particular way, but this is trivial. The data first needs
to be cleaned, integrated and selected (Zhang, Zhang & Yang, 2003; Kamber, 2000).
Data cleaning involves filling out any missing values in features of the training data, removing
noise, filtering out outliers and correcting inconsistent data. To fill out missing values, you can
take a biased or unbiased approach. An example of the biased approach is to use a probable value to fill in the missing value, whereas the unbiased approach would be simply removing the feature/example completely. The biased approach is popular when a large proportion of values is missing. Noise causes random error and variance in the data; this is reduced by binning (Shi & Yu, 2006) or clustering the data in order to isolate and remove outliers.
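A short sketch of some of these cleaning steps using pandas (the column names, fill strategy and outlier rule are all invented for illustration):

import pandas as pd

# Invented raw data with a missing value and an obvious outlier.
df = pd.DataFrame({'weight': [2.7, 3.0, None, 2.8, 95.0],
                   'wingspan': [30, 42, 38, 40, 41]})

# Biased fill: replace the missing weight with a probable value (here the median).
df['weight'] = df['weight'].fillna(df['weight'].median())

# Simple outlier handling: drop rows far outside the interquartile range.
q1, q3 = df['weight'].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df['weight'] >= q1 - 1.5 * iqr) & (df['weight'] <= q3 + 1.5 * iqr)]

# Binning a numeric feature into discrete ranges to smooth noise.
df['wingspan_bin'] = pd.cut(df['wingspan'], bins=3, labels=['small', 'medium', 'large'])
print(df)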
Data integration is simply merging data from multiple sources. Data selection is the problem of
selecting the right data from the sample to use as the training data set. Generally the method of
selecting data is heavily dependent on the type of data being filtered; however, Sun et al. (2013)
explored an innovative generalised approach using dynamic weights for classification by putting
a greater weight on data associated with the most features and eliminating redundant ones,
demonstrating promising results.
4.4 Train Algorithm
Now that all the data is cleaned and optimised, we can proceed to train the algorithm (for
supervised learning). For unsupervised learning, this stage is just running the algorithm on the
data as we don't have target values to train with. For both learning types, this is where the artificially intelligent 'machine learning' occurs and where the real value of machine learning algorithms is exploited (Russell, 2010). The output of this step is raw 'knowledge'.
4.5 Verify Results
Before using the new-found 'knowledge', it is important to verify/test it. In supervised learning, you can test the model you've created against your existing real data set to measure the accuracy.
If it is not satisfactory, you can go back to the initial data preparation stages and optimise.
Verifying the accuracy of unsupervised learning algorithms is significantly more challenging
and beyond the scope of this review.
4.6 Use Application
Finally you can use the knowledge evaluated by your algorithm. Depending on the nature of
your machine learning problem, the raw data output may be sufficient or you may choose to
produce visualisations for the results (Leban, 2013).
The beauty of machine learning is that we do not need to program a solution to the problem line by line; the machine learning algorithm will learn from data using statistical analysis instead. But the machine learning algorithm itself still needs to be developed.
Fortunately there is no single piece of software or programming language that you must use to
prepare your machine learning application. The most commonly used languages and environments are Python,
Octave, R and Matlab (Ng, 2014; Freitas, 2013; Alfaro, Gamez & Garcia, 2013; Harrington, 2012).
Python is one of the most widely used because of its clear syntax, simple text manipulation and
established use throughout industries and organisations (Harrington, 2012).
With this information, you are now equipped with the knowledge and practical know-how to
develop a machine learning application.
5 Data Mining
In the last few centuries, innovation in the human species has accelerated rapidly. With the
invention of the World Wide Web and adoption of new technologies on a global scale we are
using technology like never before. The by-product of the Information Age is vast amounts of
data, growing from terabytes into petabytes and exabytes, with immense hidden value (Goodman, Kamath & Kumar, 2007). The sheer size of these databases and data sets makes it impossible for a
human to comprehend or analyse manually. Data mining is quite literally using machine
learning approaches to extract underlying information and knowledge from data (Kamber,
2000). The knowledge can contribute greatly to business strategies or scientific and medical
research.
The format of the knowledge extracted depends on the machine learning algorithm used. If
supervised learning approaches are applied it is possible to identify patterns in data that can be
used to model it (Kantardzic, 2011). Pattern recognition and learning is one of the most widely
applied uses for data mining and machine learning.
Unsupervised approaches are also used in data mining. Unsupervised learning makes it possible
to identify natural groupings in data. The main application of this in data mining is feature
learning whereby useful features are extracted from a large data set which can then be used for
classification (Coates & Ng, 2012).
Applications of data mining can be seen in medicine, telecommunications, finance, science,
engineering and more. For example in medicine, machine learning is frequently being used to
improve diagnosis of medical conditions such as cancer and schizophrenia. Data mining of
clinical data such as MRI scans allows computers to learn how to recognise cancers and
underlying conditions in new patients more reliably than doctors (Savage, 2012; Ryszard S
Michalski, Ivan Bratko & Miroslav Kubat, 1998). In finance, data mining is now being used to
assist evaluation of credit risk of individuals and companies ahead of providing financial support
through loans (Correia et al., 1993). This is arguably the most important stage in the process of
offering a loan but firms have previously struggled to accurately predict the risk of default. With
the large data sets that have been accumulated in this domain, data mining is providing new
insights and patterns to help accurately manage these risks for financial organisations.
Data mining does not yet have any social stigma attached to it. However, there are ethical issues
and social impacts of data mining. For example, web mining involves scraping data from the
internet and mining it for knowledge (Etzioni, 1996). This data can often include personal data
from web users which is used for the profit of organisations (the web miners) (Van Wel &
Royakkers, 2004). Current research suggests that no harm is currently being done to web users
as a result of this, but with the rise of 'big data' there is growing demand for regulation and for ensuring that the power of data mining is used for 'good' (Etlinger, 2014). As long as users remain
in control and fully understand the data they offer when using the web, the threat to privacy
can be neutralised. However, the risk of this line of consent and understanding becoming
blurred is high. It is important for governments and organisations to acknowledge this and take
a pro-active approach with regulation.
6 Discussion
6.1 Literature
In writing this review it has become clear that supervised machine learning algorithms simply
apply statistical approaches to data analysis in a scalable way. In fact, one of the best technical
sources of information on regression and gradient descent was a maths textbook (Kreyszig,
2006). It provided a clear explanation of the techniques despite not directly relating them to
machine learning. This has demonstrated that machine learning has come a long way in its
scientific and mathematical approach since originally branching out of artificial intelligence.
The separation originally arose because statistical analysis fell out of favour within artificial intelligence. However, it turned out that within these statistical approaches (machine learning) lay the most practical discoveries and applications of all.
Unsupervised learning is perhaps more closely related to artificial intelligence. The frequently
cited textbook by Russell (2010), titled "Artificial Intelligence", actually served as an excellent
source of insight into unsupervised machine learning algorithms, particularly hierarchical
algorithms and the K-means approach. This is probably because unsupervised learning deals
with the more mysterious (affiliated with artificial intelligence) type of data: unlabelled data.
Additionally, it seeks to extract knowledge or 'intelligence' from this data. Unsupervised
learning is particularly applicable to data mining through the application of feature learning.
With feature learning, it is possible to take a huge set of data uninterpretable by humans and
turn it into something that you can perform intricate data analysis on and obtain realised value.
It was surprising to find that with just the elementary principles covered in this review it is
possible to get started on real machine learning applications, as made apparent when discussing
the review with professionals in industry.
6.2 Future Developments
Machine learning is still a new scientific field with huge opportunities for growth and
development. Rather than working only on large, static data sets, it is important to devise methods of applying machine learning to transient data and data streams (Gama, 2012). There are significant challenges in maintaining an accurate decision model when the data used to develop that model is continually changing.
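As a purely illustrative sketch of this idea (not a method proposed by Gama, 2012), an incremental learner can be updated chunk by chunk as new data arrives rather than being retrained on a static set; the simulated, drifting stream and the use of scikit-learn's SGDClassifier below are assumptions made only for the example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
model = SGDClassifier()        # a linear model that supports incremental updates
classes = np.array([0, 1])     # must be declared up front for partial_fit

# Simulate a data stream arriving in chunks whose distribution drifts over time
for t in range(100):
    drift = 0.01 * t
    X_chunk = rng.normal(loc=drift, size=(50, 10))
    y_chunk = (X_chunk[:, 0] > drift).astype(int)

    # Update the existing model with the new chunk instead of retraining from scratch
    model.partial_fit(X_chunk, y_chunk, classes=classes)

print(model.score(X_chunk, y_chunk))  # accuracy on the most recent chunk
```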
It has become clear that a bias-variance trade-off exists in supervised learning problems (Sharma, Aiken & Nori, 2014). Bias and variance are both sources of error. Ideally, the model should fit the training data closely but also generalise effectively to new data. Past research has focused on reducing variance-related error. However, as data sets grow larger (Cambria et al., 2013), it becomes important to produce models that fit these larger data sets closely. There is therefore a need to focus more specifically on bias-related error.
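For reference, the standard decomposition of expected squared prediction error (a textbook result, not an equation appearing in the sources cited above) makes the two error terms explicit. Writing the true function as f, the observation noise as epsilon with variance sigma squared, and the model trained on a random sample D as h_D:

```latex
\mathbb{E}_{D,\varepsilon}\!\left[(y - h_D(x))^2\right]
  = \underbrace{\left(\mathbb{E}_D[h_D(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\left(h_D(x) - \mathbb{E}_D[h_D(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Reducing bias (for example, by using a more flexible model) typically increases variance and vice versa, which is precisely the trade-off described above.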
We now have access to more computational power than ever before. However, when comparing
computing technology to the human brain, there is a clear discrepancy between the two in terms
of how fast data is processed and how much energy is consumed to do so (Norvig, 2012). A computer can process data 100 million times faster than the brain, but it requires 20,000 watts of power to do so; the brain runs on just 20 watts. Yet machine learning systems are still only just beginning to approach the effectiveness of the brain. We need to allocate resources to understanding the brain and using it to inspire circuit and hardware design in order to make artificial intelligence and learning processes more efficient.
7 Conclusion
There are two main approaches to machine learning: supervised learning and unsupervised
learning. These can be further broken down by different algorithms used to complete supervised
and unsupervised learning tasks. In supervised learning, the types of algorithm include regression (optimised using methods such as gradient descent) and decision tree classification (ID3, bagging, boosting and random forests). In unsupervised learning, the types of algorithm include hierarchical and K-means clustering.
Machine learning can be applied to facial recognition, medical diagnosis, search engines,
shopping cart recommendation systems and much more. The common indicator of a good
application is that a large source of data exists related to the problem. Machine learning
algorithms can then use their tailored decision making to translate that data into usable
knowledge, producing value.
The process of developing a machine learning application is summarised as follows: start by collecting data, choose an appropriate algorithm, prepare the data, train the algorithm with sample data, verify the results and finally apply the knowledge produced by the algorithm.
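As a minimal, purely illustrative sketch of that workflow (the toy data set, the model choice and the scikit-learn usage below are assumptions for the example, not steps prescribed by this review), the whole cycle can be run in a few lines:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                        # collect data
X_train, X_test, y_train, y_test = train_test_split(     # prepare data: hold out a test set
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=3)               # choose an algorithm
model.fit(X_train, y_train)                               # train the algorithm

print(accuracy_score(y_test, model.predict(X_test)))      # verify the results

print(model.predict(X_test[:1]))                          # use the application on new input
```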
Data mining is a growing application of machine learning, as the World Wide Web and the Information Age have introduced data sets on a scale never seen before. Going forward, it is important that data mining is used only ethically and never to the detriment of web users.
As most of the development in machine learning has happened in the past 30 years, there is still
much to be done. We should continue to use the human brain as a North Star in guiding further
research. The goal is to realise true artificial intelligence through improving machine learning
algorithms which may one day compete with the performance of our own brains.
8 References
Akaike, H. (1974) A new look at the statistical model identification. IEEE Transactions on Automatic Control. AC-19 (6), 716-723.
Alfaro, E., Gamez, M. & Garcia, N. (2013) adabag: An R Package for Classification with Boosting and Bagging. Journal of Statistical Software. 54 (2), 1-35.
Allaby, M. (2010) Ockham's razor, A Dictionary of Ecology. Oxford University Press.
Alpaydin, E. (2010) Introduction to machine learning. 2nd edition. Cambridge, Mass. ; London,
MIT Press.
Anderberg, M. R. (1973) Cluster analysis for applications. New York ; London, Academic Press.
Ayodele, T. O. (2010) Types of Machine Learning Algorithms, New Advances in Machine
Learning, Yagang Zhang (Ed.), ISBN: 978-953-307-034-6, InTech.
Banfield, R. E., Hall, L. O., Bowyer, K. W. & Kegelmeyer, K. W. (2007) A comparison of
decision tree ensemble creation techniques. IEEE Transactions on Pattern Analysis and
Machine Intelligence. 29 (1), 173-180.
Bartholomew-Biggs, M. (2008) Nonlinear Optimization with Engineering Applications.
Dordrecht, Springer.
Beyad, Y. & Maeder, M. (2013) Multivariate linear regression with missing values. Analytica
Chimica Acta. 796 (0), 38-41.
Breiman, L. (1996) Bagging predictors. Machine Learning. 24 (2), 123-140.
Breiman, L. (2001) Random Forests. Machine Learning. 45 (1), 5-32.
Cambria, E., Huang, G., Zhou, H., Vong, C., Lin, J., Yin, J., Cai, Z., Liu, Q., Li, K., Feng, L., Ong,
Y., Lim, M., Akusok, A., Lendasse, A., Corona, F., Nian, R., Miche, Y., Gastaldo, P., Zunino, R.,
Decherchi, S., Yang, X., Mao, K., Oh, B., Jeon, J., Toh, K., Kim, J., Yu, H., Chen, Y. & Liu, J.
(2013) Extreme Learning Machines. IEEE Intelligent Systems. 28 (6), 30-59.
Chen, J., Ching, R. K. H. & Lin, Y. (2004) An extended study of the K-means algorithm for data clustering and its applications. Journal of the Operational Research Society. 55 (9), 976-987.
Coates, A. & Ng, A. Y. (2012) Learning feature representations with K-means. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 7700, 561-580.
Correia, J., Costa, E., Ferreira, J. & Jamet, T. (1993) An Application of Machine Learning in the
Domain of Loan Analysis. Lecture Notes in Computer Science. 667, 414-419.
Criminisi, A. & Shotton, J. (2013) Decision Forests for Computer Vision and Medical Image Analysis. Springer.
Dietterich, T. (2000a) Ensemble methods in machine learning. Multiple Classifier Systems.
1857, 1-15.
Dietterich, T. (2000b) An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning. 40 (2), 139-157.
Etlinger, S. (2014) What do we do with all this big data? TED.com,
https://www.ted.com/talks/susan_etlinger_what_do_we_do_with_all_this_big_data.
Etzioni, O. (1996) The World-Wide Web: Quagmire or Gold Mine? Communications of the ACM. 39 (11), 65-68.
de Freitas, N. (2013) Machine Learning Lecture Course. University of British Columbia / University of Oxford.
Freund, Y. & Schapire, R. E. (1996) Experiments with a new boosting algorithm. ICML. pp.148-
156.
Freund, Y. (1995) Boosting a weak learning algorithm by majority. Information and Computation. 121 (2), 256-285.
Freund, Y. & Schapire, R. E. (1995) A decision-theoretic generalization of on-line learning and an application to boosting. Lecture Notes in Computer Science. 904, 23-37.
Gama, J. (2012) A survey on learning from data streams: current and future trends. Progress in
Artificial Intelligence. 1 (1), 45-55.
Gan, G. (2007) Data clustering: theory, algorithms, and applications. Philadelphia, PA, Society for Industrial and Applied Mathematics.
Goodman, A., Kamath, C. & Kumar, V. (2007) Statistical analysis and data mining: Data Analysis in the 21st Century. Statistical Analysis and Data Mining.
Guess, M. J. & Wilson, S. B. (2002) Introduction to hierarchical clustering. Journal of Clinical
Neurophysiology. 19 (2), 144-151.
Harrington, P. (2012) Machine learning in action. Shelter Island, N.Y., Manning Publications.
Hastie, T. (2009) The elements of statistical learning: data mining, inference, and prediction. 2nd edition. New York, Springer.
Kamber, M. (2000) Data mining: concepts and techniques. San Francisco; London, Morgan Kaufmann.
Kantardzic, M. (2011) Data Mining Concepts, Models, Methods, and Algorithms. 2nd edition.
Hoboken, Wiley.
Kaufman, L. (1990) Finding groups in data: an introduction to cluster analysis. Wiley.
Kiwiel, K. C. (2001) Convergence and efficiency of subgradient methods for quasiconvex
minimization. Mathematical Programming, Series B. 90 (1), 1-25.
Kreyszig, E. (2006) Advanced engineering mathematics. 9th, International edition. Hoboken,
N.J., Wiley.
Larose, D. T. (2005) k-Nearest Neighbor Algorithm. Hoboken, NJ, USA.
Leban, G. (2013) Information visualization using machine learning. Informatica (Slovenia). 37
(1), 109-110.
Lemmens, A. & Croux, C. (2006) Bagging and boosting classification trees to predict churn. Journal of Marketing Research.
Long, P. M. & Servedio, R. A. (2009) Random classification noise defeats all convex potential boosters. Machine Learning. 1-18.
MacQueen, J. B. (1966) Some methods for classification and analysis of multivariate observations.
Mardia, K. V. (1979) Multivariate analysis. London, Academic Press.
Mingers, J. (1989) An empirical comparison of selection measures for decision-tree induction.
Machine Learning. 3 (4), 319-342.
Mitchell, T. M. (1997) Machine learning. Boston, Mass., WCB/McGraw-Hill.
Myles, A. J., Feudale, R. N., Liu, Y., Woody, N. A. & Brown, S. D. (2004) An introduction to
decision tree modeling. Journal of Chemometrics. 18 (6), 275-285.
Ng, A. (2014) Machine Learning (Coursera), Stanford University. coursera.org.
Norvig, P. (2012) Artificial intelligence: A new future. New Scientist. 216 (2889), vi-vii.
Peña, J. M., Lozano, J. A. & Larrañaga, P. (1999) An empirical comparison of four initialization methods for the K-Means algorithm. Pattern Recognition Letters. 20 (10), 1027-1040.
Pino-Mejías, R., Cubiles-de-la-Vega, M. D., López-Coello, M., Silva-Ramírez, E. & Jiménez-Gamero, M. D. (2004) Bagging Classification Models with Reduced Bootstrap. In: Fred, A., Caelli, T., Duin, R. W., Campilho, A. & de Ridder, D. (eds.). Springer Berlin Heidelberg. pp. 966-973.
Quinlan, J. R. (1993) C4.5: programs for machine learning. Amsterdam, Morgan Kaufmann.
Quinlan, J. R. (1986) Induction of decision trees. Machine Learning. 1 (1), 81-106.
Robnik-Sikonja, M. (2004) Improving random forests. Machine Learning: Ecml 2004,
Proceedings. 3201, 359-370.
Rohlf, F. J. (1982) Single-link clustering algorithms. Handbook of Statistics. 2, 267-284.
Russell, S. J. (2010) Artificial intelligence: a modern approach. 3rd, International edition. Boston, Mass.; London, Pearson.
Michalski, R. S., Bratko, I. & Kubat, M. (1998) Machine learning and data mining: methods and applications. Chichester, Wiley.
Salter-Townshend, M., White, A., Gollini, I. & Murphy, T. B. (2012) Review of statistical
network analysis: models, algorithms, and software. Statistical Analysis and Data Mining. 5
(4), 243-264.
Savage, N. (2012) Better Medicine Through Machine Learning. Communications of the ACM. 55
(1), 17-19.
Shao, X., Zhang, G., Li, P. & Chen, Y. (2001) Application of ID3 algorithm in knowledge
acquisition for tolerance design. Journal of Materials Processing Tech. 117 (1), 66-74.
Sharma, R., Aiken, A. & Nori, A. V. (2014) Bias-variance tradeoffs in program analysis.
Shi, T. & Yu, B. (2006) Machine Learning and Data Mining - Binning in Gaussian kernel
regularization. Statistica Sinica. 16 (2), 541-568.
Skurichina, M. & Duin, R. P. W. (1998) Bagging for linear classifiers. Pattern Recognition. 31
(7), 909-930.
Snyman, J. A. (2005) Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-based Algorithms. Dordrecht, Springer.
Stigler, S. M. (1981) Gauss and the Invention of Least Squares. The Annals of Statistics. 9 (3),
465-474.
Sun, X., Liu, Y., Chen, H., Han, J., Wang, K. & Xu, M. (2013) Feature selection using dynamic
weights for classification. Knowledge-Based Systems. 37, 541-549.
Svetnik, V., Liaw, A., Tong, C., Culberson, J., Sheridan, R. & Feuston, B. (2003) Random forest: A classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences. 43 (6), 1947-1958.
Van Wel, L. & Royakkers, L. (2004) Ethical issues in web data mining. Ethics and Information
Technology. 6 (2), 129-140.
Wagstaff, K., Cardie, C., Rogers, S. & Schrödl, S. (2001) Constrained k-means clustering with background knowledge. ICML. pp. 577-584.
Wolpert, D. H. & Macready, W. G. (1997) No free lunch theorems for optimization. IEEE
Transactions on Evolutionary Computation. 1 (1), 67-82.
Yan, K., Zhu, J. & Qiang, S. (2007) The application of ID3 algorithm in aviation marketing.
Qian, Y., Shi, Q. & Wang, Q. (2002) CURE-NS: a hierarchical clustering algorithm with new shrinking scheme.
Zhang, S. C., Zhang, C. Q. & Yang, Q. (2003) Data preparation for data mining. Applied
Artificial Intelligence. 17 (5-6), 375-381.
Zhang, T., Ramakrishnan, R. & Livny, M. (1997) BIRCH: A New Data Clustering Algorithm and
Its Applications. Data Mining and Knowledge Discovery. 1 (2), 141-182.
9 Acknowledgements
The author would like to acknowledge and thank Dr Frederic Cegla (Senior Lecturer at Imperial
College London) for his supervision of this literature review project. Thanks are also due to Shaun Dowling (Co-founder at Interpretive.io), Barney Hussey-Yeo (Data Scientist at Wonga), Ferenc Huszar (Data Scientist at Balderton Capital) and Joseph Root (Co-founder at Permutive.com) for sharing their insights on machine learning.

More Related Content

What's hot

Bb0020 managing information
Bb0020  managing informationBb0020  managing information
Bb0020 managing informationsmumbahelp
ย 
โ€œImprovingโ€ prediction of human behavior using behavior modification
โ€œImprovingโ€ prediction of human behavior using behavior modificationโ€œImprovingโ€ prediction of human behavior using behavior modification
โ€œImprovingโ€ prediction of human behavior using behavior modificationGalit Shmueli
ย 
Survey of the Euro Currency Fluctuation by Using Data Mining
Survey of the Euro Currency Fluctuation by Using Data MiningSurvey of the Euro Currency Fluctuation by Using Data Mining
Survey of the Euro Currency Fluctuation by Using Data Miningijcsit
ย 
Data Science tutorial for beginner level to advanced level | Data Science pro...
Data Science tutorial for beginner level to advanced level | Data Science pro...Data Science tutorial for beginner level to advanced level | Data Science pro...
Data Science tutorial for beginner level to advanced level | Data Science pro...IQ Online Training
ย 
A forecasting of stock trading price using time series information based on b...
A forecasting of stock trading price using time series information based on b...A forecasting of stock trading price using time series information based on b...
A forecasting of stock trading price using time series information based on b...IJECEIAES
ย 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataIRJET Journal
ย 
Machine Learning Engineer Salary, Roles And Responsibilities, Skills and Resu...
Machine Learning Engineer Salary, Roles And Responsibilities, Skills and Resu...Machine Learning Engineer Salary, Roles And Responsibilities, Skills and Resu...
Machine Learning Engineer Salary, Roles And Responsibilities, Skills and Resu...Simplilearn
ย 
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET Journal
ย 
Empirical analysis of ensemble methods for the classification of robocalls in...
Empirical analysis of ensemble methods for the classification of robocalls in...Empirical analysis of ensemble methods for the classification of robocalls in...
Empirical analysis of ensemble methods for the classification of robocalls in...IJECEIAES
ย 
Behavioral Big Data & Healthcare Research
Behavioral Big Data & Healthcare ResearchBehavioral Big Data & Healthcare Research
Behavioral Big Data & Healthcare ResearchGalit Shmueli
ย 
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...theijes
ย 
A SMAC based Business Model for Data Embezzlement System
A SMAC based Business Model for Data Embezzlement SystemA SMAC based Business Model for Data Embezzlement System
A SMAC based Business Model for Data Embezzlement SystemEditor IJMTER
ย 
Evaluating the impact of removing less important terms on sentiment analysis
Evaluating the impact of removing less important terms on sentiment analysisEvaluating the impact of removing less important terms on sentiment analysis
Evaluating the impact of removing less important terms on sentiment analysisConference Papers
ย 
F033026029
F033026029F033026029
F033026029ijceronline
ย 
Data Flow Diagram (DFD) in Developing Online Product Monitoring System (OPMS)...
Data Flow Diagram (DFD) in Developing Online Product Monitoring System (OPMS)...Data Flow Diagram (DFD) in Developing Online Product Monitoring System (OPMS)...
Data Flow Diagram (DFD) in Developing Online Product Monitoring System (OPMS)...Bryan Guibijar
ย 
The data science revolution in insurance
The data science revolution in insuranceThe data science revolution in insurance
The data science revolution in insuranceStefano Perfetti
ย 
Data science landscape in the insurance industry
Data science landscape in the insurance industryData science landscape in the insurance industry
Data science landscape in the insurance industryStefano Perfetti
ย 
A tutorial on secure outsourcing of large scalecomputation for big data
A tutorial on secure outsourcing of large scalecomputation for big dataA tutorial on secure outsourcing of large scalecomputation for big data
A tutorial on secure outsourcing of large scalecomputation for big dataredpel dot com
ย 
Machine Learning On Big Data: Opportunities And Challenges- Future Research D...
Machine Learning On Big Data: Opportunities And Challenges- Future Research D...Machine Learning On Big Data: Opportunities And Challenges- Future Research D...
Machine Learning On Big Data: Opportunities And Challenges- Future Research D...PhD Assistance
ย 

What's hot (20)

Bb0020 managing information
Bb0020  managing informationBb0020  managing information
Bb0020 managing information
ย 
How has ai changed manufacturing
How has ai changed manufacturingHow has ai changed manufacturing
How has ai changed manufacturing
ย 
โ€œImprovingโ€ prediction of human behavior using behavior modification
โ€œImprovingโ€ prediction of human behavior using behavior modificationโ€œImprovingโ€ prediction of human behavior using behavior modification
โ€œImprovingโ€ prediction of human behavior using behavior modification
ย 
Survey of the Euro Currency Fluctuation by Using Data Mining
Survey of the Euro Currency Fluctuation by Using Data MiningSurvey of the Euro Currency Fluctuation by Using Data Mining
Survey of the Euro Currency Fluctuation by Using Data Mining
ย 
Data Science tutorial for beginner level to advanced level | Data Science pro...
Data Science tutorial for beginner level to advanced level | Data Science pro...Data Science tutorial for beginner level to advanced level | Data Science pro...
Data Science tutorial for beginner level to advanced level | Data Science pro...
ย 
A forecasting of stock trading price using time series information based on b...
A forecasting of stock trading price using time series information based on b...A forecasting of stock trading price using time series information based on b...
A forecasting of stock trading price using time series information based on b...
ย 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter Data
ย 
Machine Learning Engineer Salary, Roles And Responsibilities, Skills and Resu...
Machine Learning Engineer Salary, Roles And Responsibilities, Skills and Resu...Machine Learning Engineer Salary, Roles And Responsibilities, Skills and Resu...
Machine Learning Engineer Salary, Roles And Responsibilities, Skills and Resu...
ย 
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
ย 
Empirical analysis of ensemble methods for the classification of robocalls in...
Empirical analysis of ensemble methods for the classification of robocalls in...Empirical analysis of ensemble methods for the classification of robocalls in...
Empirical analysis of ensemble methods for the classification of robocalls in...
ย 
Behavioral Big Data & Healthcare Research
Behavioral Big Data & Healthcare ResearchBehavioral Big Data & Healthcare Research
Behavioral Big Data & Healthcare Research
ย 
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
ย 
A SMAC based Business Model for Data Embezzlement System
A SMAC based Business Model for Data Embezzlement SystemA SMAC based Business Model for Data Embezzlement System
A SMAC based Business Model for Data Embezzlement System
ย 
Evaluating the impact of removing less important terms on sentiment analysis
Evaluating the impact of removing less important terms on sentiment analysisEvaluating the impact of removing less important terms on sentiment analysis
Evaluating the impact of removing less important terms on sentiment analysis
ย 
F033026029
F033026029F033026029
F033026029
ย 
Data Flow Diagram (DFD) in Developing Online Product Monitoring System (OPMS)...
Data Flow Diagram (DFD) in Developing Online Product Monitoring System (OPMS)...Data Flow Diagram (DFD) in Developing Online Product Monitoring System (OPMS)...
Data Flow Diagram (DFD) in Developing Online Product Monitoring System (OPMS)...
ย 
The data science revolution in insurance
The data science revolution in insuranceThe data science revolution in insurance
The data science revolution in insurance
ย 
Data science landscape in the insurance industry
Data science landscape in the insurance industryData science landscape in the insurance industry
Data science landscape in the insurance industry
ย 
A tutorial on secure outsourcing of large scalecomputation for big data
A tutorial on secure outsourcing of large scalecomputation for big dataA tutorial on secure outsourcing of large scalecomputation for big data
A tutorial on secure outsourcing of large scalecomputation for big data
ย 
Machine Learning On Big Data: Opportunities And Challenges- Future Research D...
Machine Learning On Big Data: Opportunities And Challenges- Future Research D...Machine Learning On Big Data: Opportunities And Challenges- Future Research D...
Machine Learning On Big Data: Opportunities And Challenges- Future Research D...
ย 

Similar to Guy Riese Literature Review

Case study on machine learning
Case study on machine learningCase study on machine learning
Case study on machine learningHarshitBarde
ย 
Eckovation Machine Learning
Eckovation Machine LearningEckovation Machine Learning
Eckovation Machine LearningShikhar Srivastava
ย 
YASH DATA SCIENCE SEMINAR.pptx
YASH DATA SCIENCE SEMINAR.pptxYASH DATA SCIENCE SEMINAR.pptx
YASH DATA SCIENCE SEMINAR.pptxYashShiva3
ย 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applicationsSubrat Swain
ย 
what-is-machine-learning-and-its-importance-in-todays-world.pdf
what-is-machine-learning-and-its-importance-in-todays-world.pdfwhat-is-machine-learning-and-its-importance-in-todays-world.pdf
what-is-machine-learning-and-its-importance-in-todays-world.pdfTemok IT Services
ย 
Data Mining
Data MiningData Mining
Data MiningAnbreenJaved
ย 
McKinsey Global Institute Big data The next frontier for innova.docx
McKinsey Global Institute Big data The next frontier for innova.docxMcKinsey Global Institute Big data The next frontier for innova.docx
McKinsey Global Institute Big data The next frontier for innova.docxandreecapon
ย 
Odsc machine-learning-guide-v1
Odsc machine-learning-guide-v1Odsc machine-learning-guide-v1
Odsc machine-learning-guide-v1Harsh Khatke
ย 
IRJET- Machine Learning: Introduction, Algorithms and Implementation
IRJET-  	  Machine Learning: Introduction, Algorithms and ImplementationIRJET-  	  Machine Learning: Introduction, Algorithms and Implementation
IRJET- Machine Learning: Introduction, Algorithms and ImplementationIRJET Journal
ย 
A LITERATURE REVIEW ON DATAMINING
A LITERATURE REVIEW ON DATAMININGA LITERATURE REVIEW ON DATAMINING
A LITERATURE REVIEW ON DATAMININGCarrie Romero
ย 
Mastering Data Science A Comprehensive Introduction.docx
Mastering Data Science A Comprehensive Introduction.docxMastering Data Science A Comprehensive Introduction.docx
Mastering Data Science A Comprehensive Introduction.docxworkshayesteh
ย 
Data Mining @ Information Age
Data Mining @ Information AgeData Mining @ Information Age
Data Mining @ Information AgeIIRindia
ย 
thesis_jinxing_lin
thesis_jinxing_linthesis_jinxing_lin
thesis_jinxing_linjinxing lin
ย 
Given the scenario, your role, and the information provided by the
Given the scenario, your role, and the information provided by theGiven the scenario, your role, and the information provided by the
Given the scenario, your role, and the information provided by theMatthewTennant613
ย 
Machine Learning Tutorial for Beginners
Machine Learning Tutorial for BeginnersMachine Learning Tutorial for Beginners
Machine Learning Tutorial for Beginnersgrinu
ย 
Barga, roger. predictive analytics with microsoft azure machine learning
Barga, roger. predictive analytics with microsoft azure machine learningBarga, roger. predictive analytics with microsoft azure machine learning
Barga, roger. predictive analytics with microsoft azure machine learningmaldonadojorge
ย 
Training_Report_on_Machine_Learning.docx
Training_Report_on_Machine_Learning.docxTraining_Report_on_Machine_Learning.docx
Training_Report_on_Machine_Learning.docxShubhamBishnoi14
ย 
Data Mining Framework for Network Intrusion Detection using Efficient Techniques
Data Mining Framework for Network Intrusion Detection using Efficient TechniquesData Mining Framework for Network Intrusion Detection using Efficient Techniques
Data Mining Framework for Network Intrusion Detection using Efficient TechniquesIJAEMSJORNAL
ย 
Machine Learning
Machine LearningMachine Learning
Machine LearningM Abhishek Dora
ย 

Similar to Guy Riese Literature Review (20)

Case study on machine learning
Case study on machine learningCase study on machine learning
Case study on machine learning
ย 
Eckovation Machine Learning
Eckovation Machine LearningEckovation Machine Learning
Eckovation Machine Learning
ย 
YASH DATA SCIENCE SEMINAR.pptx
YASH DATA SCIENCE SEMINAR.pptxYASH DATA SCIENCE SEMINAR.pptx
YASH DATA SCIENCE SEMINAR.pptx
ย 
Study on Positive and Negative Rule Based Mining Techniques for E-Commerce Ap...
Study on Positive and Negative Rule Based Mining Techniques for E-Commerce Ap...Study on Positive and Negative Rule Based Mining Techniques for E-Commerce Ap...
Study on Positive and Negative Rule Based Mining Techniques for E-Commerce Ap...
ย 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applications
ย 
what-is-machine-learning-and-its-importance-in-todays-world.pdf
what-is-machine-learning-and-its-importance-in-todays-world.pdfwhat-is-machine-learning-and-its-importance-in-todays-world.pdf
what-is-machine-learning-and-its-importance-in-todays-world.pdf
ย 
Data Mining
Data MiningData Mining
Data Mining
ย 
McKinsey Global Institute Big data The next frontier for innova.docx
McKinsey Global Institute Big data The next frontier for innova.docxMcKinsey Global Institute Big data The next frontier for innova.docx
McKinsey Global Institute Big data The next frontier for innova.docx
ย 
Odsc machine-learning-guide-v1
Odsc machine-learning-guide-v1Odsc machine-learning-guide-v1
Odsc machine-learning-guide-v1
ย 
IRJET- Machine Learning: Introduction, Algorithms and Implementation
IRJET-  	  Machine Learning: Introduction, Algorithms and ImplementationIRJET-  	  Machine Learning: Introduction, Algorithms and Implementation
IRJET- Machine Learning: Introduction, Algorithms and Implementation
ย 
A LITERATURE REVIEW ON DATAMINING
A LITERATURE REVIEW ON DATAMININGA LITERATURE REVIEW ON DATAMINING
A LITERATURE REVIEW ON DATAMINING
ย 
Mastering Data Science A Comprehensive Introduction.docx
Mastering Data Science A Comprehensive Introduction.docxMastering Data Science A Comprehensive Introduction.docx
Mastering Data Science A Comprehensive Introduction.docx
ย 
Data Mining @ Information Age
Data Mining @ Information AgeData Mining @ Information Age
Data Mining @ Information Age
ย 
thesis_jinxing_lin
thesis_jinxing_linthesis_jinxing_lin
thesis_jinxing_lin
ย 
Given the scenario, your role, and the information provided by the
Given the scenario, your role, and the information provided by theGiven the scenario, your role, and the information provided by the
Given the scenario, your role, and the information provided by the
ย 
Machine Learning Tutorial for Beginners
Machine Learning Tutorial for BeginnersMachine Learning Tutorial for Beginners
Machine Learning Tutorial for Beginners
ย 
Barga, roger. predictive analytics with microsoft azure machine learning
Barga, roger. predictive analytics with microsoft azure machine learningBarga, roger. predictive analytics with microsoft azure machine learning
Barga, roger. predictive analytics with microsoft azure machine learning
ย 
Training_Report_on_Machine_Learning.docx
Training_Report_on_Machine_Learning.docxTraining_Report_on_Machine_Learning.docx
Training_Report_on_Machine_Learning.docx
ย 
Data Mining Framework for Network Intrusion Detection using Efficient Techniques
Data Mining Framework for Network Intrusion Detection using Efficient TechniquesData Mining Framework for Network Intrusion Detection using Efficient Techniques
Data Mining Framework for Network Intrusion Detection using Efficient Techniques
ย 
Machine Learning
Machine LearningMachine Learning
Machine Learning
ย 

Guy Riese Literature Review

  • 1. AN INTRODUCTORY REVIEW OF MACHINE LEARNING ALGORITHMS AND THEIR APPLICATION TO DATA MINING IMPERIAL COLLEGE LONDON AN INTRODUCTORY REVIEW OF MACHINE LEARNING ALGORITHMS AND THEIR APPLICATION TO DATA MINING DEPARTMENT OF MECHANICAL ENGINEERING GUY RIESE 19/12/2014
  • 2. AN INTRODUCTORY REVIEW OF MACHINE LEARNING ALGORITHMS AND THEIR APPLICATION TO DATA MINING i Abstract This review aims to provide an introduction to machine learning by reviewing literature on the subject of supervised and unsupervised machine learning algorithms, development of applications and data mining. In supervised learning, the focus is on regression and classification approaches. ID3, bagging, boosting and random forests are explored in detail. In unsupervised learning, hierarchical and K-means clustering are studied. The development of a machine learning application starts by collecting and preparing data, then choosing and training an algorithm and finally using your application. Large data sets are on the rise with growing use of the World Wide Web, opening up opportunities in data mining where it is possible to extract knowledge from raw data. It is found that machine learning has a vast range of applications in everyday life and industry. The elementary introduction provided by this review offers the reader a sound foundational basis with which to begin experimentation and exploration of machine learning applications in more depth.
  • 3. AN INTRODUCTORY REVIEW OF MACHINE LEARNING ALGORITHMS AND THEIR APPLICATION TO DATA MINING ii Contents 1 Introduction..............................................................................................................................1 1.1 Objectives......................................................................................................................... 2 2 Supervised Machine Learning Algorithms ............................................................................. 2 2.1 Regression ........................................................................................................................ 3 2.2 Classification Decision Tree Learning............................................................................. 5 2.2.1 ID3.............................................................................................................................6 2.2.2 Bagging and Boosting............................................................................................... 7 2.2.3 Random Forests........................................................................................................9 3 Unsupervised Machine Learning Algorithms.........................................................................9 3.1 Clustering........................................................................................................................10 3.1.1 Hierarchical Clustering: Agglomerative and Divisive............................................10 3.1.2 K-means....................................................................................................................11 4 Steps in developing a machine learning application............................................................. 13 4.1 Collect Data..................................................................................................................... 13 4.2 Choose Algorithm........................................................................................................... 13 4.3 Prepare Data.................................................................................................................... 13 4.4 Train Algorithm ..............................................................................................................14 4.5 Verify Results ..................................................................................................................14 4.6 Use Application...............................................................................................................14 5 Data Mining............................................................................................................................15 6 Discussion...............................................................................................................................16 6.1 Literature.........................................................................................................................16 6.2 Future Developments .....................................................................................................16 7 Conclusion ..............................................................................................................................17 8 References...............................................................................................................................17 9 Acknowledgements ............................................................................................................... 22
  • 4. AN INTRODUCTORY REVIEW OF MACHINE LEARNING ALGORITHMS AND THEIR APPLICATION TO DATA MINING 1 1 Introduction Computers solve problems using algorithms. These algorithms are step-by-step instructions for the computer to sequentially follow in processing a set of inputs into a set of outputs. These algorithms are typically written line-by-line by computer programmers. But what if we donโ€™t have the expertise or fundamental understanding to be able to write the algorithm for a program? For example, consider filtering spam emails from genuine emails (Alpaydin, 2010). For this problem, we know the input (an email) and the output (identifying it as spam or genuine) but we donโ€™t know what actually classifies it as a spam email. This lack of understanding often arises when there is some intellectual human involvement in the problem we are trying to solve. In this example, the human involvement is that a human wrote the original spam email. Similarly, humans are involved in handwriting recognition, natural language processing and facial recognition. It is clear that these problems are something that our subconscious is able to handle effortlessly yet we donโ€™t consciously understand the fundamentals of the process. For sequential logical tasks, like sorting a list alphabetically, we consciously understand the fundamental process and therefore can program a solution (algorithm). But this isnโ€™t possible for more complex tasks where the process is more of an unknown โ€˜black boxโ€™. Machine learning is what gives us the tools to solve these โ€˜black boxโ€™ problems. โ€œWhat we lack in knowledge, we make up for in dataโ€ (Alpaydin, 2010). Using the spam example, we can use a data set of millions of emails, some of which are spam, in order to โ€˜learnโ€™ what defines a spam email. The learning principles are derived from statistical approaches to data analysis. In this way, we do not need to understand the process but we can construct an accurate and functional model (a โ€˜black boxโ€™) to approximate the process. Whilst this doesnโ€™t explain the fundamental process, it can identify some patterns and regularities that allow us to reach solutions. Artificial intelligence was conceived in the mid-20th century but it was not until the 1980s that the more statistical branch, machine learning, began to separate off and become a field in its own right (Russell, 2010). Machine learning developed a scientific approach to solving problems of prediction and finding patterns in data. This quickly had value in industry which fuelled the academic exploration further. But entering the 21st century we have seen rapid rise in machine learning popularity. This is largely due to the emergence of large data sets and the demand for data mining processes to extract knowledge from them. Machine learning has since established itself as a leading field of computer science with applications ranging from detecting credit card fraud to medical diagnosis. Data mining is the process of mining data in order to extract knowledge (Kamber, 2000). With the rise of large data sets (โ€˜big dataโ€™), data mining has thrived. Data mining tasks can be categorised as either descriptive tasks or predictive tasks. A descriptive task involves extracting qualitative characteristics of data. For example, if you have a database of customers and want to segment the customers into groups in order to find trends within those groups. A predictive task involves using the existing data to be able to make predictions on future data inputs. 
For example, how can we learn from our existing customers which products might be favoured by a new customer?
  • 5. AN INTRODUCTORY REVIEW OF MACHINE LEARNING ALGORITHMS AND THEIR APPLICATION TO DATA MINING 2 Machine learning is a vast subject with masses of literature. One of the main challenges in understanding machine learning is knowing where to start. This review will introduce the two main approaches of machine learning: supervised and unsupervised learning. We consider some of the more generalist and flexible machine learning algorithms in these categories relevant to data mining and introduce some methods of optimising them. Additionally, this review will indicate the steps to develop a machine learning application to solve a specific problem. Finally we relate this theory and practical understanding to the application of data mining. With this knowledge, the reader will have a strong machine learning foundation to enable them to approach problems and interpret relevant research themselves. 1.1 Objectives 1. Understand the background of Machine Learning. What are some of the key approaches and applications? 2. Understand some of the different mechanisms behind Machine Learning processes. 3. Explore machine learning algorithms and the decision making process of a machine learning program. 4. How do you develop a machine learning application? 5. Case/Application Focus: Investigate machine learning in relation to data mining. 6. Briefly discuss key areas for future development of this technology. 2 Supervised Machine Learning Algorithms The aim of a supervised machine learning algorithm is to learn how inputs relate to outputs in a data set and thereby produce a model able to map new inputs to inferred outputs (Ayodele, 2010). Therefore, a complete set of training data is prerequisite for any supervised learning task. A general equation for this can be defined as follows (Alpaydin, 2010): ๐‘ฆ = โ„Ž (๐‘ฅโ”‚๐œƒ) Eq. 2.1 Where the output, ๐‘ฆ, is equal to the function, โ„Ž, which is a function of the inputs, ๐‘ฅ, and the features, ๐œƒ. The role of the supervised machine learning algorithm is to optimise the parameters ( ๐œƒ) by minimising the approximation error and thereby producing the most accurate outputs. In laymanโ€™s terms, this means that existing โ€˜right answersโ€™ are used to predict new answers to the problem; it learns from examples (Russell, 2010). We are unequivocally telling the algorithm what we want to know and actively training it to be able to solve our problem. Supervised learning consists of two fundamental stages; i) training and ii) prediction. Building a bird classification system is a problem that can be solved with a supervised machine learning algorithm (Harrington, 2012). Start by taking characteristics of the object you are trying to classify, called features or attributes. For a bird classification system, these could be weight, wingspan, whether feet are webbed and the colour of its back. In reality, you can have an infinite number of features rather than just four (Ng, 2014). The features can be of different types. In this example, weight and wingspan are numeric (decimal), whether feet are webbed is simply yes or no (binary) and if you choose a selection of say 7 different colours then each โ€˜back colourโ€™ would just be an integer. According to Eq. 3.4, we want to find a function (โ„Ž) which we can use
  • 6. AN INTRODUCTORY REVIEW OF MACHINE LEARNING ALGORITHMS AND THEIR APPLICATION TO DATA MINING 3 to determine the bird species (๐‘ฆ) given inputs of particular features (๐‘ฅ). To achieve this, we require training data (i.e. data on the weight, wingspan, etc. of a number of bird species). The training data is used (stage (i)) to determine the parameters (๐œƒ) which can be used to define a function โ„Ž. Itโ€™s unlikely this will be perfectly accurate, so we can compare the outputs from our function on a test set (where we secretly already know the true outputs) in order to measure the accuracy. Provided the function is accurate, we can use our model to predict bird species given new inputs of weight, wingspan etc., perhaps entered by users trying to identify a bird (stage (ii)). This example is extremely simplistic and leaves many questions unanswered such as how do we choose the features, how do we reach a definition for the model/function โ„Ž, how do we optimise our algorithm for maximum accuracy and how could we deal with imperfect training data (noise)? The sections which follow will seek to answer these questions. Regression and classification are both supervised learning tasks where a model is defined with a set of parameters. A regression solution is appropriate when the output is continuous, whereas a classification solution is used for discrete outputs (Ng, 2014; Harrington, 2012). 2.1 Regression In regression analysis the output is a random variable (๐‘ฆ) and the input the independent variable (๐‘ฅ). We seek to find the dependence of ๐‘ฆ on ๐‘ฅ. The mean dependence of ๐‘ฆ on ๐‘ฅ will give us the function and model (โ„Ž) that we are seeking to define (Kreyszig, 2006). The most basic form of regression using just one independent variable is called univariate linear regression. This can be used to produce a straight line function: โ„Ž(๐‘ฅ) = ๐œƒ0 + ๐œƒ1 ๐‘ฅ Eq. 2.2 By finding ๐œƒ0 and ๐œƒ1 it is therefore possible to fully define the model. In seeking to choose ๐œƒ0 and ๐œƒ1 so that โ„Ž is as close to our (๐‘ฅ,๐‘ฆ) values as possible, we must minimise the Gauss function of squared errors (Stigler, 1981; Freitas, 2013; Beyad & Maeder, 2013): ๐ฝ(๐œƒ0, ๐œƒ1) = โˆ‘(โ„Ž(๐‘ฅ๐‘–) โˆ’ ๐‘ฆ๐‘–)2 ๐‘› ๐‘–=1 Eq. 2.3 To minimise this function, we can apply the gradient descent algorithm known as the method of steepest descent (Ng, 2014; Bartholomew- Biggs, 2008; Kreyszig, 2006; Snyman, 2005; Akaike, 1974). Gradient descent is a numerical method used to minimise a multivariable function by iterating away from a point along the direction which causes the largest decrease in the function (the direction with the most negative gradient or โ€˜downwards steepnessโ€™). The equation for gradient descent is as follows: ๐œƒ๐‘— = ๐œƒ๐‘— โˆ’ ๐›ผ ๐œ• ๐œ•๐œƒ๐‘— ๐ฝ(๐œƒ0, ๐œƒ1) Eq. 2.4 Figure 2.1. Gradient descent. (Kreyszig, 2006)
  • 7. AN INTRODUCTORY REVIEW OF MACHINE LEARNING ALGORITHMS AND THEIR APPLICATION TO DATA MINING 4 Where j = 0, 1 for this case of two unknowns. ๐›ผ is the step size taken and is known as the learning rate. The value of the learning rate determines a) whether gradient descent converges to the minimum or not and b) how quickly it converges. If the learning rate is too small, gradient descent can be slow. On the other hand, if the learning rate is too large, the steps taken may be too large resulting in overshoot and missing of the minimum. Figure 2.1 illustrates gradient descent from a starting point of ๐‘ฅ0 = ๐œƒ๐‘— 0 iterating to ๐‘ฅ1 = ๐œƒ๐‘— 1 and ๐‘ฅ2 = ๐œƒ๐‘— 2 . Eventually this will reach the minimum which lies at the centre of the innermost circle. An analogy to gradient descent is the idea of walking on the side of a hill in a valley surrounded by thick fog. The aim is to get to the bottom of the valley. Even though you cannot see where the bottom of the valley is, as long as each step you take is sloping downwards, you will certainly reach the bottom. Gradient descent is not the fastest minimisation method, however, it offers a distinct approach which is repeatedly used in many machine learning optimisation problems. Furthermore, it scales well with larger data sets (Ng, 2014) which is a significant factor in real life applications. Sub-gradient projection is a possible alternative to the descent method, however, it is typically slower than gradient descent (Kiwiel, 2001). With an appropriate learning rate, gradient descent serves as a reliable and effective tool for minimisation problems. Hence by finding values for the parameters (๐œƒ๐‘—) we are able to find an equation for the model (โ„Ž). If this model can predict values of ๐‘ฆ for novel examples, we say that it โ€˜generalisesโ€™ well (Russell, 2010). In this example, we have applied only linear regression (a 1-degree polynomial). It is possible to increase the hypothesis (โ„Ž) to a polynomial of a higher degree whereby the fit is more accurate (curved). However, as you increase the degree of the polynomial, you increase the risk of over-fitting the data; there is a balance to be reached between fitting the training data well and producing a model that generalises the data better (Sharma, Aiken & Nori, 2014). The main approach for dealing with this problem is to use the principle of Ockhamโ€™s razor: use the simplest hypothesis consistent with the data (Allaby, 2010). For example, a 1-degree polynomial is simpler than a 7-degree polynomial, so although the latter may fit training data better, the former should be preferred. It is possible to further simplify models by reducing the number of features being considered. This is achieved by discarding features which do not appear relevant (Ng, 2014; Russell, 2010). Regression is a simple yet powerful tool which can be used to teach a program to understand data inputs and accurately predict data outputs through machine learning processes.
  • 8. AN INTRODUCTORY REVIEW OF MACHINE LEARNING ALGORITHMS AND THEIR APPLICATION TO DATA MINING 5 2.2 Classification Decision Tree Learning Decision trees are a flowchart-like method of classifying a set of data inputs. The input is a vector of features and the output is a single and unified โ€˜decisionโ€™ (Russell, 2010). This means that the output is binary; it can either be true (1) or false (0). A decision tree performs a number of tests on the data by asking questions about the input in order to filter and categorise it. This is a natural way to model how the human brain thinks through solving problems; many troubleshooting tools and โ€œHow-Toโ€ manuals are structured like decision trees. It begins at the root node, extends down branches through nodes of classification tests (decision nodes) and finally ends at a node representing a โ€˜leafโ€™ (terminal nodes) (Criminisi & Shotton, 2013). The aim is to develop a decision tree using training data which can then be used to interpret and classify novel data for which the classification is unknown. The first step in the decision tree learning process is to induce or โ€˜growโ€™ a decision tree from initial training data. We take input features/attributes and transform these into a decision tree based on provided example outputs in training data. In the example in Figure 2.2, the features are Patrons (how many people are currently sitting in the restaurant), WaitEstimate (the wait estimated by the front of house), Alternate (whether there is another restaurant option nearby), Hungry (whether customer is already hungry) and so on. The output is a decision on whether to wait for a table or not. The decision tree learning algorithm employs a โ€˜greedyโ€™ strategy of testing the most divisive attribute first (Russell, 2010). Each test divides the problem up further into sub-problems which will eventually classify the data. It is important that the training data set is as complete as possible in order to prevent decision trees being induced with mistakes. If the algorithm does not have an example for a particular scenario (e.g. WaitTime of 0-10 minutes when Patrons is full) then it could output a tree which consistently makes the wrong decision for this scenario. One of the mathematical ways in which decision tree divisions are quantifiably scored is with the measure of Information Gain (๐ผ๐‘›๐‘“๐‘œ๐บ๐‘Ž๐‘–๐‘›) (Myles et al., 2004; Mingers, 1989). ๐ผ๐‘›๐‘“๐‘œ๐บ๐‘Ž๐‘–๐‘› is a mathematical tool for measuring how effectively a decision node divides the example data. This is based on the concept of information (๐ผ๐‘›๐‘“๐‘œ) defined by Eq. 2.5 (Myles et al., 2004): ๐ผ๐‘›๐‘“๐‘œ = โˆ’ โˆ‘ ( ๐‘๐‘—(๐‘ก) ๐‘(๐‘ก) ) log2 ( ๐‘๐‘—(๐‘ก) ๐‘(๐‘ก) ) ๐‘— Eq. 2.5 Where ๐‘๐‘—(๐‘ก) is number of examples in category ๐‘— at the node ๐‘ก and ๐‘(๐‘ก) is the number of examples at the node ๐‘ก. The maximum change in information by being processed by a decision node is defined by Eq. 2.6 (Myles et al., 2004): ๐ผ๐‘›๐‘“๐‘œ๐บ๐‘Ž๐‘–๐‘› = ๐ผ๐‘›๐‘“๐‘œ(๐‘ƒ๐‘Ž๐‘Ÿ๐‘’๐‘›๐‘ก) โˆ’ โˆ‘(๐‘ ๐‘˜)๐ผ๐‘›๐‘“๐‘œ(๐ถโ„Ž๐‘–๐‘™๐‘‘ ๐‘˜) ๐‘˜ Eq. 2.6 Figure 2.2. A decision tree for deciding whether to wait for a table. (Russell, 2010)
  • 9. AN INTRODUCTORY REVIEW OF MACHINE LEARNING ALGORITHMS AND THEIR APPLICATION TO DATA MINING 6 Where ๐‘ ๐‘˜ is the proportion of examples that are filtered into the ๐‘˜th category. The optimal decision node is therefore the node which maximises this โ€˜change in informationโ€™. Despite this quantification, there are usually several decision trees which are capable of classifying the data. To choose the optimal decision tree, inductive bias is employed (Mitchell, 1997). The inductive bias depends on the particular type of decision tree algorithm and will be explored in Section 2.2.1. Once a decision tree has been grown, the decision tree algorithm may prune the tree (Russell, 2010; Myles et al., 2004). This combats overfitting whilst dealing with noisy data by removing irrelevant decision nodes (Quinlan, 1986). The algorithm must also separately identify and remove features which do not aid the division of examples. The chi-squared significance test is the statistical method employed for this (supported by both (Quinlan, 1986) and (Russell, 2010)) known as chi-squared pruning. The data is analysed with the null hypothesis of โ€˜no underlying patternโ€™. The extent at which degree of deviation occurs in novel data compared to the training data is calculated and a cut off of say 5% significance is applied. In this way, noise in the training data is handled and the tree design is optimised. Multiple decision tree algorithms exist, exhibiting a variety of approaches. However, the most effective use of them is to combine their methodology into an ensemble algorithm in order to obtain better predictive performance than any of the individual algorithms alone. Section 2.2.1 will explore the ID3 decision tree learning algorithm which aims to induce the simplest possible tree. Sections 2.2.2 and 2.2.3 explore some ensemble methods to machine learning. 2.2.1 ID3 The majority of classification decision tree learning algorithms are variations on an original central methodology first proposed as the ID3 algorithm (Quinlan, 1986) and later refined to the C4.5 algorithm (Quinlan, 1993). The characteristics of decision tree algorithms discussed previously apply to ID3, but it has some subtleties and limitations too. One of these is that pruning does not apply to ID3 as it does not re-evaluate decision tree solutions after it has selected one. Instead, the approach taken by the ID3 algorithm is to iterate with a top-down greedy search method through all the possible decision tree outputs from the simplest possible solution gradually increasing complexity until the first valid solution. Each decision tree output is known as a hypothesis and are effectively different possible solutions to the model or function โ„Ž. This unidirectional approach works to reach a consistently satisfactory decision tree without expensive computation (Quinlan, 1986). However, it implies the algorithm never backtracks to reconsider earlier choices (Mitchell, 1997). The core decision making lies in deciding which attribute makes the optimal decision node at each point. This is solved using the statistical property ๐ผ๐‘›๐‘“๐‘œ๐บ๐‘Ž๐‘–๐‘› discussed earlier. ID3โ€™s approach is known as a hill-climbing search, starting with Figure 2.3. Searching through decision tree hypotheses from simplest to increasing complexity as directed by information gain. (Mitchell, 1997)
  • 10. AN INTRODUCTORY REVIEW OF MACHINE LEARNING ALGORITHMS AND THEIR APPLICATION TO DATA MINING 7 empty space and building a decision tree from the top down. This approach has advantages and disadvantages (Mitchell, 1997; Quinlan, 1986). It can be considered a positive capability that ID3 in theory considers all possible decision tree permutations. Some other algorithms take a major risk of evaluating only a portion of the search space in order to leverage greater speed, but this can lead to inaccuracy. On the other hand, a problem in ID3 is the โ€˜goldfish memoryโ€™ approach of only considering the current decision tree hypothesis at any one time. This means that it does not actually calculate how many viable different decision trees there are, it simply picks the first it reaches, making pruning post- selection redundant. We consider ID3 an important algorithm to understand because it serves as a core algorithm that many extensions have developed from. It can easily be modified to utilise pruning and handle noisy data as well as optimised for less common conditions. It is important to consider why ID3โ€™s inductive bias towards simpler decision trees is optimal. Ockhamโ€™s razor approach (Allaby, 2010) advises giving preference to the simplest hypothesis that fits the data. But stating this does not make it optimal. Why is the simplest solution the best choice? It can be argued that scientists tend to follow this bias, possibly because it is less likely that a simpler solution is going to coincide with being the correct solution unless it is a perfectly accurate generalisation (what we aim to reach in machine learning) (Mitchell, 1997). Also, there is evidence that this approach will be consistently faster at reaching the solution due to only considering a portion of the data set (Quinlan, 1986). On the other hand, there are contradictions in this approach. It is entirely possible to obtain two different solutions with the exact same data by taking this approach, simply if the iterations by ID3 take two different paths. This is likely to be acceptable in most applications but may be a crucial complication for others (Mitchell, 1997). The C4.5 algorithm (Quinlan, 1993) extended the original ID3 algorithm with increased computational efficiency, ability to handle training data with missing attributes, ability to handle continuous attributes (rather than just discrete) and various other improvements. One of the most significant modifications allowed a new approach to determining the optimal decision tree solution. Choosing the first simple valid solution can be problematic if there is noise in the data. This is solved by allowing production of trees which overfit the data and then pruning them post-induction. Despite a longer sounding process, this new solution was found to be more successful in practice (Mitchell, 1997). The ID3 algorithm can be considered a basic but effective algorithm for building decision trees. With refinement to the C4.5 algorithm, it is competent at producing an adequate solution without requiring vast computing resources. For this reason, it is extremely well supported and commonly implemented across numerous programming languages. It is considered a highly credible algorithm used in engineering (Shao et al., 2001), aviation (Yan, Zhu & Qiang, 2007) and wherever automated or optimal decision making processes are required. 
2.2.2 Bagging and Boosting

Bagging and boosting are ensemble techniques, which means they use multiple learning algorithms to improve the overall performance of the machine learning system (Banfield et al., 2007). In decision tree learning, this helps to produce the optimal decision tree (rather than just a valid one). The optimal decision tree is the one with the lowest error rate in predicting outputs $y$ for data inputs $x$ (Dietterich, 2000a). Bagging and boosting improve performance by manipulating the training data before it is fed into the algorithm.
Bagging is an abbreviation of 'Bootstrap AGGregatING' (Pino-Mejías et al., 2004) and was first developed by Leo Breiman in 1994. Bagging takes subset samples from the full training set to produce groups of training sets called 'bags' (Breiman, 1996). The key methodology of bagging is to take $m$ examples, with replacement, from the original training set. Each bag ends up containing approximately 63.2% of the original training set (Dietterich, 2000a).

Boosting was first developed by Freund and Schapire in 1995 and similarly manipulates the training examples in order to improve the performance of the decision tree learning algorithm (Freund & Schapire, 1996; Freund & Schapire, 1995; Freund, 1995). The key differentiator of boosting is that it assigns each example a weight proportional to the error made when predicting on that example (Banfield et al., 2007). Misclassified examples are given an incrementally greater weighting in each iteration of the algorithm. In subsequent iterations, the algorithm focuses on the examples with greater weighting, favouring examples which are harder to classify over those which are consistently classified correctly.

Breiman (1996) identified that bagging improves the performance of unstable learning algorithms but tends to reduce the performance of more stable algorithms. Decision tree learning algorithms, neural networks and rule learning algorithms are all unstable, whereas linear regression and K-nearest neighbour algorithms (Larose, 2005) are very stable. The improvements offered by bagging and boosting are therefore very relevant to decision tree learning. But why do bagging and boosting improve the performance of unstable algorithms whilst degrading stable ones? The main components of error in machine learning algorithms can be summarised as noise, bias and variance. An unstable learning algorithm is one where small changes in the training data cause significant fluctuation in the response of the algorithm, i.e. high variance (Dietterich, 2000a). In both bagging and boosting, the training data set is perturbed and the resulting classifying models are combined, which reduces the variance and hence makes the algorithm more stable (Skurichina & Duin, 1998). The effect of this is to shift the focus of the algorithm to the most relevant region of the training data. On the other hand, applying the same treatment to an already stable algorithm makes no difference, except that fewer examples are considered to reach the same solution.

A machine learning algorithm is considered accurate if it produces a model $h$ with an accuracy greater than 1/2 (i.e. the decision tree is more accurate than making each decision as a 50/50 split). Algorithms are tested to this limit by adding noise to the training data. Noisy data is training data which contains mislabelled examples. Noise is problematic for boosting and has been shown to considerably reduce its classification performance (Long & Servedio, 2009; Dietterich, 2000b; Dietterich, 2000a; Freund & Schapire, 1996). This poor performance is intuitive, given that the boosting method converges on the hardest-to-classify data; mislabelled data is obviously the hardest to classify and fruitless to focus on, hence the fatal flaw of boosting.
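As a minimal illustration of the bagging step (sampling with replacement to form a 'bag'), the NumPy sketch below draws bootstrap samples and checks the proportion of unique original examples in each bag, which tends towards roughly 63.2% as the data set grows; the data set size and number of bags used here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples = 10_000                      # size of the original training set
n_bags = 20                              # number of bootstrap bags to draw

proportions = []
for _ in range(n_bags):
    # Sample n_examples indices with replacement: one bootstrap "bag".
    bag = rng.integers(0, n_examples, size=n_examples)
    # Fraction of the original examples that appear at least once in this bag.
    proportions.append(len(np.unique(bag)) / n_examples)

# Each bag contains roughly 1 - (1 - 1/n)^n, i.e. about 63.2%, of the original examples.
print(f"mean proportion of unique examples per bag: {np.mean(proportions):.3f}")
```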
Critically, Long & Servedio (2009) showed that the most common boosting algorithms, such as AdaBoost and LogitBoost, fell to accuracies of less than 1/2 on high-noise data, rendering them meaningless. Conversely, when directly comparing the effectiveness of bagging and boosting, Dietterich (2000b) found that bagging was 'clearly' the best method. Bagging actually uses the noise to generate a more diverse collection of decision tree hypotheses, so introducing noise to the training data serves only to improve its accuracy. However, experimental results have shown that when there is no noise in the training data, boosting gives the best results (Banfield et al., 2007; Lemmens & Croux, 2006; Dietterich, 2000b; Freund & Schapire, 1996). In conclusion, when deciding between these methods, an important factor to consider is confidence in the consistency of the training data being provided: boosting is ideal when the data is clean, but bagging is the more consistent performer.
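This noise sensitivity can be explored empirically. The sketch below is an illustration only, not a reproduction of the cited experiments: it flips a fraction of training labels and compares bagged and boosted decision trees using scikit-learn; the synthetic data set, 10% noise level and any resulting scores are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, split into train and test sets.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Mislabel 10% of the training examples to simulate noisy data.
rng = np.random.default_rng(0)
noisy = rng.choice(len(y_train), size=len(y_train) // 10, replace=False)
y_noisy = y_train.copy()
y_noisy[noisy] = 1 - y_noisy[noisy]

# Compare a bagged ensemble and a boosted ensemble trained on the noisy labels.
for name, model in [("bagging", BaggingClassifier(n_estimators=100, random_state=0)),
                    ("boosting", AdaBoostClassifier(n_estimators=100, random_state=0))]:
    model.fit(X_train, y_noisy)
    print(name, model.score(X_test, y_test))
```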
A possible solution to this dilemma is explored by Alfaro, Gamez & Garcia (2013), where features of both bagging and boosting are combined in the design of a new classification decision tree learning algorithm: adabag. The common goal of both bagging and boosting is to improve accuracy by modifying the training data. Based on the AdaBoost algorithm, adabag allows the error to be analysed as the ensemble is grown, reducing the problem with noise. Bagging and boosting are effective techniques for improving the predictive performance of machine learning algorithms when applied to decision tree learning. By generating an ensemble of decision trees and finding the optimal hypothesis analytically, accuracy is increased.

2.2.3 Random Forests

Random forests are another ensemble learning technique used to improve the performance of algorithms in decision tree learning. The algorithm was originally developed by Breiman (2001), who holds the trademark on the term, as an improvement on his previous technique, bagging (Breiman, 1996). Instead of choosing one optimal decision tree, random forests use many trees and take the modal (majority-vote) hypothesis as the result. Although there is no single best algorithm for every situation (Wolpert & Macready, 1997), random forests have proved to be a general top performer without requiring tuning or adjustment, and notably outperform both bagging and boosting on accuracy and speed (Banfield et al., 2007; Svetnik et al., 2003; Breiman, 2001).

Breiman (2001) found that random forests favourably share the noise-proof properties of bagging. When compared against AdaBoost, random forests showed little deterioration with 5% noise, whereas AdaBoost's performance dropped markedly. This is because the random forest technique does not increase weights on specific subsets, so the added noise has negligible effect, whilst AdaBoost's convergence on mislabelled examples causes its accuracy to spiral. This being said, there is always room for improvement and the random forest technique is by no means perfect. The voting mechanism of the decision trees in the random forest is one possible area for improvement (Robnik-Sikonja, 2004). Margin is a measure of how much a particular hypothesis is favoured over the other hypotheses from the random forest's decision trees. By weighting each hypothesis vote with the margin, Robnik-Sikonja (2004) found that the prediction accuracy of random forests improves significantly.

Decision trees are a natural choice in the development of machine learning programs. Within decision tree learning there are a number of different algorithms and techniques, including the ones explored here plus others such as CART, CHAID and MARS. Decision trees are important because they perform well with large data sets and are intuitive to use. Furthermore, techniques such as random forests can improve the robustness, accuracy and speed of the learning method.
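As a minimal, hedged illustration of using a random forest in practice, the sketch below uses scikit-learn's implementation (which follows Breiman's method, rather than being Breiman's original code); the built-in iris data set and forest size are illustrative choices only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# A forest of 100 trees: each tree is grown on a bootstrap sample of the data
# with a random subset of features considered at each split, and the forest
# takes a majority vote across the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```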
3 Unsupervised Machine Learning Algorithms

In unsupervised learning, the onus of learning falls even more heavily on the computer program than on the developer. Where in supervised learning you have a full set of inputs and outputs in your data, in unsupervised learning you only have inputs. The machine learning algorithm must use this input data alone to extract knowledge. In statistics, the equivalent problem is known as density estimation: the problem of finding any underlying structure in the unlabelled data (Alpaydin, 2010).

3.1 Clustering

The main unsupervised learning method is clustering: finding groups within the input data set. For example, a company may want to group its current customers in order to target groups with relevant new products and services. To do this, the company could take its database of customers and use an unsupervised clustering algorithm to divide it into customer segments, then use the results to build better relationships with its customers. In addition to identifying groups, the algorithm will identify outliers which sit outside these groups. These outliers might reveal a niche that would not otherwise have been noticed. There are over 100 published clustering algorithms. This review will focus on the two most used approaches to clustering: hierarchical clustering and K-means clustering.

3.1.1 Hierarchical Clustering: Agglomerative and Divisive

As suggested by the name, hierarchical clustering clusters in hierarchies. Each level of clusters in the hierarchy is a combination of the clusters below it, whereby the 'clusters' at the bottom of the hierarchy are individual observations and the top cluster contains the entire data set (Hastie, 2009). Hierarchical clustering is split into two sub-approaches: agglomerative (bottom-up) and divisive (top-down), as in Figure 3.1. In the agglomerative approach, clusters start out as individual data inputs and are merged into larger clusters until one cluster containing all the inputs is reached. Divisive is the reverse, starting with the cluster containing all data inputs and subdividing into smaller clusters until individual inputs are reached or a termination condition is met, such as the distance between the two closest clusters exceeding a certain amount (Kamber, 2000). The most common form of hierarchical clustering is agglomerative. Dendrograms provide a highly comprehensible way of interpreting the structure of a hierarchical clustering algorithm in graphical format, as illustrated in Figure 3.2.

[Figure 3.1: Agglomerative and divisive hierarchical clustering (Gan, 2007).]
[Figure 3.2: Dendrogram from the agglomerative (bottom-up) clustering technique, based on data on human tumours (Hastie, 2009).]

Agglomerative hierarchical methods are broken down into single-link methods, complete-link methods, centroid methods and more. The difference between these methods is how the distance between clusters/groups is measured. The single-link method, also known as nearest neighbour clustering (Rohlf, 1982), can be defined by the following distance linkage function $D$ (Gan, 2007):

$D(C, C') = \min_{x \in C,\ y \in C'} d(x, y)$   (Eq. 3.1)
where $C$ and $C'$ are two nonempty and non-overlapping clusters. The Euclidean distance (Gan, 2007) for $n$ dimensions is:

$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$   (Eq. 3.2)

This is used in the agglomerative approach to find the clusters/groups with the minimum Euclidean distance between them, which are joined for the next level up in the hierarchy. This procedure repeats until all clusters are encompassed by one cluster of the entire data set.

One of the main reasons hierarchical clustering is such a popular approach is the easily human-interpretable dendrogram format with which it can be represented (Hastie, 2009). Additionally, any reasonable method of measuring the distance between clusters can be used, provided it can be applied to matrices. However, hierarchical clustering occasionally encounters difficulty with merge/split points (Kamber, 2000). In a hierarchical structure this is critical, as every point following a merge/split is derived from that decision; if the decision is made poorly, the entire output will be low quality. A number of hierarchical methods built on the fundamentals of this approach have been designed to solve the typical issues it is prone to, including BIRCH (Zhang, Ramakrishnan & Livny, 1997) and CURE (Qian, Shi & Wang, 2002).

Hierarchical clustering is a simple but extremely flexible approach for applying unsupervised learning to any data set. It can be used as an assistive tool to allow specialists to make best use of their skill. For example, in medical applications such as the analysis of EEG recordings, hierarchical clustering is used to identify and group sections that are alike whilst the neurologist evaluates the medical meaning of these areas (Guess & Wilson, 2002). In this way, the work is delegated to make best use of each individual/component: the computer does the systematic analysis and the neurologist provides the medical insight.
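A minimal sketch of agglomerative single-link clustering using SciPy, which applies the linkage function of Eq. 3.1 with Euclidean distances (Eq. 3.2) and produces the hierarchy described above, might look as follows; the random two-group data is purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Two loose groups of 2-D points, purely for illustration.
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

# Agglomerative clustering with the single-link (nearest neighbour) criterion
# and Euclidean distances between observations.
Z = linkage(X, method="single", metric="euclidean")

# Cut the hierarchy into two flat clusters; the full hierarchy can be drawn
# with scipy.cluster.hierarchy.dendrogram(Z).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```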
3.1.2 K-means

K-means is one of the most common approaches to clustering. First demonstrated by MacQueen (1966), it is designed for quantitative data and defines clusters by a centre point (the mean). The algorithm begins with the initialisation phase, where the number of clusters/centres is fixed. The algorithm then enters the iteration phase, iterating the positions of these centres until they reach a final rest position (Gan, 2007). The final rest position occurs when the error function no longer changes significantly with further iterations. The algorithm is as follows (Hastie, 2009):

1. For a given set of $k$ clusters, $C$, minimise the total cluster variance of all data inputs with respect to $\{m_1, \ldots, m_k\}$, yielding the means of the current clusters.
2. Given the means of the current clusters $\{m_1, \ldots, m_k\}$, assign each data input to the closest (current) cluster mean.
3. Repeat until the assignments no longer change.

The function being minimised is as follows (Hastie, 2009):

$C^* = \min_{C,\ \{m_k\}_1^K} \sum_{k=1}^{K} N_k \sum_{C(i)=k} \lVert x_i - m_k \rVert^2$   (Eq. 3.3)

where $x$ represents the data inputs and $N_k = \sum_{i=1}^{N} I(C(i) = k)$. Therefore the $N$ data inputs are assigned to the $k$ clusters so that the distance between the data inputs and their cluster mean is minimised.

A key advantage of K-means is that it remains computationally effective even with large data sets. The computational complexity is linearly, rather than exponentially, proportional to the size of the data set (Hastie, 2009). However, it can still be slow on high-dimensional data beyond a critical size (Harrington, 2012; Hastie, 2009).

The performance of K-means is heavily dependent on the initialisation phase. Not only must the number of clusters $k$ be defined, but also the initial positions of the centres. The number of clusters $k$ depends on the goal of the analysis and is usually well defined in the problem, for example creating $k$ customer segments or employing $k$ sales people. Alternatively, if this information is unavailable, a common 'rule of thumb' is to set $k$ in proportion to the number of inputs $N$ in the data set (Mardia, 1979):

$k \approx \sqrt{N/2}$   (Eq. 3.4)

For the algorithm to perform well, it is also important to take a reliable approach to defining the initial cluster means. Fortunately, popular solutions to this problem have been proposed: the Forgy approach (Anderberg, 1973), the MacQueen approach (MacQueen, 1966) and the Kaufman approach (Kaufman, 1990). In comparisons, the Kaufman approach has been found to generally produce the best clustering results (Peña, Lozano & Larrañaga, 1999). In the Kaufman approach, the initial cluster means are found iteratively: the starting point is the input data point closest to the centre of the data set, and subsequent centres are chosen at the input data points with the highest number of other data points around them.

One of the earliest applications of K-means was in signal and data processing. For example, it is used for image compression, where a 24-bit image with up to 16 million colours can be compressed to an 8-bit image with only 256 colours (Alpaydin, 2010). The problem is finding the optimal 256 colours out of the 16 million in order to retain image quality under compression; this is a problem of vector quantisation. K-means is still used for this application today.

[Figure 3.3: A demonstration of iterations of the K-means clustering algorithm for simulated input data points (Hastie, 2009).]
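The iterative procedure above is implemented in many libraries. The following is a minimal sketch using scikit-learn's KMeans; the synthetic data and the choice of k are illustrative only, and k-means++ initialisation is used here rather than the Forgy, MacQueen or Kaufman schemes discussed above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic blobs of 2-D points standing in for real input data.
X = np.vstack([rng.normal(loc, 0.5, (50, 2)) for loc in (0, 4, 8)])

# k is fixed up front; the algorithm then alternates between assigning inputs
# to the nearest mean and recomputing the means, minimising Eq. 3.3.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # final cluster means
print(kmeans.labels_[:10])       # cluster assignment of the first ten inputs
```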
The standard K-means algorithm serves its purpose well but suffers from some limitations and drawbacks. For this reason, it has been modified, extended and improved in numerous publications (Chen, Ching & Lin, 2004; Wagstaff et al., 2001). The techniques employed include a) finding better initial solutions (as discussed above), b) modifying the original algorithm and c) incorporating techniques from other algorithms into K-means. Wagstaff et al. (2001) recognised that the experimenter running the algorithm is likely to have some background knowledge of the data set being analysed. By communicating this knowledge to the algorithm, through additional constraints in the clustering process, Wagstaff et al. (2001) improved the accuracy of K-means from 58% to 98.6%. In a separate experiment, Chen, Ching & Lin (2004) found that incorporating techniques from hierarchical methods into K-means increased clustering accuracy. This literature shows that K-means is a versatile approach to clustering which can be tailored to specific problems in order to significantly improve its accuracy.

4 Steps in developing a machine learning application

So far this review has focused on the theoretical background of machine learning techniques. This section considers how to apply this theoretical knowledge practically to data-related problems in any field of work, from collecting data through to using the application (Harrington, 2012).

4.1 Collect Data

The first step is to collect the data you wish to analyse. Sources of data may include scraping a website, extracting information from an RSS feed or API, existing databases, running an experiment to collect data and other publicly available data.

4.2 Choose Algorithm

There are a huge number of machine learning algorithms available, so how do we choose the right one? The first decision is between supervised learning and unsupervised learning. If you are attempting to predict or forecast, you should use supervised learning; you will also need training data with a set of inputs connected to outputs. Otherwise, you should consider unsupervised learning. At the next level, choose between regression or classification (supervised learning) and clustering or density estimation (unsupervised learning). Finally, at the last level there are tens of different algorithms you could use under each of these categories. There is no single best algorithm for all problems (Harrington, 2012; Wolpert & Macready, 1997). Understanding the properties of the algorithms is helpful, but to find the best algorithm for your problem the practical strategy is to test different algorithms and choose by trial and error (Salter-Townshend et al., 2012).

4.3 Prepare Data

The next step is to prepare the data in a usable format. Certain algorithms require the features/training data to be formatted in a particular way, but this is trivial. The data first needs to be cleaned, integrated and selected (Zhang, Zhang & Yang, 2003; Kamber, 2000). Data cleaning involves filling in any missing values in features of the training data, removing noise, filtering out outliers and correcting inconsistent data. To fill in missing values, you can take a biased or unbiased approach: a biased example is to use a probable value to fill in the missing value, whereas an unbiased approach would simply remove the feature/example completely. The biased approach is popular when a large proportion of values are missing.
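As a brief, hypothetical illustration of this cleaning step, the pandas sketch below shows the two approaches to missing values just described: filling with a probable value (biased) versus dropping the incomplete examples (unbiased). The column names and values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical training data with missing feature values.
df = pd.DataFrame({
    "age":    [34, 29, np.nan, 41],
    "income": [52_000, 48_000, 61_000, np.nan],
    "label":  ["yes", "no", "yes", "no"],
})

# Biased approach: fill missing values with a probable value (here the column mean).
filled = df.fillna(df[["age", "income"]].mean())

# Unbiased approach: simply remove examples (rows) with missing values.
dropped = df.dropna()

print(filled)
print(dropped)
```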
Noise causes random error and variance in the data; this is reduced by binning (Shi & Yu, 2006) or clustering the data in order to isolate and remove outliers. Data integration is simply merging data from multiple sources. Data selection is the problem of selecting the right data from the sample to use as the training data set. Generally, the method of selecting data is heavily dependent on the type of data being filtered. However, Sun et al. (2013) explored an innovative generalised approach using dynamic weights for classification, putting a greater weight on data associated with the most features and eliminating redundant ones, with promising results.

4.4 Train Algorithm

Now that the data is cleaned and optimised, we can proceed to train the algorithm (for supervised learning). For unsupervised learning, this stage is simply running the algorithm on the data, as we do not have target values to train with. For both learning types, this is where the artificially intelligent 'machine learning' occurs and where the real value of machine learning algorithms is exploited (Russell, 2010). The output of this step is raw 'knowledge'.

4.5 Verify Results

Before using the new-found 'knowledge', it is important to verify or test it. In supervised learning, you can test the model you have created against your existing real data set to measure its accuracy. If it is not satisfactory, you can go back to the earlier data preparation stages and optimise. Verifying the accuracy of unsupervised learning algorithms is significantly more challenging and beyond the scope of this review.

4.6 Use Application

Finally, you can use the knowledge produced by your algorithm. Depending on the nature of your machine learning problem, the raw data output may be sufficient, or you may choose to produce visualisations of the results (Leban, 2013).

The beauty of machine learning is that we do not need to program a solution to the problem line by line; the machine learning algorithm will learn from data using statistical analysis instead. But the machine learning algorithm itself still needs to be developed. Fortunately, there is no single piece of software or programming language that you must use to prepare your machine learning application. The most commonly used languages and environments are Python, Octave, R and MATLAB (Ng, 2014; Freitas, 2013; Alfaro, Gamez & Garcia, 2013; Harrington, 2012). Python is one of the most widely used because of its clear syntax, simple text manipulation and established use throughout industries and organisations (Harrington, 2012). With this information, you are now equipped with the knowledge and practical know-how to develop a machine learning application.
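To tie the steps together, a compact, hypothetical example of the supervised workflow (prepare, train, verify, use) using scikit-learn might look like the following; the built-in data set and the particular algorithm chosen are illustrative assumptions only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Prepare: split the labelled data into a training set and a held-out test set.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Train: fit the chosen algorithm (here a decision tree) on the training data.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Verify: measure accuracy on data the model has not seen; if unsatisfactory,
# return to the data preparation or algorithm choice steps.
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")

# Use: apply the trained model to new inputs.
predictions = model.predict(X_test[:5])
print(predictions)
```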
5 Data Mining

In the last few centuries, innovation in the human species has accelerated rapidly. With the invention of the World Wide Web and the adoption of new technologies on a global scale, we are using technology like never before. The by-product of the Information Age is vast amounts of data, extending from terabytes into petabytes and exabytes, with immense hidden value (Goodman, Kamath & Kumar, 2007). The sheer size of databases and data sets makes it impossible for a human to comprehend or analyse them manually. Data mining is the use of machine learning approaches to extract underlying information and knowledge from data (Kamber, 2000). The knowledge extracted can contribute greatly to business strategies or to scientific and medical research.

The format of the knowledge extracted depends on the machine learning algorithm used. If supervised learning approaches are applied, it is possible to identify patterns in the data that can be used to model it (Kantardzic, 2011). Pattern recognition and learning is one of the most widely applied uses of data mining and machine learning. Unsupervised approaches are also used in data mining; unsupervised learning makes it possible to identify natural groupings in data. The main application of this in data mining is feature learning, whereby useful features are extracted from a large data set which can then be used for classification (Coates & Ng, 2012).

Applications of data mining can be seen in medicine, telecommunications, finance, science, engineering and more. For example, in medicine, machine learning is frequently used to improve the diagnosis of medical conditions such as cancer and schizophrenia. Data mining of clinical data such as MRI scans allows computers to learn how to recognise cancers and underlying conditions in new patients more reliably than doctors (Savage, 2012; Michalski, Bratko & Kubat, 1998). In finance, data mining is now being used to assist the evaluation of the credit risk of individuals and companies ahead of providing financial support through loans (Correia et al., 1993). This is arguably the most important stage in the process of offering a loan, but firms have previously struggled to accurately predict the risk of default. With the large data sets that have been accumulated in this domain, data mining is providing new insights and patterns to help financial organisations manage these risks accurately.

Data mining does not yet have any social stigma attached to it. However, there are ethical issues and social impacts of data mining. For example, web mining involves scraping data from the internet and mining it for knowledge (Etzioni, 1996). This data can often include personal data from web users, which is used for the profit of organisations (the web miners) (Van Wel & Royakkers, 2004). Current research suggests that no harm is currently being done to web users as a result of this, but with the rise of 'big data' there is a growing demand for regulation and for ensuring that the power of data mining is used for 'good' (Etlinger, 2014). As long as users remain in control and fully understand the data they offer when using the web, the threat to privacy can be neutralised. However, the risk of this line of consent and understanding becoming blurred is high. It is important for governments and organisations to acknowledge this and take a pro-active approach to regulation.
6 Discussion

6.1 Literature

In writing this review it has become clear that supervised machine learning algorithms essentially apply statistical approaches to data analysis in a scalable way. In fact, one of the best technical sources of information on regression and gradient descent was a mathematics textbook (Kreyszig, 2006), which provided a clear explanation of the techniques despite not directly relating them to machine learning. This demonstrates how far machine learning has come in its scientific and mathematical approach since originally branching out of artificial intelligence. The separation was originally caused by statistical approaches falling out of favour within artificial intelligence research. However, it turned out that within these statistical approaches (machine learning) lay the most practical discoveries and applications of all.

Unsupervised learning is perhaps more closely related to artificial intelligence. The frequently cited textbook by Russell (2010), titled 'Artificial Intelligence', actually served as an excellent source of insight into unsupervised machine learning algorithms, particularly hierarchical algorithms and the K-means approach. This is probably because unsupervised learning deals with the more mysterious type of data, affiliated with artificial intelligence: unlabelled data. Additionally, it seeks to extract knowledge, or 'intelligence', from this data. Unsupervised learning is particularly applicable to data mining through feature learning. With feature learning, it is possible to take a huge data set that is uninterpretable by humans and turn it into something on which intricate data analysis can be performed to obtain real value. It was surprising to find that with just the elementary principles covered in this review it is possible to get started on real machine learning applications, as became apparent when discussing the review with professionals in industry.

6.2 Future Developments

Machine learning is still a new scientific field with huge opportunities for growth and development. Rather than working only on large static data sets, it is important to devise methods of applying machine learning to transient data and data streams (Gama, 2012). There are significant challenges to address in maintaining an accurate decision model when the data used to develop that model is continually changing.

It has become clear that a bias-variance trade-off exists in supervised learning problems (Sharma, Aiken & Nori, 2014). Bias and variance are both sources of error. Ideally, the model should closely fit the training data but also generalise effectively to new data. Past research has focused on reducing the variance-related error. However, as data sets grow larger (Cambria et al., 2013), it is important to produce models which fit closely to these larger data sets; there is therefore a need to focus more specifically on bias-related error.

We now have access to more computational power than ever before. However, when comparing computing technology to the human brain, there is a clear discrepancy between the two in terms of how fast data is processed and how much energy is consumed to do so (Norvig, 2012).
A computer can process data 100 million times faster than the brain but requires 20,000 watts of power to do so, whereas the brain consumes just 20 watts. Yet machine learning systems are still only just managing to become as effective as the brain. We need to allocate resources to understanding the brain and using it to inspire circuit and machinery design in order to make artificial intelligence and learning processes more efficient.

7 Conclusion

There are two main approaches to machine learning: supervised learning and unsupervised learning. These can be further broken down by the different algorithms used to complete supervised and unsupervised learning tasks. In supervised learning, the main types of algorithm are regression and classification (including gradient descent, ID3, bagging, boosting and random forests). In unsupervised learning, they include hierarchical and K-means clustering. Machine learning can be applied to facial recognition, medical diagnosis, search engines, shopping cart recommendation systems and much more. The common indicator of a good application is that a large source of data related to the problem exists. Machine learning algorithms can then use their tailored decision making to translate that data into usable knowledge, producing value. The process of developing a machine learning application can be summarised as follows: start by collecting data, choose an appropriate algorithm, prepare the data, train the algorithm with sample data, verify the results and finally apply the knowledge produced by the algorithm. Data mining is a growing application of machine learning, as the World Wide Web and the Information Age have introduced data sets on a scale like never before. Going forward, it is important that data mining is used only ethically and not to the detriment of web users. As most of the development in machine learning has happened in the past 30 years, there is still much to be done. We should continue to use the human brain as a North Star guiding further research. The goal is to realise true artificial intelligence by improving machine learning algorithms, which may one day compete with the performance of our own brains.

8 References

Akaike, H. (1974) A new look at the statistical model identification. IEEE Transactions on Automatic Control. AC-19 (6), 716-723.

Alfaro, E., Gamez, M. & Garcia, N. (2013) adabag: An R Package for Classification with Boosting and Bagging. Journal of Statistical Software. 54 (2), 1-35.

Allaby, M. (2010) Ockham's razor, A Dictionary of Ecology. Oxford University Press.
Alpaydin, E. (2010) Introduction to machine learning. 2nd edition. Cambridge, Mass.; London, MIT Press.

Anderberg, M. R. (1973) Cluster analysis for applications. New York; London, Academic Press.

Ayodele, T. O. (2010) Types of Machine Learning Algorithms. In: Zhang, Y. (ed.) New Advances in Machine Learning. InTech. ISBN: 978-953-307-034-6.

Banfield, R. E., Hall, L. O., Bowyer, K. W. & Kegelmeyer, K. W. (2007) A comparison of decision tree ensemble creation techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence. 29 (1), 173-180.

Bartholomew-Biggs, M. (2008) Nonlinear Optimization with Engineering Applications. Dordrecht, Springer.

Beyad, Y. & Maeder, M. (2013) Multivariate linear regression with missing values. Analytica Chimica Acta. 796, 38-41.

Breiman, L. (1996) Bagging predictors. Machine Learning. 24 (2), 123-140.

Breiman, L. (2001) Random Forests. Machine Learning. 45 (1), 5-32.

Cambria, E., Huang, G., Zhou, H., Vong, C., Lin, J., Yin, J., Cai, Z., Liu, Q., Li, K., Feng, L., Ong, Y., Lim, M., Akusok, A., Lendasse, A., Corona, F., Nian, R., Miche, Y., Gastaldo, P., Zunino, R., Decherchi, S., Yang, X., Mao, K., Oh, B., Jeon, J., Toh, K., Kim, J., Yu, H., Chen, Y. & Liu, J. (2013) Extreme Learning Machines. IEEE Intelligent Systems. 28 (6), 30-59.

Chen, J., Ching, R. K. H. & Lin, Y. (2004) An extended study of the K-means algorithm for data clustering and its applications. Journal of the Operational Research Society. 55 (9), 976-987.

Coates, A. & Ng, A. Y. (2012) Learning feature representations with K-means. Lecture Notes in Computer Science. 7700, 561-580.

Correia, J., Costa, E., Ferreira, J. & Jamet, T. (1993) An Application of Machine Learning in the Domain of Loan Analysis. Lecture Notes in Computer Science. 667, 414-419.

Criminisi, A. & Shotton, J. (2013) Decision Forests for Computer Vision and Medical Image Analysis. 2013 edition.

Dietterich, T. (2000a) Ensemble methods in machine learning. Multiple Classifier Systems. 1857, 1-15.

Dietterich, T. (2000b) An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning. 40 (2), 139-157.

Etlinger, S. (2014) What do we do with all this big data? TED.com, https://www.ted.com/talks/susan_etlinger_what_do_we_do_with_all_this_big_data.
Etzioni, O. (1996) The World-Wide Web: Quagmire or Gold Mine? Communications of the ACM. 39 (11), 65-68.

Freitas, N. d. (2013) Machine Learning Lecture Course. University of British Columbia / Oxford University.

Freund, Y. & Schapire, R. E. (1996) Experiments with a new boosting algorithm. ICML. pp. 148-156.

Freund, Y. (1995) Boosting a weak learning algorithm by majority. Information and Computation. 121 (2), 256-285.

Freund, Y. & Schapire, R. E. (1995) A decision-theoretic generalization of on-line learning and an application to boosting. Lecture Notes in Computer Science. 904, 23-37.

Gama, J. (2012) A survey on learning from data streams: current and future trends. Progress in Artificial Intelligence. 1 (1), 45-55.

Gan, G. (2007) Data clustering: theory, algorithms, and applications. Philadelphia, PA, Society for Industrial and Applied Mathematics.

Goodman, A., Kamath, C. & Kumar, V. (2007) Statistical analysis and data mining: data analysis in the 21st century. Statistical Analysis and Data Mining.

Guess, M. J. & Wilson, S. B. (2002) Introduction to hierarchical clustering. Journal of Clinical Neurophysiology. 19 (2), 144-151.

Harrington, P. (2012) Machine learning in action. Shelter Island, N.Y., Manning Publications.

Hastie, T. (2009) The elements of statistical learning: data mining, inference, and prediction. 2nd edition. New York, Springer.

Kamber, M. (2000) Data mining: concepts and techniques. San Francisco; London, Morgan Kaufmann.

Kantardzic, M. (2011) Data Mining: Concepts, Models, Methods, and Algorithms. 2nd edition. Hoboken, Wiley.

Kaufman, L. (1990) Finding groups in data: an introduction to cluster analysis. Wiley.

Kiwiel, K. C. (2001) Convergence and efficiency of subgradient methods for quasiconvex minimization. Mathematical Programming, Series B. 90 (1), 1-25.

Kreyszig, E. (2006) Advanced engineering mathematics. 9th, international edition. Hoboken, N.J., Wiley.

Larose, D. T. (2005) k-Nearest Neighbor Algorithm. Hoboken, NJ, USA.

Leban, G. (2013) Information visualization using machine learning. Informatica (Slovenia). 37 (1), 109-110.
Lemmens, A. & Croux, C. (2006) Bagging and boosting classification trees to predict churn. Journal of Marketing Research.

Long, P. M. & Servedio, R. A. (2009) Random classification noise defeats all convex potential boosters. Machine Learning. 1-18.

MacQueen, J. B. (1966) Some methods for classification and analysis of multivariate observations.

Mardia, K. V. (1979) Multivariate analysis. London, Academic Press.

Mingers, J. (1989) An empirical comparison of selection measures for decision-tree induction. Machine Learning. 3 (4), 319-342.

Mitchell, T. M. (1997) Machine learning. Boston, Mass., WCB/McGraw-Hill.

Myles, A. J., Feudale, R. N., Liu, Y., Woody, N. A. & Brown, S. D. (2004) An introduction to decision tree modeling. Journal of Chemometrics. 18 (6), 275-285.

Ng, A. (2014) Machine Learning (Coursera), Stanford University. coursera.org.

Norvig, P. (2012) Artificial intelligence: A new future. New Scientist. 216 (2889), vi-vii.

Peña, J. M., Lozano, J. A. & Larrañaga, P. (1999) An empirical comparison of four initialization methods for the K-Means algorithm. Pattern Recognition Letters. 20 (10), 1027-1040.

Pino-Mejías, R., Cubiles-de-la-Vega, M., López-Coello, M., Silva-Ramírez, E. & Jiménez-Gamero, M. (2004) Bagging Classification Models with Reduced Bootstrap. In: Fred, A., Caelli, T., Duin, R. W., Campilho, A. & de Ridder, D. (eds.). Springer Berlin Heidelberg. pp. 966-973.

Quinlan, J. R. (1993) C4.5: programs for machine learning. Amsterdam, Morgan Kaufmann.

Quinlan, J. R. (1986) Induction of decision trees. Machine Learning. 1 (1), 81-106.

Robnik-Sikonja, M. (2004) Improving random forests. Machine Learning: ECML 2004, Proceedings. 3201, 359-370.

Rohlf, F. J. (1982) Single-link clustering algorithms. Handbook of Statistics. 2, 267-284.

Russell, S. J. (2010) Artificial intelligence: a modern approach. 3rd, international edition. Boston, Mass.; London, Pearson.

Michalski, R. S., Bratko, I. & Kubat, M. (1998) Machine learning and data mining: methods and applications. Chichester, Wiley.

Salter-Townshend, M., White, A., Gollini, I. & Murphy, T. B. (2012) Review of statistical network analysis: models, algorithms, and software. Statistical Analysis and Data Mining. 5 (4), 243-264.

Savage, N. (2012) Better Medicine Through Machine Learning. Communications of the ACM. 55 (1), 17-19.
Shao, X., Zhang, G., Li, P. & Chen, Y. (2001) Application of ID3 algorithm in knowledge acquisition for tolerance design. Journal of Materials Processing Tech. 117 (1), 66-74.

Sharma, R., Aiken, A. & Nori, A. V. (2014) Bias-variance tradeoffs in program analysis.

Shi, T. & Yu, B. (2006) Binning in Gaussian kernel regularization. Statistica Sinica. 16 (2), 541-568.

Skurichina, M. & Duin, R. P. W. (1998) Bagging for linear classifiers. Pattern Recognition. 31 (7), 909-930.

Snyman, J. A. (2005) Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-based Algorithms. Dordrecht, Springer-Verlag New York Inc.

Stigler, S. M. (1981) Gauss and the Invention of Least Squares. The Annals of Statistics. 9 (3), 465-474.

Sun, X., Liu, Y., Chen, H., Han, J., Wang, K. & Xu, M. (2013) Feature selection using dynamic weights for classification. Knowledge-Based Systems. 37, 541-549.

Svetnik, V., Liaw, A., Tong, C., Culberson, J., Sheridan, R. & Feuston, B. (2003) Random forest: A classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences. 43 (6), 1947-1958.

Van Wel, L. & Royakkers, L. (2004) Ethical issues in web data mining. Ethics and Information Technology. 6 (2), 129-140.

Wagstaff, K., Cardie, C., Rogers, S. & Schrödl, S. (2001) Constrained k-means clustering with background knowledge. ICML. pp. 577-584.

Wolpert, D. H. & Macready, W. G. (1997) No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation. 1 (1), 67-82.

Yan, K., Zhu, J. & Qiang, S. (2007) The application of ID3 algorithm in aviation marketing.

Qian, Y., Shi, Q. & Wang, Q. (2002) CURE-NS: a hierarchical clustering algorithm with new shrinking scheme.

Zhang, S. C., Zhang, C. Q. & Yang, Q. (2003) Data preparation for data mining. Applied Artificial Intelligence. 17 (5-6), 375-381.

Zhang, T., Ramakrishnan, R. & Livny, M. (1997) BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining and Knowledge Discovery. 1 (2), 141-182.
9 Acknowledgements

The author would like to thank Dr Frederic Cegla (Senior Lecturer at Imperial College London) for his supervision of this literature review project. Thanks also go to Shaun Dowling (Co-founder at Interpretive.io), Barney Hussey-Yeo (Data Scientist at Wonga), Ferenc Huszar (Data Scientist at Balderton Capital) and Joseph Root (Co-founder at Permutive.com) for sharing their insights on machine learning.