IMPERIAL COLLEGE LONDON
AN INTRODUCTORY REVIEW OF
MACHINE LEARNING ALGORITHMS AND
THEIR APPLICATION TO DATA MINING
DEPARTMENT OF MECHANICAL ENGINEERING
GUY RIESE
19/12/2014
Abstract
This review aims to provide an introduction to machine learning by reviewing literature on the
subject of supervised and unsupervised machine learning algorithms, development of
applications and data mining. In supervised learning, the focus is on regression and
classification approaches. ID3, bagging, boosting and random forests are explored in detail. In
unsupervised learning, hierarchical and K-means clustering are studied. The development of a
machine learning application starts by collecting and preparing data, then choosing and
training an algorithm, and finally using the application. Large data sets are on the rise with
growing use of the World Wide Web, opening up opportunities in data mining where it is
possible to extract knowledge from raw data. It is found that machine learning has a vast range
of applications in everyday life and industry. The elementary introduction provided by this
review offers the reader a sound foundational basis with which to begin experimentation and
exploration of machine learning applications in more depth.
Contents
1 Introduction
1.1 Objectives
2 Supervised Machine Learning Algorithms
2.1 Regression
2.2 Classification Decision Tree Learning
2.2.1 ID3
2.2.2 Bagging and Boosting
2.2.3 Random Forests
3 Unsupervised Machine Learning Algorithms
3.1 Clustering
3.1.1 Hierarchical Clustering: Agglomerative and Divisive
3.1.2 K-means
4 Steps in developing a machine learning application
4.1 Collect Data
4.2 Choose Algorithm
4.3 Prepare Data
4.4 Train Algorithm
4.5 Verify Results
4.6 Use Application
5 Data Mining
6 Discussion
6.1 Literature
6.2 Future Developments
7 Conclusion
8 References
9 Acknowledgements
1 Introduction
Computers solve problems using algorithms: step-by-step instructions that the computer
follows sequentially to process a set of inputs into a set of outputs. Algorithms are typically
written line-by-line by computer programmers. But what if we don't have the expertise or
fundamental understanding to write the algorithm for a program?
For example, consider filtering spam emails from genuine emails (Alpaydin, 2010). For this
problem, we know the input (an email) and the output (identifying it as spam or genuine) but
we don't know what actually classifies it as a spam email. This lack of understanding often arises
when there is some intellectual human involvement in the problem we are trying to solve. In
this example, the human involvement is that a human wrote the original spam email.
Similarly, humans are involved in handwriting recognition, natural language processing and
facial recognition. It is clear that these problems are something that our subconscious is able to
handle effortlessly, yet we don't consciously understand the fundamentals of the process. For
sequential logical tasks, like sorting a list alphabetically, we consciously understand the
fundamental process and therefore can program a solution (an algorithm). But this isn't possible
for more complex tasks where the process is more of an unknown "black box".
Machine learning is what gives us the tools to solve these "black box" problems. "What we lack
in knowledge, we make up for in data" (Alpaydin, 2010). Using the spam example, we can use a
data set of millions of emails, some of which are spam, in order to "learn" what defines a spam
email. The learning principles are derived from statistical approaches to data analysis. In this
way, we do not need to understand the process, but we can construct an accurate and functional
model (a "black box") to approximate it. Whilst this doesn't explain the fundamental
process, it can identify patterns and regularities that allow us to reach solutions.
Artificial intelligence was conceived in the mid-20th century, but it was not until the 1980s that
its more statistical branch, machine learning, began to separate off and become a field in its
own right (Russell, 2010). Machine learning developed a scientific approach to solving problems
of prediction and finding patterns in data. This quickly proved valuable in industry, which
fuelled further academic exploration. Entering the 21st century, we have seen a rapid rise in the
popularity of machine learning, largely due to the emergence of large data sets and the demand
for data mining processes to extract knowledge from them. Machine learning has since
established itself as a leading field of computer science, with applications ranging from
detecting credit card fraud to medical diagnosis.
Data mining is the process of extracting knowledge from raw data (Kamber, 2000). With
the rise of large data sets ("big data"), data mining has thrived. Data mining tasks can be
categorised as either descriptive or predictive. A descriptive task involves extracting
qualitative characteristics of data: for example, segmenting a database of customers into
groups in order to find trends within those groups. A predictive task involves using the
existing data to make predictions about future data inputs: for example, how can we learn
from our existing customers which products might be favoured by a new customer?
Machine learning is a vast subject with masses of literature. One of the main challenges in
understanding machine learning is knowing where to start. This review will introduce the two
main approaches of machine learning: supervised and unsupervised learning. We consider some
of the more generalist and flexible machine learning algorithms in these categories relevant to
data mining and introduce some methods of optimising them. Additionally, this review will
indicate the steps to develop a machine learning application to solve a specific problem. Finally
we relate this theory and practical understanding to the application of data mining. With this
knowledge, the reader will have a strong machine learning foundation to enable them to
approach problems and interpret relevant research themselves.
1.1 Objectives
1. Understand the background of Machine Learning. What are some of the
key approaches and applications?
2. Understand some of the different mechanisms behind Machine Learning processes.
3. Explore machine learning algorithms and the decision making process of a machine
learning program.
4. How do you develop a machine learning application?
5. Case/Application Focus: Investigate machine learning in relation to data mining.
6. Briefly discuss key areas for future development of this technology.
2 Supervised Machine Learning Algorithms
The aim of a supervised machine learning algorithm is to learn how inputs relate to outputs in
a data set and thereby produce a model able to map new inputs to inferred outputs (Ayodele,
2010). A complete set of training data is therefore a prerequisite for any supervised learning task.
A general equation for this can be defined as follows (Alpaydin, 2010):

y = h(x | θ)    Eq. 2.1

where the output, y, is given by the function, h, which depends on the inputs, x, and the
parameters, θ. The role of the supervised machine learning algorithm is to optimise the
parameters θ by minimising the approximation error, thereby producing the most accurate
outputs. In layman's terms, this means that existing "right answers" are used to predict new
answers to the problem; the algorithm learns from examples (Russell, 2010). We are
unequivocally telling the algorithm what we want to know and actively training it to solve our
problem. Supervised learning consists of two fundamental stages: i) training and ii) prediction.
Building a bird classification system is a problem that can be solved with a supervised machine
learning algorithm (Harrington, 2012). Start by taking characteristics of the object you are trying
to classify, called features or attributes. For a bird classification system, these could be weight,
wingspan, whether feet are webbed and the colour of its back. In reality, you can have an infinite
number of features rather than just four (Ng, 2014). The features can be of different types. In
this example, weight and wingspan are numeric (decimal), whether feet are webbed is simply
yes or no (binary), and if you choose a selection of, say, 7 different colours then each "back
colour" can be encoded as an integer. According to Eq. 2.1, we want to find a function h which
we can use to determine the bird species y given inputs of particular features x. To achieve
this, we require training data (i.e. data on the weight, wingspan, etc. of a number of bird
species). The training data is used (stage (i)) to determine the parameters θ which define a
function h. It is unlikely this will be perfectly accurate, so we can compare the outputs of our
function on a test set (where we already know the true outputs) in order to measure its
accuracy. Provided the function is accurate, we can use our model to predict bird species given
new inputs of weight, wingspan, etc., perhaps entered by users trying to identify a bird (stage
(ii)).
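As an illustrative sketch of this idea (the colour palette and feature ordering here are invented for the example, not taken from the literature), the four bird features above can be packed into a numeric feature vector x:

```python
# Hypothetical encoding of the four bird features into a numeric vector x.
# Weight and wingspan are already numeric; webbed feet is binary (yes/no);
# back colour is one of a fixed palette of 7, encoded as its integer index.
COLOURS = ['black', 'brown', 'white', 'grey', 'red', 'green', 'blue']

def encode(weight_kg, wingspan_m, webbed_feet, back_colour):
    return [weight_kg,
            wingspan_m,
            1.0 if webbed_feet else 0.0,
            float(COLOURS.index(back_colour))]

x = encode(1.2, 0.9, True, 'grey')   # one training example's input vector
```

A full training set would pair each such vector x with its known species label y.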
This example is extremely simplistic and leaves many questions unanswered: how do we
choose the features, how do we reach a definition for the model/function h, how do we optimise
our algorithm for maximum accuracy, and how do we deal with imperfect training data
(noise)? The sections which follow seek to answer these questions. Regression and
classification are both supervised learning tasks where a model is defined with a set of
parameters. A regression solution is appropriate when the output is continuous, whereas a
classification solution is used for discrete outputs (Ng, 2014; Harrington, 2012).
2.1 Regression
In regression analysis the output is a random variable, y, and the input is the independent
variable, x. We seek to find the dependence of y on x; the mean dependence of y on x gives us
the function and model h that we are seeking to define (Kreyszig, 2006). The most basic form of
regression, using just one independent variable, is called univariate linear regression. This can
be used to produce a straight-line function:

h(x) = θ0 + θ1x    Eq. 2.2

By finding θ0 and θ1 it is therefore possible to fully define the model. In seeking to choose θ0
and θ1 so that h is as close to our (x, y) values as possible, we must minimise the sum of
squared errors, after Gauss's method of least squares (Stigler, 1981; Freitas, 2013; Beyad &
Maeder, 2013):
J(θ0, θ1) = Σ_{i=1}^{m} (h(x_i) − y_i)²    Eq. 2.3
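As a minimal sketch, the squared-error cost of Eq. 2.3 can be computed directly over the training pairs (the toy data below is invented):

```python
def cost(theta0, theta1, data):
    """Sum of squared errors J(theta0, theta1) over (x, y) pairs (Eq. 2.3)."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in data)

data = [(1.0, 2.0), (2.0, 4.0)]   # toy data lying exactly on y = 2x
cost(0.0, 2.0, data)               # perfect fit: J = 0.0
cost(0.0, 1.0, data)               # J = (1 - 2)^2 + (2 - 4)^2 = 5.0
```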
To minimise this function, we can apply the gradient descent algorithm, also known as the
method of steepest descent (Ng, 2014; Bartholomew-Biggs, 2008; Kreyszig, 2006; Snyman,
2005; Akaike, 1974). Gradient descent is a numerical method used to minimise a multivariable
function by iterating away from a point along the direction which causes the largest decrease
in the function (the direction with the most negative gradient or "downwards steepness").
The equation for gradient descent is as follows:
θ_j := θ_j − α (∂/∂θ_j) J(θ0, θ1)    Eq. 2.4
Figure 2.1. Gradient descent. (Kreyszig, 2006)
where j = 0, 1 for this case of two unknowns, and α is the step size, known as the learning
rate. The value of the learning rate determines a) whether gradient descent converges to the
minimum or not and b) how quickly it converges. If the learning rate is too small, gradient
descent can be slow. On the other hand, if the learning rate is too large, the steps taken may be
too large, resulting in overshoot and missing of the minimum. Figure 2.1 illustrates gradient
descent from a starting point x₀ = θ^(0), iterating to x₁ = θ^(1) and x₂ = θ^(2). Eventually
this will reach the minimum, which lies at the centre of the innermost contour.
An analogy to gradient descent is walking on the side of a hill in a valley surrounded by thick
fog. The aim is to get to the bottom of the valley. Even though you cannot see where the
bottom of the valley is, as long as each step you take slopes downwards, you will eventually
reach a lowest point, although in general this is only guaranteed to be a local minimum.
Gradient descent is not the fastest minimisation method; however, it offers a distinct approach
which is used repeatedly in many machine learning optimisation problems. Furthermore, it
scales well to larger data sets (Ng, 2014), which is a significant factor in real-life applications.
Sub-gradient projection is a possible alternative to the descent method; however, it is typically
slower than gradient descent (Kiwiel, 2001). With an appropriate learning rate, gradient descent
serves as a reliable and effective tool for minimisation problems.
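Putting Eqs. 2.2–2.4 together, a minimal gradient descent loop for univariate linear regression might look like this (the data, learning rate and iteration count are illustrative choices, not prescribed by the text):

```python
# Toy training data generated by y = 1 + 2x, so the ideal parameters
# are theta0 = 1 and theta1 = 2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = 0.0, 0.0
alpha = 0.01                 # learning rate (small enough to converge here)
for _ in range(5000):
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = 2 * sum(errors)                              # dJ/dtheta0
    grad1 = 2 * sum(e * x for e, x in zip(errors, xs))   # dJ/dtheta1
    theta0 -= alpha * grad0   # simultaneous update of both parameters
    theta1 -= alpha * grad1
```

After the loop, theta0 and theta1 are close to 1 and 2; a much larger alpha would instead produce the overshoot described above.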
Hence, by finding values for the parameters θ_j, we are able to find an equation for the model
h. If this model can predict values of y for novel examples, we say that it "generalises" well
(Russell, 2010). In this example, we have applied only linear regression (a 1-degree polynomial).
It is possible to increase the hypothesis h to a polynomial of higher degree, whereby the fit is
more accurate (curved). However, as you increase the degree of the polynomial, you increase
the risk of over-fitting the data; there is a balance to be reached between fitting the training
data well and producing a model that generalises better (Sharma, Aiken & Nori, 2014).
The main approach for dealing with this problem is to use the principle of Ockham's razor: use
the simplest hypothesis consistent with the data (Allaby, 2010). For example, a 1-degree
polynomial is simpler than a 7-degree polynomial, so although the latter may fit the training
data better, the former should be preferred. It is possible to further simplify models by reducing
the number of features being considered. This is achieved by discarding features which do not
appear relevant (Ng, 2014; Russell, 2010).
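A small pure-Python sketch of this trade-off (the data points are invented, roughly y = x plus noise): a degree-4 polynomial fits five training points exactly, while the simpler straight line has a small training error but generalises far better to a held-out input.

```python
def interpolate(points):
    """Exact Lagrange interpolating polynomial through all points."""
    def h(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return h

def linear_fit(points):
    """Closed-form least-squares straight line (1-degree polynomial)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    t1 = (sum((x - mx) * (y - my) for x, y in points)
          / sum((x - mx) ** 2 for x, _ in points))
    t0 = my - t1 * mx
    return lambda x: t0 + t1 * x

train = [(0.0, 0.2), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2)]
h_complex = interpolate(train)   # 4-degree polynomial: zero training error
h_simple = linear_fit(train)     # straight line: small training error

# At a held-out input x = 5 (true value near 5.1), the over-fitted
# polynomial extrapolates to about 8.7 while the line predicts about 5.06.
```

This is Ockham's razor in miniature: the simpler hypothesis sacrifices a perfect fit to the training data in exchange for better generalisation.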
Regression is a simple yet powerful tool which can be used to teach a program to understand
data inputs and accurately predict data outputs through machine learning processes.
2.2 Classification Decision Tree Learning
Decision trees are a flowchart-like method of classifying a set of data inputs. The input is a
vector of features and the output is a single, unified "decision" (Russell, 2010). In this
formulation the output is binary: it can either be true (1) or false (0). A decision tree performs
a number of tests on the data, asking questions about the input in order to filter and categorise
it. This is a natural way to model how the human brain thinks through solving problems; many
troubleshooting tools and "How-To" manuals are structured like decision trees. A tree begins at
the root node, extends down branches through nodes of classification tests (decision nodes)
and finally ends at "leaf" (terminal) nodes (Criminisi & Shotton, 2013). The aim is to develop a
decision tree using training data which can then be used to interpret and classify novel data for
which the classification is unknown.
The first step in the decision tree learning process is to induce, or "grow", a decision tree from
initial training data: input features/attributes are transformed into a decision tree based on
example outputs provided in the training data. In the example in Figure 2.2, the features are
Patrons (how many people are currently sitting in the restaurant), WaitEstimate (the wait
estimated by the front of house), Alternate (whether there is another restaurant option nearby),
Hungry (whether the customer is already hungry) and so on. The output is a decision on
whether or not to wait for a table. The decision tree learning algorithm employs a "greedy"
strategy of testing the most divisive attribute first (Russell, 2010). Each test divides the problem
up further into sub-problems which will eventually classify the data. It is important that the
training data set is as complete as possible in order to prevent decision trees being induced with
mistakes. If the algorithm does not have an example for a particular scenario (e.g. a
WaitEstimate of 0–10 minutes when Patrons is full) then it could output a tree which
consistently makes the wrong decision for that scenario.
One of the mathematical ways in which decision tree divisions are quantifiably scored is with
the measure of information gain (InfoGain) (Myles et al., 2004; Mingers, 1989). InfoGain is a
mathematical tool for measuring how effectively a decision node divides the example data. It
is based on the concept of information (Info) defined by Eq. 2.5 (Myles et al., 2004):
Info = − Σ_i (n_i(t) / n(t)) log2(n_i(t) / n(t))    Eq. 2.5
where n_i(t) is the number of examples in category i at node t and n(t) is the total number of
examples at node t. The change in information produced by a decision node is defined by
Eq. 2.6 (Myles et al., 2004):
InfoGain = Info(Parent) − Σ_i s_i Info(Child_i)    Eq. 2.6

Figure 2.2. A decision tree for deciding whether to wait for a table. (Russell, 2010)
where s_i is the proportion of examples that are filtered into the i-th child. The optimal
decision node is therefore the node which maximises this "change in information".
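A minimal sketch of Eqs. 2.5 and 2.6 (the category labels are invented): a test that perfectly separates a 50/50 parent node gains the full 1 bit of information.

```python
import math
from collections import Counter

def info(labels):
    """Info at a node (Eq. 2.5): the entropy of its label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(parent, children):
    """InfoGain (Eq. 2.6): parent Info minus size-weighted child Info."""
    n = len(parent)
    return info(parent) - sum(len(ch) / n * info(ch) for ch in children)

parent = ['wait'] * 5 + ['leave'] * 5       # 50/50 split: Info = 1 bit
perfect = [['wait'] * 5, ['leave'] * 5]     # a perfectly divisive test
info_gain(parent, perfect)                  # gains the full 1.0 bit
```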
Despite this quantification, there are usually several decision trees which are capable of
classifying the data. To choose the optimal decision tree, inductive bias is employed (Mitchell,
1997). The inductive bias depends on the particular type of decision tree algorithm and will be
explored in Section 2.2.1.
Once a decision tree has been grown, the decision tree algorithm may prune the tree (Russell,
2010; Myles et al., 2004). Pruning combats overfitting in the presence of noisy data by
removing irrelevant decision nodes (Quinlan, 1986). The algorithm must also separately
identify and remove features which do not aid the division of examples. The statistical method
employed for this is the chi-squared significance test (supported by both Quinlan (1986) and
Russell (2010)), known as chi-squared pruning. The data is analysed under the null hypothesis
of "no underlying pattern": the degree of deviation between the observed split and the split
expected under the null hypothesis is calculated, and a cut-off of, say, 5% significance is
applied. In this way, noise in the training data is handled and the tree design is optimised.
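The statistic behind this test can be sketched as follows (the counts are invented): under the null hypothesis of no underlying pattern, each child of a split is expected to mirror the parent's class proportions, and the statistic sums the squared deviations from that expectation.

```python
def chi_squared(parent_pos, parent_neg, children):
    """Chi-squared statistic for a candidate split.

    children is a list of (pos, neg) example counts after the split.
    """
    total = parent_pos + parent_neg
    stat = 0.0
    for pos, neg in children:
        size = pos + neg
        exp_pos = parent_pos * size / total   # expected under "no pattern"
        exp_neg = parent_neg * size / total
        stat += ((pos - exp_pos) ** 2 / exp_pos
                 + (neg - exp_neg) ** 2 / exp_neg)
    return stat

chi_squared(8, 8, [(4, 4), (4, 4)])   # 0.0: split mirrors the parent (noise)
chi_squared(8, 8, [(8, 0), (0, 8)])   # 16.0: strong pattern, keep the node
```

A node whose statistic is not significant at the chosen cut-off (e.g. 5%) would be pruned.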
Multiple decision tree algorithms exist, exhibiting a variety of approaches. However, the most
effective use of them is often to combine their methodology into an ensemble algorithm in
order to obtain better predictive performance than any of the individual algorithms alone.
Section 2.2.1 explores the ID3 decision tree learning algorithm, which aims to induce the
simplest possible tree. Sections 2.2.2 and 2.2.3 explore some ensemble methods of machine
learning.
2.2.1 ID3
The majority of classification decision tree learning algorithms are variations on an original
central methodology first proposed as the ID3 algorithm (Quinlan, 1986) and later refined to
the C4.5 algorithm (Quinlan, 1993). The characteristics of decision tree algorithms discussed
previously apply to ID3, but it has some subtleties and limitations too. One of these is that
pruning does not apply to ID3 as it does not re-evaluate decision tree solutions after it has
selected one.
Instead, the approach taken by the ID3 algorithm is to iterate with a top-down greedy search
method through the possible decision tree outputs, starting from the simplest possible solution
and gradually increasing complexity until the first valid solution is found. Each decision tree
output is known as a hypothesis; the hypotheses are effectively different possible solutions for
the model or function h. This unidirectional approach works to reach a consistently
satisfactory decision tree without expensive computation (Quinlan, 1986). However, it implies
the algorithm never backtracks to reconsider earlier choices (Mitchell, 1997). The core
decision-making lies in deciding which attribute makes the optimal decision node at each
point. This is solved using the statistical property InfoGain discussed earlier. ID3's approach is
known as a hill-climbing search, starting with an empty tree and building a decision tree from
the top down.

Figure 2.3. Searching through decision tree hypotheses from simplest to increasing complexity, as directed by information gain. (Mitchell, 1997)
This approach has advantages and disadvantages (Mitchell, 1997; Quinlan, 1986). It can be
considered a positive capability that ID3 in theory considers all possible decision tree
permutations. Some other algorithms take the major risk of evaluating only a portion of the
search space in order to leverage greater speed, but this can lead to inaccuracy. On the other
hand, a weakness of ID3 is its "goldfish memory" approach of considering only the current
decision tree hypothesis at any one time. This means that it does not determine how many
different viable decision trees there are; it simply picks the first it reaches, making post-
selection pruning redundant. We consider ID3 an important algorithm to understand because
it serves as a core algorithm from which many extensions have developed. It can easily be
modified to utilise pruning and handle noisy data, as well as be optimised for less common
conditions.
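A compact sketch of ID3's core loop on an invented restaurant-style data set (the attribute names and values are illustrative, not Quinlan's originals): grow the tree top-down, greedily choosing the attribute with maximum information gain at each node, and never backtrack.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr):
    """Information gain of splitting rows on attr."""
    before = entropy([r['label'] for r in rows])
    after = 0.0
    for v in {r[attr] for r in rows}:
        sub = [r for r in rows if r[attr] == v]
        after += len(sub) / len(rows) * entropy([r['label'] for r in sub])
    return before - after

def id3(rows, attrs):
    labels = [r['label'] for r in rows]
    if len(set(labels)) == 1:          # pure node -> leaf
        return labels[0]
    if not attrs:                      # no tests left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))   # greedy choice
    return {best: {v: id3([r for r in rows if r[best] == v],
                          [a for a in attrs if a != best])
                   for v in {r[best] for r in rows}}}

def classify(tree, row):
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree

rows = [
    {'patrons': 'full', 'hungry': 'yes', 'label': 'wait'},
    {'patrons': 'full', 'hungry': 'no',  'label': 'leave'},
    {'patrons': 'some', 'hungry': 'yes', 'label': 'wait'},
    {'patrons': 'some', 'hungry': 'no',  'label': 'wait'},
    {'patrons': 'none', 'hungry': 'yes', 'label': 'leave'},
    {'patrons': 'none', 'hungry': 'no',  'label': 'leave'},
]
tree = id3(rows, ['patrons', 'hungry'])   # tests 'patrons' first (highest gain)
```

Note the hill-climbing character: once 'patrons' is chosen at the root, that choice is never revisited.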
It is important to consider why ID3's inductive bias towards simpler decision trees is optimal.
The Ockham's razor approach (Allaby, 2010) advises giving preference to the simplest
hypothesis that fits the data. But stating this does not make it optimal: why is the simplest
solution the best choice? One argument is that there are far fewer simple hypotheses than
complex ones, so a simple hypothesis that fits the data is unlikely to do so by coincidence and
is more likely to be a genuinely accurate generalisation, which is what we aim to reach in
machine learning (Mitchell, 1997). Also, there is evidence that this approach will be
consistently faster at reaching a solution, due to considering only a portion of the hypothesis
space (Quinlan, 1986). On the other hand, there are contradictions in this approach. It is
entirely possible to obtain two different solutions from the exact same data, simply if the
iterations by ID3 take two different paths. This is likely to be acceptable in most applications
but may be a crucial complication for others (Mitchell, 1997).
The C4.5 algorithm (Quinlan, 1993) extended the original ID3 algorithm with increased
computational efficiency, the ability to handle training data with missing attributes, the ability
to handle continuous attributes (rather than just discrete ones) and various other
improvements. One of the most significant modifications allowed a new approach to
determining the optimal decision tree solution. Choosing the first simple valid solution can be
problematic if there is noise in the data; C4.5 instead allows trees which overfit the data to be
produced and then pruned post-induction. Despite sounding like a longer process, this new
approach was found to be more successful in practice (Mitchell, 1997).
The ID3 algorithm can be considered a basic but effective algorithm for building decision trees.
With refinement to the C4.5 algorithm, it is competent at producing an adequate solution
without requiring vast computing resources. For this reason, it is extremely well supported and
commonly implemented across numerous programming languages. It is considered a highly
credible algorithm used in engineering (Shao et al., 2001), aviation (Yan, Zhu & Qiang, 2007)
and wherever automated or optimal decision making processes are required.
2.2.2 Bagging and Boosting
Bagging and boosting are ensemble techniques which means they use multiple learning
algorithms to improve the overall performance of the machine learning system (Banfield et al.,
2007). In decision tree learning, this helps to produce the optimal decision tree (rather than just
a valid one). The optimal decision tree is one that has the lowest error rate in predicting outputs
y for data inputs x (Dietterich, 2000a). Bagging and boosting improve the performance by
manipulating the training data before it is fed into the algorithm.
Bagging is an abbreviation of "Bootstrap AGGregatING" (Pino-Mejias et al., 2004) and was first
developed by Leo Breiman in 1994. Bagging takes subset samples from the full training set to
produce groups of training sets called "bags" (Breiman, 1996). The key methodology of bagging
is to draw m examples, with replacement, from the original training set of m examples. Each
bag ends up containing approximately 63.2% of the distinct examples in the original training
set, since the probability that a given example is never drawn is (1 − 1/m)^m ≈ 1/e
(Dietterich, 2000a).
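The 63.2% figure is easy to verify empirically with a sketch of the bootstrap step (the data and seed are arbitrary):

```python
import random

random.seed(42)                      # arbitrary seed for reproducibility
m = 10000
training_set = list(range(m))        # stand-in for m training examples

# One "bag": m draws from the training set, with replacement.
bag = [random.choice(training_set) for _ in range(m)]

unique_fraction = len(set(bag)) / m  # ~0.632, i.e. ~1 - 1/e
```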
Boosting was first developed by Freund and Schapire in 1995 and similarly manipulates the
example training data in order to improve the performance of the decision tree learning
algorithm (Freund & Schapire, 1996; Freund & Schapire, 1995; Freund, 1995). The key
differentiator of boosting is that it assigns each example a weight reflecting how badly that
example is currently being predicted (Banfield et al., 2007). Misclassified examples are given an
incrementally greater weighting in each iteration of the algorithm. In subsequent iterations,
the algorithm focuses on examples with a greater weighting (favouring examples which are
harder to classify over those which are consistently classified correctly).
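This weight update can be sketched in the style of AdaBoost (Freund & Schapire's algorithm; the toy numbers are invented). A weak learner that misclassifies one of four equally weighted examples has weighted error ε = 0.25; after reweighting and normalising, the misclassified example carries half of the total weight.

```python
import math

weights = [0.25, 0.25, 0.25, 0.25]            # uniform starting weights
missed = [False, False, False, True]          # example 3 was misclassified

eps = sum(w for w, m in zip(weights, missed) if m)   # weighted error = 0.25
alpha = 0.5 * math.log((1 - eps) / eps)              # this learner's "say"

# Up-weight misclassified examples, down-weight correct ones, renormalise.
weights = [w * math.exp(alpha if m else -alpha)
           for w, m in zip(weights, missed)]
total = sum(weights)
weights = [w / total for w in weights]        # misclassified example -> 0.5
```

The next weak learner therefore trains on a distribution dominated by the example the last one got wrong.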
Breiman (1996) identified that bagging improves the performance of unstable learning
algorithms but tends to reduce the performance of more stable algorithms. Decision tree
learning algorithms, neural networks and rule learning algorithms are all unstable, whereas
linear regression and K-nearest neighbour (Larose, 2005) algorithms are very stable. The
improvements offered by bagging and boosting are therefore very relevant to decision tree
learning. But why do bagging and boosting improve the performance of unstable algorithms
whilst degrading stable ones? The main components of error in machine learning algorithms
can be summarised as noise, bias and variance. An unstable learning algorithm is one where
small changes in the training data cause significant fluctuation in the response of the algorithm
(i.e. high variance) (Dietterich, 2000a). In both bagging and boosting, the training data set is
perturbed and the resulting models are combined, which reduces the variance of the final
model and hence makes the algorithm more stable (Skurichina & Duin, 1998). The effect of
this is to shift the focus of the algorithm to the most relevant region of the training data. On
the other hand, combining models in this way makes little difference to an already stable
algorithm, except that fewer examples are considered to reach the same solution.
A machine learning algorithm is considered accurate if it produces a model h with an accuracy
greater than ½ (i.e. the decision tree performs better than random guessing on a binary
decision). Algorithms are tested to this limit by adding noise to the training data. Noisy data
is training data which contains mislabelled examples. Noise is problematic for boosting and has
been shown to considerably reduce its classification performance (Long & Servedio, 2009;
Dietterich, 2000b; Dietterich, 2000a; Freund & Schapire, 1996). This poor performance is
intuitive: the boosting method converges on the hardest-to-classify data, and mislabelled data
is the hardest to classify and fruitless to focus on, hence the fatal flaw of boosting. Critically,
Long & Servedio (2009) showed that the most common boosting algorithms, such as AdaBoost
and LogitBoost, fall to accuracies of less than ½ on high-noise data, rendering them
meaningless. Conversely, when directly comparing the effectiveness of bagging and boosting
methods, Dietterich (2000b) found that bagging was "clearly" the best method.
Bagging actually uses the noise to generate a more diverse collection of decision tree hypotheses, so introducing noise to the training data can even improve accuracy. However,
experimental results have shown that when there is no noise in the training data, boosting gives
the best results (Banfield et al., 2007; Lemmens & Croux, 2006; Dietterich, 2000b; Freund &
Schapire, 1996). In conclusion, when deciding between machine learning algorithms, an important factor to consider is confidence in the consistency of the training data being provided. Boosting is ideal when the data is clean, but bagging is more robust to noise.
A possible solution to this dilemma is explored by Alfaro, Gamez & Garcia (2013) where features
of both bagging and boosting are combined in the design of a new classification decision tree
learning algorithm: adabag. The common goal of both bagging and boosting is to improve
accuracy by modifying the training data. Based on the AdaBoost algorithm, adabag allows analysis of the error as the ensemble is grown, mitigating the noise problem.
Bagging and boosting are effective techniques for improving the predictive performance of
machine learning algorithms when applied to decision tree learning. By generating an ensemble
of decision trees and finding the optimal hypothesis analytically, accuracy is increased.
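As an illustration of the ensemble idea behind bagging, the following minimal sketch (toy data and a one-dimensional decision stump as the weak learner, both invented purely for illustration) trains stumps on bootstrap resamples and combines their predictions by majority vote:

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a 1-D decision stump: pick the threshold with the fewest
    misclassified examples when predicting label = (x > threshold)."""
    best_t, best_err = None, float("inf")
    for t, _ in sample:
        err = sum((x > t) != y for x, y in sample)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def bag(data, n_models=25, seed=0):
    """Train one stump per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_models)]

def bagged_predict(stumps, x):
    """Combine the ensemble by majority (mode) vote."""
    votes = [x > t for t in stumps]
    return Counter(votes).most_common(1)[0][0]

# Toy data: the true label is (x > 5), with one mislabelled (noisy) example.
data = [(float(x), x > 5) for x in range(10)]
data[3] = (3.0, True)  # injected label noise
stumps = bag(data)
print(bagged_predict(stumps, 8.0))  # the majority of stumps still classify 8.0 as True
```

Because each stump sees a different bootstrap sample, the single mislabelled point perturbs only some of the ensemble, and the majority vote absorbs the damage, which is the noise robustness discussed above.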
2.2.3 Random Forests
Random forests are another ensemble learning technique used to improve the performance of
algorithms in decision tree learning. The algorithm was originally developed by Breiman (2001), who trademarked the term. Random forests improved on his previous technique, bagging (Breiman, 1996). Instead of choosing one optimal decision tree, random forests builds many trees and takes the mode (majority-vote) hypothesis as the result.
Although there is no single best algorithm for every situation (Wolpert & Macready, 1997),
random forests has proved to be a general top performer without requirements for tuning or
adjustment and notably outperforms both bagging and boosting on accuracy and speed
(Banfield et al., 2007; Svetnik et al., 2003; Breiman, 2001).
Breiman (2001) found that random forests favourably shares the noise-proof properties of
bagging. When compared against AdaBoost, random forests showed little deterioration with 5%
noise, whereas AdaBoost’s performance dropped markedly. This is because the Random Forest technique does not increase the weights on specific subsets, so the added noise has negligible effect, whilst AdaBoost’s convergence towards mislabelled examples causes its accuracy to spiral downwards.
This being said, there is always room for improvement and the Random Forest technique is by
no means perfect. The voting mechanism of the decision trees in random forests is one
possible area for improvement (Robnik-Sikonja, 2004). Margin is a measure of how much a
particular hypothesis is favoured over other hypotheses from the Random Forest decision trees.
By weighting each hypothesis vote with the margin, Robnik-Sikonja (2004) found the prediction
accuracy of random forests improves significantly.
Decision trees are a natural choice in the development of machine learning programs. Within decision tree learning there are a number of different algorithms and techniques, including the ones explored here plus others such as CART, CHAID and MARS. Decision trees are important because they perform well with large data sets and are intuitive to use. Furthermore, techniques such
as random forests can improve the robustness, accuracy and speed of the learning method.
3 Unsupervised Machine Learning Algorithms
In unsupervised learning, the onus of learning falls even more heavily on the computer program than on the developer. Where in supervised learning you have a full set of inputs and outputs in your data, in unsupervised learning you only have inputs. The machine learning algorithm
must use this input data alone to extract knowledge. In statistics, the equivalent problem is
known as density estimation; the problem is finding any underlying structures to the unlabelled
data (Alpaydin, 2010).
3.1 Clustering
The main unsupervised learning method is clustering: finding groups within the input data set.
For example, a company may want to group its current customers in order to target each group with relevant new products and services. To do this, the company could take its customer database and use an unsupervised clustering algorithm to divide it into customer segments. The company can then use the results to build better relationships with its customers. In addition to identifying groups, the algorithm will identify outliers that sit outside these groups. These outliers might reveal a niche that wouldn’t otherwise have been noticed.
There are over 100 published clustering algorithms. This review will focus on the two most widely used approaches to clustering: hierarchical clustering and K-means clustering.
3.1.1 Hierarchical Clustering: Agglomerative and Divisive
As the name suggests, hierarchical clustering arranges clusters in a hierarchy. Each level of clusters in the hierarchy is a combination of the clusters below it, whereby the “clusters” at the bottom of the hierarchy are single observations and the top cluster contains the entire data set (Hastie, 2009).
Hierarchical clustering is split into two sub-approaches:
agglomerative (bottom-up) and divisive (top-down) as in
Figure 3.1. In the agglomerative approach, clusters start out
as individual data inputs and are merged into larger
clusters until one cluster containing all the inputs is
reached. Divisive clustering is the reverse: it starts with the cluster containing all data inputs and subdivides into smaller clusters until individual inputs are reached or a termination condition is met, such as the distance between the two closest clusters exceeding a threshold (Kamber, 2000). The most common form of hierarchical clustering is agglomerative. Dendrograms
provide a highly comprehensible way of
interpreting the structure of a hierarchical
clustering algorithm in a graphical format as
illustrated in Figure 3.2.
Agglomerative hierarchical methods are
broken down into single-link methods,
complete-link methods, centroid methods and
more. The difference between these methods is
how the distance between clusters/groups is
measured.
The single-link method, also known as nearest neighbour clustering (Rohlf, 1982), can be
defined by the following linkage distance function $D$ (Gan, 2007):

$D(C, C') = \min_{x \in C,\, y \in C'} d(x, y)$    Eq. 3.1

[Figure 3.1. Agglomerative and divisive hierarchical clustering. (Gan, 2007)]

[Figure 3.2. Dendrogram from an agglomerative (bottom-up) clustering technique based on data on human tumours. (Hastie, 2009)]
where $C$ and $C'$ are two nonempty, non-overlapping clusters. The Euclidean distance (Gan, 2007) in $n$ dimensions is:

$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2}$    Eq. 3.2
This is used in the agglomerative approach to find clusters/groups with the minimum Euclidean
distance between them to join for the next level up in the hierarchy. This procedure repeats
until all clusters are encompassed by one cluster of the entire data set.
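A minimal sketch of this agglomerative single-link procedure follows (the 2-D points are invented for illustration): at each step, the pair of clusters with the smallest nearest-neighbour distance is merged, until one cluster remains.

```python
import math

def euclidean(x, y):
    """Eq. 3.2: straight-line distance in n dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_link(c1, c2):
    """Eq. 3.1: cluster distance = minimum pairwise point distance."""
    return min(euclidean(x, y) for x in c1 for y in c2)

def agglomerate(points):
    """Repeatedly merge the two closest clusters until one remains.
    Returns the merge history (each entry is a newly formed cluster)."""
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = clusters[i] + clusters[j]
        history.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

# Two well-separated groups: the final merge joins them into the whole data set.
history = agglomerate([(0, 0), (0, 1), (10, 10), (10, 11)])
print(history[-1])  # contains all four points
```

The merge history corresponds directly to a dendrogram read from the bottom up: early entries are the tight, low-level clusters and the last entry is the single top-level cluster.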
One of the main reasons hierarchical clustering is such a popular approach is the easily human-
interpretable dendrogram format with which it can be represented (Hastie, 2009). Additionally,
any reasonable method of measuring the distance between clusters can be used provided it can
be applied to matrices. However, hierarchical clustering occasionally encounters difficulty at merge/split points (Kamber, 2000). In a hierarchical structure this is critical, as every step following a merge/split is derived from that decision; if the decision is made poorly, the entire output will be low-quality. A number of hierarchical methods built from the
fundamentals of this approach have been designed to solve the typical issues it is prone to,
including BIRCH (Zhang, Ramakrishnan & Livny, 1997) and CURE (Yun-Tao Qian, Qing-Song
Shi & Qi Wang, 2002).
Hierarchical clustering is a simple but extremely flexible approach for applying unsupervised
learning to any data set. It can be used as an assistive tool to allow specialists to make best use
of their skill. For example, in medical applications such as analysis of EEG graphs, hierarchical
clustering is used to identify and group sections that are alike whilst the neurologist can
evaluate the medical meaning of these areas (Guess & Wilson, 2002). In this way, the work is
delegated to make best use of each individual/component: the computer does the systematic
analysis and the neurologist provides the medical insight.
3.1.2 K-means
K-means is one of the most common approaches to clustering. First demonstrated by MacQueen (1966), it is designed for quantitative data and defines each cluster by a centre point (the mean). The algorithm begins with the initialisation phase where the number of clusters/centres
mean). The algorithm begins with the initialisation phase where the number of clusters/centres
is fixed. Then the algorithm enters the iteration phase, iterating the positions of these centres
until they reach a final central rest position (Gan, 2007). The final rest position occurs when the
error function does not change significantly for further iterations. The algorithm is as follows
(Hastie, 2009):
1. For a given cluster assignment $C$ into $K$ clusters, minimise the total cluster variance of all data inputs with respect to $\{m_1, \dots, m_K\}$, yielding the means of the current clusters.
2. Given the means of the current clusters $\{m_1, \dots, m_K\}$, assign each data input to its closest (current) mean.
3. Repeat until the assignments no longer change.
The function being minimised is as follows (Hastie, 2009):
$C^* = \min_{C,\,\{m_k\}_1^K} \sum_{k=1}^{K} N_k \sum_{C(i)=k} \lVert x_i - m_k \rVert^2$    Eq. 3.3
where $x_i$ represents the data inputs and $N_k = \sum_{i=1}^{N} I(C(i) = k)$ is the number of inputs assigned to cluster $k$. The $N$ data inputs are therefore assigned to the $K$ clusters so that the distance between each data input and its cluster mean is minimised.
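The alternating procedure above can be sketched in minimal form (a toy one-dimensional illustration with invented data and a crude initialisation; a real implementation would operate on vectors and initialise the centres more carefully):

```python
def kmeans(xs, k, max_iters=100):
    """Minimal 1-D K-means: alternate between recomputing cluster means
    (step 1) and reassigning inputs to their nearest mean (step 2) until
    the assignments stop changing (step 3)."""
    means = xs[:k]  # crude initialisation: first k inputs as centres
    assign = None
    for _ in range(max_iters):
        # Step 2: assign each input to the closest current mean.
        new_assign = [min(range(k), key=lambda j: abs(x - means[j])) for x in xs]
        if new_assign == assign:  # Step 3: assignments stable, stop.
            break
        assign = new_assign
        # Step 1: recompute each cluster mean from its assigned inputs.
        for j in range(k):
            members = [x for x, a in zip(xs, assign) if a == j]
            if members:
                means[j] = sum(members) / len(members)
    return means, assign

means, assign = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], k=2)
print(sorted(means))  # [1.0, 9.0]
```

On this toy input the centres settle on the two obvious group means within a few iterations, which is the "final central rest position" described above.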
A key advantage of K-means is that it is computationally efficient even with large data sets: the computational complexity grows linearly with the size of the data set rather than exponentially (Hastie, 2009). However, even with this linear scaling, it can be slow on high-dimensional data beyond a critical size (Harrington, 2012; Hastie, 2009).
The performance of K-means is heavily dependent on the initialisation phase. Not only must the number of clusters $K$ be defined but also the initial positions of the centres. The number of clusters $K$ depends on the goal of the analysis and is usually well defined by the problem, for example, creating $K$ customer segments or employing $K$ sales people. Alternatively, if this information is unavailable, a “rule of thumb” approach commonly taken is to set $K$ in proportion to the number of inputs $n$ in the data set (Mardia, 1979):

$K \approx \sqrt{n/2}$    Eq. 3.4
For the algorithm to perform well, it is important to take a reliable approach to defining the
cluster means. Fortunately, popular solutions to this problem have been proposed: the Forgy approach (Anderberg, 1973), the MacQueen approach (MacQueen, 1966) and the Kaufman approach (Kaufman, 1990). In comparisons, the Kaufman approach has been found to generally produce the best clustering results (Peña, Lozano & Larrañaga, 1999). In the Kaufman approach, the initial cluster means are found iteratively: the starting point is the input data point closest to the centre of the data set, and subsequent centres are chosen as the input data points with the most other data points around them.
One of the earliest applications of K-means was in signal and data processing. For example, it is used for image compression, where a 24-bit image with up to 16 million colours can be compressed to an 8-bit image with only 256 (Alpaydin, 2010). The problem is finding the optimal 256 colours out of the 16 million in order to retain image quality under compression. This is a problem of vector quantisation. K-means is still used for this application today.
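The quantisation step can be illustrated as follows: given a palette of k representative colours (the tiny palette and pixel values here are hypothetical; in practice the palette would be the 256 centres found by running K-means over the image's pixels), each pixel is mapped to the index of its nearest palette colour.

```python
def nearest_palette_index(pixel, palette):
    """Vector quantisation: map a 24-bit RGB pixel to the index of the
    nearest colour in a small palette (squared Euclidean distance)."""
    return min(
        range(len(palette)),
        key=lambda i: sum((p - q) ** 2 for p, q in zip(pixel, palette[i])),
    )

# A hypothetical 4-colour palette; real compression would use the 256
# centres found by K-means over the image's pixels.
palette = [(0, 0, 0), (255, 0, 0), (0, 255, 0), (255, 255, 255)]
pixels = [(250, 10, 5), (12, 240, 30), (200, 200, 210)]
print([nearest_palette_index(p, palette) for p in pixels])  # [1, 2, 3]
```

Storing one small index per pixel instead of a full RGB triple is exactly where the compression comes from.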
[Figure 3.3. A demonstration of iterations by the K-means clustering algorithm for simulated input data points. (Hastie, 2009)]
The standard K-means algorithm serves its purpose well, but suffers from some limitations and
drawbacks. For this reason, it has been modified, extended and improved in numerous
publications (Chen, Ching & Lin, 2004; Wagstaff et al., 2001). The techniques employed include
a) finding better initial solutions (as discussed above), b) modifying the original algorithm and
c) incorporating techniques from other algorithms into K-means. Wagstaff et al. (2001)
recognised that the experimenter running the algorithm is likely to have some background
knowledge of the data set being analysed. By communicating this knowledge to the algorithm, through additional constraints in the clustering process, Wagstaff et al. (2001) improved the accuracy of K-means from 58% to 98.6%. In a separate experiment, Chen, Ching & Lin
(2004) found that incorporating techniques from hierarchical methods into K-means increased
clustering accuracy. This literature shows that K-means is a versatile approach to clustering
which can be tailored to specific problems in order to significantly improve its accuracy.
4 Steps in Developing a Machine Learning Application
So far this review has focused on the theoretical background of machine learning techniques.
This section considers practically applying this theoretical knowledge to data related problems
in any field of work, from collecting data through to use of the application (Harrington, 2012).
4.1 Collect Data
The first step is to collect the data you wish to analyse. Sources may include scraping a website, extracting information from an RSS feed or API, querying existing databases, running an experiment, and other publicly available data sets.
4.2 Choose Algorithm
There are a huge number of machine learning algorithms out there, so how do we choose the
right one? The first decision is between supervised learning and unsupervised learning. If you
are attempting to predict or forecast then you should use supervised learning. You will also need
training data with a set of inputs connected to outputs. Otherwise, you should consider
unsupervised learning. At the next level, choose between regression or classification (supervised
learning) and clustering or density estimation (unsupervised learning). Finally at the last level,
there are tens of different algorithms you could use under each of these categories. There is no
single best algorithm for all problems (Harrington, 2012; Wolpert & Macready, 1997).
Understanding the properties of the algorithms helps narrow the field, but to find the best algorithm for your problem the recommended strategy is to test different algorithms and choose by trial and error (Salter-Townshend et al., 2012).
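The decision sequence described above can be expressed as a small helper function (the function and its parameter names are purely illustrative, not a standard API):

```python
def choose_category(has_labelled_outputs, predicting_value=None, grouping=None):
    """Walk the two-level decision described above: supervised vs
    unsupervised, then the sub-category within each branch."""
    if has_labelled_outputs:
        # Supervised: training data pairs inputs with known outputs.
        return "regression" if predicting_value == "numeric" else "classification"
    # Unsupervised: inputs only, no target outputs.
    return "clustering" if grouping else "density estimation"

print(choose_category(True, predicting_value="numeric"))  # regression
print(choose_category(False, grouping=True))              # clustering
```

The helper only narrows the search to a category; picking a specific algorithm within it is still the trial-and-error step described above.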
4.3 Prepare Data
The next step is to prepare the data in a usable format. Certain algorithms require the features/training data to be formatted in a particular way, but this is trivial. More substantively, the data needs to be cleaned, integrated and selected (Zhang, Zhang & Yang, 2003; Kamber, 2000).
Data cleaning involves filling out any missing values in features of the training data, removing
noise, filtering out outliers and correcting inconsistent data. To fill out missing values, you can
take a biased or an unbiased approach. A biased approach uses a probable value to fill in the missing value, whereas an unbiased approach simply removes the incomplete feature/example. The biased approach is popular when a large proportion of values are missing, since removal would discard too much data. Noise causes random error and variance in the data; it is reduced by binning (Shi & Yu, 2006) or clustering the data in order to isolate and remove outliers.
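As a rough sketch of the binning idea (equal-width bins with each value replaced by its bin mean; the data and bin count are invented for illustration):

```python
def smooth_by_bin_means(values, n_bins):
    """Noise reduction by binning: partition the range into equal-width
    bins and replace every value with the mean of its bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1  # avoid zero width for constant data
    index = lambda v: min(int((v - lo) / width), n_bins - 1)
    bins = {}
    for v in values:
        bins.setdefault(index(v), []).append(v)
    means = {i: sum(vs) / len(vs) for i, vs in bins.items()}
    return [means[index(v)] for v in values]

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# [6.0, 6.0, 19.0, 19.0, 19.0, 27.75, 27.75, 27.75, 27.75]
```

Replacing each value with its bin mean smooths out small random fluctuations while preserving the overall shape of the data; binning by bin medians or boundaries works similarly.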
Data integration is simply merging data from multiple sources. Data selection is the problem of
selecting the right data from the sample to use as the training data set. Generally, the method of selecting data depends heavily on the type of data being filtered; however, Sun et al. (2013) explored an innovative generalised approach using dynamic weights for classification, putting a greater weight on data associated with the most features and eliminating redundant ones, with promising results.
4.4 Train Algorithm
Now that all the data is cleaned and optimised, we can proceed to train the algorithm (for
supervised learning). For unsupervised learning, this stage is just running the algorithm on the
data, as we don’t have target values to train with. For both learning types, this is where the artificially intelligent “machine learning” occurs and where the real value of machine learning algorithms is exploited (Russell, 2010). The output of this step is raw “knowledge”.
4.5 Verify Results
Before using the newfound “knowledge”, it is important to verify/test it. In supervised learning, you can test the model you’ve created against your existing real data set to measure its accuracy. If it is not satisfactory, you can go back to the initial data preparation stages and optimise.
Verifying the accuracy of unsupervised learning algorithms is significantly more challenging
and beyond the scope of this review.
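A minimal sketch of this verification step for supervised learning follows (the hold-out fraction and the trivial threshold "model" are illustrative assumptions, not from the review):

```python
import random

def holdout_accuracy(data, model_fn, test_fraction=0.3, seed=0):
    """Hold back a test set, train on the remainder, and report the
    fraction of held-out examples the trained model labels correctly."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    predict = model_fn(train)
    return sum(predict(x) == y for x, y in test) / len(test)

# Illustrative "model": learns a threshold at the mean of the training inputs.
def threshold_model(train):
    t = sum(x for x, _ in train) / len(train)
    return lambda x: x > t

data = [(x, x > 50) for x in range(100)]
print(holdout_accuracy(data, threshold_model))
```

Measuring accuracy on examples the model never saw during training is what distinguishes genuine generalisation from merely memorising the training data.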
4.6 Use Application
Finally you can use the knowledge evaluated by your algorithm. Depending on the nature of
your machine learning problem, the raw data output may be sufficient or you may choose to
produce visualisations for the results (Leban, 2013).
The beauty of machine learning is that we do not need to program a solution to the problem line by line; the machine learning algorithm learns from the data using statistical analysis instead. But the machine learning algorithm itself still needs to be developed. Fortunately, there is no single piece of software or programming language that you must use to build your machine learning application. The most commonly used languages and environments are Python, Octave, R and Matlab (Ng, 2014; Freitas, 2013; Alfaro, Gamez & Garcia, 2013; Harrington, 2012). Python is one of the most widely used because of its clear syntax, simple text manipulation and established use throughout industries and organisations (Harrington, 2012).
With this information, you are now equipped with the knowledge and practical know-how to
develop a machine learning application.
5 Data Mining
In the last few centuries, innovation in the human species has accelerated rapidly. With the
invention of the World Wide Web and adoption of new technologies on a global scale we are
using technology like never before. The by-product of the Information Age is vast amounts of data, growing beyond terabytes into petabytes and exabytes, with immense hidden value (Goodman, Kamath & Kumar, 2007). The sheer size of these databases and data sets makes it impossible for a human to comprehend or analyse them manually. Data mining is quite literally using machine
learning approaches to extract underlying information and knowledge from data (Kamber,
2000). The knowledge can contribute greatly to business strategies or scientific and medical
research.
The format of the knowledge extracted depends on the machine learning algorithm used. If
supervised learning approaches are applied it is possible to identify patterns in data that can be
used to model it (Kantardzic, 2011). Pattern recognition and learning is one of the most widely
applied uses for data mining and machine learning.
Unsupervised approaches are also used in data mining. Unsupervised learning makes it possible
to identify natural groupings in data. The main application of this in data mining is feature
learning whereby useful features are extracted from a large data set which can then be used for
classification (Coates & Ng, 2012).
Applications of data mining can be seen in medicine, telecommunications, finance, science,
engineering and more. For example in medicine, machine learning is frequently being used to
improve diagnosis of medical conditions such as cancer and schizophrenia. Data mining of
clinical data such as MRI scans allows computers to learn how to recognise cancers and
underlying conditions in new patients more reliably than doctors (Savage, 2012; Michalski, Bratko & Kubat, 1998). In finance, data mining is now being used to
assist evaluation of credit risk of individuals and companies ahead of providing financial support
through loans (Correia et al., 1993). This is arguably the most important stage in the process of
offering a loan but firms have previously struggled to accurately predict the risk of default. With
the large data sets that have been accumulated in this domain, data mining is providing new
insights and patterns to help accurately manage these risks for financial organisations.
Data mining does not yet have any social stigma attached to it. However, there are ethical issues
and social impacts of data mining. For example, web mining involves scraping data from the
internet and mining it for knowledge (Etzioni, 1996). This data can often include personal data
from web users which is used for the profit of organisations (the web miners) (Van Wel &
Royakkers, 2004). Current research suggests that no harm is currently being done to web users
as a result of this, but with the uprising of โbig dataโ there is growing demand for regulation and
ensuring that the power of data mining is used for โgoodโ (Etlinger, 2014). As long as users remain
in control and fully understand the data they offer when using the web, the threat to privacy
can be neutralised. However, the risk of this line of consent and understanding becoming
blurred is high. It is important for governments and organisations to acknowledge this and take
a pro-active approach with regulation.
6 Discussion
6.1 Literature
In writing this review it has become clear that supervised machine learning algorithms simply
apply statistical approaches to data analysis in a scalable way. In fact, one of the best technical
sources of information on regression and gradient descent was a maths textbook (Kreyszig,
2006). It provided a clear explanation of the techniques despite not directly relating them to
machine learning. This has demonstrated that machine learning has come a long way in its
scientific and mathematical approach since originally branching out of artificial intelligence.
The separation originally occurred because statistical analysis fell out of favour in artificial intelligence research. However, it turned out that within these statistical approaches (machine learning) lay the most practical discoveries and applications of all.
Unsupervised learning is perhaps more closely related to artificial intelligence. The frequently
cited textbook by Russell (2010), titled “Artificial Intelligence”, actually served as an excellent source of insight into unsupervised machine learning algorithms, particularly hierarchical algorithms and the K-means approach. This is probably because unsupervised learning deals with the more mysterious type of data, the kind affiliated with artificial intelligence: unlabelled data. Additionally, it seeks to extract knowledge, or “intelligence”, from this data. Unsupervised
learning is particularly applicable to data mining through the application of feature learning.
With feature learning, it is possible to take a huge data set that is uninterpretable by humans and turn it into something on which intricate data analysis can be performed to obtain real value.
It was surprising to find that with just the elementary principles covered in this review it is possible to get started on real machine learning applications, as became apparent when discussing the review with professionals in industry.
6.2 Future Developments
Machine learning is still a new scientific field with huge opportunities for growth and
development. Rather than working on large static data sets, it is important to devise methods
of applying machine learning to transient data and data streams (Gama, 2012). There are
significant challenges to address for maintaining an accurate decision model when the data used
to develop that model is continually changing.
It has become clear that a bias-variance trade-off exists in supervised learning problems (Sharma, Aiken & Nori, 2014). Bias and variance are both sources of error. Ideally the model
should closely fit the training data but also generalise effectively for new data. In past research,
there has been a focus on reducing the variance related error. However, as data sets grow larger
(Cambria et al., 2013), it is important to produce models which fit closely to larger data sets.
Therefore, there is a need to focus more specifically on bias related error.
We now have access to more computational power than ever before. However, when comparing
computing technology to the human brain, there is a clear discrepancy between the two in terms
of how fast data is processed and how much energy is consumed to do so (Norvig, 2012). A
computer can process data 100 million times faster than the brain but requires 20,000 watts of
power to do so. Comparatively, the brain consumes just 20 watts of power to do the same. Yet
machine learning systems are still only just managing to become as effective as the brain. We
need to allocate resources to understanding the brain and using it to inspire circuit and
machinery design in order to make artificial intelligence and learning processes more efficient.
7 Conclusion
There are two main approaches to machine learning: supervised learning and unsupervised
learning. These can be further broken down by different algorithms used to complete supervised
and unsupervised learning tasks. In supervised learning, types of algorithm include regression
and classification (such as gradient descent, ID3, bagging, boosting and random forests). In
unsupervised learning, types of algorithms include hierarchical and K-means clustering.
Machine learning can be applied to facial recognition, medical diagnosis, search engines,
shopping cart recommendation systems and much more. The common indicator of a good
application is that a large source of data exists related to the problem. Machine learning
algorithms can then use their tailored decision making to translate that data into usable
knowledge, producing value.
The process of developing a machine learning algorithm is summarised as follows: start by
collecting data, choose an appropriate algorithm, prepare the data, train the algorithm with
sample data, verify the results and finally apply the knowledge produced by the algorithm.
Data mining is a growing application of machine learning as the World Wide Web and
Information Age have introduced data sets on a scale like never before. Going forward, it is
important to only use data mining ethically and not to the detriment of web users.
As most of the development in machine learning has happened in the past 30 years, there is still
much to be done. We should continue to use the human brain as a North Star in guiding further
research. The goal is to realise true artificial intelligence through improving machine learning
algorithms which may one day compete with the performance of our own brains.
8 References
Akaike, H. (1974) A new look at the statistical model identification. IEEE Transactions on Automatic Control. AC-19 (6), 716-723.
Alfaro, E., Gamez, M. & Garcia, N. (2013) adabag: An R Package for Classification with Boosting and Bagging. Journal of Statistical Software. 54 (2), 1-35.
Allaby, M. (2010) Ockham's razor, A Dictionary of Ecology. Oxford University Press.
Alpaydin, E. (2010) Introduction to machine learning. 2nd edition. Cambridge, Mass. ; London,
MIT Press.
Anderberg, M. R. (1973) Cluster analysis for applications. New York ; London, Academic Press.
Ayodele, T. O. (2010) Types of Machine Learning Algorithms, New Advances in Machine
Learning, Yagang Zhang (Ed.), ISBN: 978-953-307-034-6, InTech.
Banfield, R. E., Hall, L. O., Bowyer, K. W. & Kegelmeyer, K. W. (2007) A comparison of
decision tree ensemble creation techniques. IEEE Transactions on Pattern Analysis and
Machine Intelligence. 29 (1), 173-180.
Bartholomew-Biggs, M. (2008) Nonlinear Optimization with Engineering Applications.
Dordrecht, Springer.
Beyad, Y. & Maeder, M. (2013) Multivariate linear regression with missing values. Analytica
Chimica Acta. 796 (0), 38-41.
Breiman, L. (1996) Bagging predictors. Machine Learning. 24 (2), 123-140.
Breiman, L. (2001) Random Forests. Machine Learning. 45 (1), 5-32.
Cambria, E., Huang, G., Zhou, H., Vong, C., Lin, J., Yin, J., Cai, Z., Liu, Q., Li, K., Feng, L., Ong,
Y., Lim, M., Akusok, A., Lendasse, A., Corona, F., Nian, R., Miche, Y., Gastaldo, P., Zunino, R.,
Decherchi, S., Yang, X., Mao, K., Oh, B., Jeon, J., Toh, K., Kim, J., Yu, H., Chen, Y. & Liu, J.
(2013) Extreme Learning Machines. IEEE Intelligent Systems. 28 (6), 30-59.
Chen, J., Ching, R. K. H. & Lin, Y. (2004) An extended study of the K-means algorithm for data clustering and its applications. Journal of the Operational Research Society. 55 (9), 976-987.
Coates, A. & Ng, A. Y. (2012) Learning feature representations with K-means. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 7700, 561-580.
Correia, J., Costa, E., Ferreira, J. & Jamet, T. (1993) An Application of Machine Learning in the
Domain of Loan Analysis. Lecture Notes in Computer Science. 667, 414-419.
Criminisi, A. & Shotton, J. (2013) Decision Forests for Computer Vision and Medical Image
Analysis. 2013th edition.
Dietterich, T. (2000a) Ensemble methods in machine learning. Multiple Classifier Systems.
1857, 1-15.
Dietterich, T. (2000b) An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning. 40 (2), 139-157.
Etlinger, S. (2014) What do we do with all this big data? TED.com,
https://www.ted.com/talks/susan_etlinger_what_do_we_do_with_all_this_big_data.
Etzioni, O. (1996) The World-Wide Web: Quagmire or Gold Mine? Communications of the ACM. 39 (11), 65-68.
Freitas, N. d. (2013) Machine Learning Lecture Course. University of British Columbia; Oxford University.
Freund, Y. & Schapire, R. E. (1996) Experiments with a new boosting algorithm. ICML. pp.148-
156.
Freund, Y. (1995) Boosting a weak learning algorithm by majority. Information and Computation. 121 (2), 256-285.
Freund, Y. & Schapire, R. E. (1995) A decision-theoretic generalization of on-line learning and an application to boosting. Lecture Notes in Computer Science. 904, 23-37.
Gama, J. (2012) A survey on learning from data streams: current and future trends. Progress in
Artificial Intelligence. 1 (1), 45-55.
Gan, G. (2007) Data clustering : theory, algorithms, and applications. Philadelphia, PA, Society
for Industrial and Applied Mathematics.
Goodman, A., Kamath, C. & Kumar, V. (2007) Statistical analysis and data mining: Data
analysis in the 21st century. Statistical Analysis and Data Mining.
Guess, M. J. & Wilson, S. B. (2002) Introduction to hierarchical clustering. Journal of Clinical
Neurophysiology. 19 (2), 144-151.
Harrington, P. (2012) Machine learning in action. Shelter Island, N.Y., Manning
Publications.
Hastie, T. (2009) The elements of statistical learning : data mining, inference, and prediction.
2nd edition. New York, Springer.
Kamber, M. (2000) Data mining: concepts and techniques. San Francisco; London, Morgan
Kaufmann.
Kantardzic, M. (2011) Data Mining Concepts, Models, Methods, and Algorithms. 2nd edition.
Hoboken, Wiley.
Kaufman, L. (1990) Finding groups in data: an introduction to cluster analysis. New York, Wiley.
Kiwiel, K. C. (2001) Convergence and efficiency of subgradient methods for quasiconvex
minimization. Mathematical Programming, Series B. 90 (1), 1-25.
Kreyszig, E. (2006) Advanced engineering mathematics. 9th, International edition. Hoboken,
N.J., Wiley.
Larose, D. T. (2005) k-Nearest Neighbor Algorithm. Hoboken, NJ, USA.
Leban, G. (2013) Information visualization using machine learning. Informatica (Slovenia). 37
(1), 109-110.
Lemmens, A. & Croux, C. (2006) Bagging and boosting classification trees to predict churn.
Journal of Marketing Research.
Long, P. M. & Servedio, R. A. (2009) Random classification noise defeats all convex potential
boosters. Machine Learning. 1-18.
MacQueen, J. B. (1966) Some methods for classification and analysis of multivariate
observations.
Mardia, K. V. (1979) Multivariate analysis. London, Academic Press.
Mingers, J. (1989) An empirical comparison of selection measures for decision-tree induction.
Machine Learning. 3 (4), 319-342.
Mitchell, T. M. (1997) Machine learning. Boston, Mass., WCB/McGraw-Hill.
Myles, A. J., Feudale, R. N., Liu, Y., Woody, N. A. & Brown, S. D. (2004) An introduction to
decision tree modeling. Journal of Chemometrics. 18 (6), 275-285.
Ng, A. (2014) Machine Learning. Stanford University, coursera.org.
Norvig, P. (2012) Artificial intelligence: A new future. New Scientist. 216 (2889), vi-vii.
Peña, J. M., Lozano, J. A. & Larrañaga, P. (1999) An empirical comparison of four initialization
methods for the K-Means algorithm. Pattern Recognition Letters. 20 (10), 1027-1040.
Pino-Mejías, R., Cubiles-de-la-Vega, M., López-Coello, M., Silva-Ramírez, E. & Jiménez-
Gamero, M. (2004) Bagging Classification Models with Reduced Bootstrap. In: Fred, A., Caelli,
T., Duin, R. W., Campilho, A. & de Ridder, D. (eds.). Springer Berlin Heidelberg. pp. 966-973.
Quinlan, J. R. (1993) C4.5 : programs for machine learning. Amsterdam, Morgan Kaufmann.
Quinlan, J. R. (1986) Induction of decision trees. Machine Learning. 1 (1), 81-106.
Robnik-Sikonja, M. (2004) Improving random forests. Machine Learning: Ecml 2004,
Proceedings. 3201, 359-370.
Rohlf, F. J. (1982) Single-link clustering algorithms. Handbook of Statistics. 2, 267-284.
Russell, S. J. (2010) Artificial intelligence: a modern approach. 3rd, International edition.
Boston, Mass.; London, Pearson.
Michalski, R. S., Bratko, I. & Kubat, M. (1998) Machine learning and data mining:
methods and applications. Chichester, Wiley.
Salter-Townshend, M., White, A., Gollini, I. & Murphy, T. B. (2012) Review of statistical
network analysis: models, algorithms, and software. Statistical Analysis and Data Mining. 5
(4), 243-264.
Savage, N. (2012) Better Medicine Through Machine Learning. Communications of the ACM. 55
(1), 17-19.
Shao, X., Zhang, G., Li, P. & Chen, Y. (2001) Application of ID3 algorithm in knowledge
acquisition for tolerance design. Journal of Materials Processing Tech. 117 (1), 66-74.
Sharma, R., Aiken, A. & Nori, A. V. (2014) Bias-variance tradeoffs in program analysis.
Shi, T. & Yu, B. (2006) Machine Learning and Data Mining - Binning in Gaussian kernel
regularization. Statistica Sinica. 16 (2), 541-568.
Skurichina, M. & Duin, R. P. W. (1998) Bagging for linear classifiers. Pattern Recognition. 31
(7), 909-930.
Snyman, J. A. (2005) Practical Mathematical Optimization An Introduction to Basic
Optimization Theory and Classical and New Gradient-based Algorithms. Dordrecht, Springer-
Verlag New York Inc.
Stigler, S. M. (1981) Gauss and the Invention of Least Squares. The Annals of Statistics. 9 (3),
465-474.
Sun, X., Liu, Y., Chen, H., Han, J., Wang, K. & Xu, M. (2013) Feature selection using dynamic
weights for classification. Knowledge-Based Systems. 37, 541-549.
Svetnik, V., Liaw, A., Tong, C., Culberson, J., Sheridan, R. & Feuston, B. (2003) Random forest:
A classification and regression tool for compound classification and QSAR modeling. Journal
of Chemical Information and Computer Sciences. 43 (6), 1947-1958.
Van Wel, L. & Royakkers, L. (2004) Ethical issues in web data mining. Ethics and Information
Technology. 6 (2), 129-140.
Wagstaff, K., Cardie, C., Rogers, S. & Schrödl, S. (2001) Constrained k-means clustering with
background knowledge. ICML. pp. 577-584.
Wolpert, D. H. & Macready, W. G. (1997) No free lunch theorems for optimization. IEEE
Transactions on Evolutionary Computation. 1 (1), 67-82.
Yan, K., Zhu, J. & Qiang, S. (2007) The application of ID3 algorithm in aviation marketing.
Qian, Y., Shi, Q. & Wang, Q. (2002) CURE-NS: a hierarchical
clustering algorithm with new shrinking scheme.
Zhang, S. C., Zhang, C. Q. & Yang, Q. (2003) Data preparation for data mining. Applied
Artificial Intelligence. 17 (5-6), 375-381.
Zhang, T., Ramakrishnan, R. & Livny, M. (1997) BIRCH: A New Data Clustering Algorithm and
Its Applications. Data Mining and Knowledge Discovery. 1 (2), 141-182.
9 Acknowledgements
The author would like to acknowledge and thank Dr Frederic Cegla (Senior Lecturer at Imperial
College London) for his supervision of this literature review project. The author also thanks
Shaun Dowling (Co-founder at Interpretive.io), Barney Hussey-Yeo (Data Scientist at Wonga),
Ferenc Huszar (Data Scientist at Balderton Capital) and Joseph Root (Co-founder at
Permutive.com) for sharing their insights on machine learning.