MapReduce for Machine Learning
Seminar Report 2015, Dept. of MCA, LBSCEK
1 Introduction
Frequency scaling on silicon, the ability to drive chips at ever higher clock rates, is beginning
to hit a power limit: leakage grows as device geometries shrink, and CMOS consumes power
every time it changes state. Moore's law, the doubling of circuit density every generation, is
still projected to hold for another 10 to 20 years for silicon-based circuits; by doubling the
number of processing cores on a chip rather than the clock rate, one can keep power consumption
low while doubling the speed of many applications. This has forced an industry-wide shift to multicore.
We thus approach an era of increasing numbers of cores per chip, but there is as yet no good
framework for machine learning to take advantage of massive numbers of cores. There are many
parallel programming languages, such as Orca, Occam, ABCL, SNOW, MPI, and PARLOG, but
none of them makes it obvious how to parallelize a particular algorithm. There is a vast
literature on distributed learning and data mining, but very little of it focuses on our goal:
a general means of programming machine learning on multicore. Much of this literature reflects
a long and distinguished tradition of developing (often ingenious) ways to speed up or
parallelize individual learning algorithms, for instance cascaded parallelization techniques
for machine learning; more pragmatically, however, such specialized implementations of popular
algorithms rarely lead to widespread use. Some examples of more general papers are: Caragea et
al. give some general data-distribution conditions for parallelizing machine learning, but
restrict their focus to decision trees; Jin and Agrawal give a general machine learning
programming approach, but only for shared-memory machines. This does not fit the architecture
of cellular or grid-type multiprocessors, where each core has a local cache, even if it can be
dynamically reallocated.
In this paper, we focus on developing a general and exact technique for parallel
programming of a large class of machine learning algorithms on multicore processors. The
central idea of this approach is to allow a future programmer or user to speed up machine
learning applications by "throwing more cores" at the problem rather than searching for
specialized optimizations. This paper's contributions are: (i) We show that any algorithm
fitting the Statistical Query Model may be written in a certain "summation form." This form
does not change the underlying algorithm and so is not an approximation, but is instead an
exact implementation. (ii) The summation form does not depend on, but can be easily expressed
in, a map-reduce framework which is easy to program in. (iii) This technique achieves basically
linear speed-up with the number of cores.
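To make the summation form concrete, consider the following minimal sketch (our own illustration, not code from the paper). Least-squares linear regression fits this mold: the normal equations A w = b are built from per-datum terms A = sum_i x_i x_i^T and b = sum_i x_i y_i, so map tasks can compute partial sums over data chunks and a single reduce adds them up.

import numpy as np

def map_partial_sums(chunk):
    # Map task: this chunk's contribution to A = sum x x^T and b = sum x y.
    X, y = chunk
    return X.T @ X, X.T @ y

def reduce_sums(partials):
    # Reduce task: add up the per-chunk contributions.
    partials = list(partials)
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return A, b

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=1000)
chunks = [(X[i:i + 250], y[i:i + 250]) for i in range(0, 1000, 250)]  # 4 simulated cores

A, b = reduce_sums(map(map_partial_sums, chunks))
w = np.linalg.solve(A, b)  # identical to the single-machine least-squares solution

Solving the summed system gives exactly the single-machine solution, which is why the summation form is an exact implementation rather than an approximation.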
2 Machine learning
"Machine learning" sounds mysterious to most people. Indeed, only a small fraction of
professionals really know what it stands for, and there is a good reason for this: the field is
rather technical and difficult to explain to a layman. However, we would like to bridge this gap
and explain a bit about what machine learning (ML) is and how it can be used in everyday
life or business.
So what is this mysterious ML?
Machine learning can refer to:
• a branch of artificial intelligence;
• the methods used in this field (there are a variety of different approaches).
Overall, speaking of the latter, Tom Mitchell, author of the well-known book "Machine
Learning," defines ML as "improving performance in some task with experience." However, this
definition is quite broad, so we can quote another, more specific description stating that ML
deals with systems that can learn from data.
ML works with data, processing it to discover patterns that can later be used to analyze new
data. ML usually relies on a specific representation of the data: a set of "features" that are
understandable to a computer. For example, a text might be represented through the words it
contains, or through other characteristics such as the length of the text, the number of
emotional words, and so on. This representation depends on the task you are dealing with, and
producing it is typically referred to as "feature extraction".
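As a small illustration (the feature names and the word list below are invented for this sketch), a feature extractor for text might look like this:

def extract_features(text):
    # Hypothetical features: raw length, token count, emotional-word count.
    emotional_words = {"love", "hate", "great", "awful"}
    tokens = text.lower().split()
    return {
        "length": len(text),
        "num_tokens": len(tokens),
        "num_emotional": sum(t in emotional_words for t in tokens),
    }

print(extract_features("I love this great seminar"))
# -> {'length': 25, 'num_tokens': 5, 'num_emotional': 2}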
Types of ML
All ML tasks can be classified into several categories; the main ones are:
• supervised ML;
• unsupervised ML;
• reinforcement learning.
Now let us explain in simple words the kind of problems dealt with by each category.
Supervised ML relies on data for which the true label/class is indicated. This is easier to explain
using an example. Imagine that we want to teach a computer to distinguish pictures of cats
and dogs. We can ask some of our friends to send us pictures of cats and dogs, adding a tag 'cat'
or 'dog'. Labeling is usually done by human annotators to ensure high-quality data. Now that
we know the true labels of the pictures, we can use this data to "supervise" our algorithm in
learning the right way to classify images. Once our algorithm learns how to classify images, we
can use it on new data and predict labels ('cat' or 'dog' in our case) for previously unseen
images.
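A toy sketch of this workflow (with made-up numeric features standing in for real image features, and a simple nearest-centroid rule standing in for a real classifier) might look like this:

import numpy as np

# Labeled training data: (features, label) pairs supplied by human annotators.
train_X = np.array([[4.0, 1.0], [4.5, 0.8], [9.0, 3.0], [8.5, 3.2]])
train_y = np.array(["cat", "cat", "dog", "dog"])

# "Learning": compute one centroid per class.
centroids = {label: train_X[train_y == label].mean(axis=0)
             for label in np.unique(train_y)}

def predict(x):
    # Assign the label of the nearest class centroid.
    return min(centroids, key=lambda label: np.linalg.norm(x - centroids[label]))

print(predict(np.array([4.2, 0.9])))  # -> cat
print(predict(np.array([8.8, 3.1])))  # -> dog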
2.1 Applications to Machine Learning
Many standard machine learning algorithms follow one of a few canonical data processing
patterns, which we outline below. A large subset of these can be phrased as MapReduce tasks,
illuminating the benefits that the MapReduce framework offers to the machine learning
community.
In this section, we investigate the performance trade-offs of using MapReduce from an
algorithm-centric perspective, considering in turn three classes of ML algorithms and the issues of adapting
each to a MapReduce framework. The performance that results depends intimately on the design
choices underlying the MapReduce implementation, and how well those choices support the data
processing pattern of the ML algorithm. We conclude this section with a discussion of changes
and extensions to the Hadoop MapReduce implementation that would benefit the machine
learning community.
2.1.1 A Taxonomy of Standard Machine Learning Algorithms
While ML algorithms can be classified on many dimensions, the one we take primary interest in
here is that of procedural character: the data processing pattern of the algorithm. Here, we
consider single-pass, iterative and query-based learning techniques, along with several example
algorithms and applications.
2.1.2 Single-pass Learning
Many ML applications make only one pass through a data set, extracting relevant statistics for
later use during inference. This relatively simple learning setting arises often in natural language
processing, from machine translation to information extraction to spam filtering. These
applications often fit perfectly into the MapReduce abstraction, assigning the extraction of
local contributions to the map task and then combining those contributions to compute relevant
statistics about the dataset as a whole. Consider the following examples, illustrating common
decompositions of these statistics.
Estimating Language Model Multinomials: Extracting language models from a large corpus
amounts to little more than counting n-grams, though some parameter smoothing over the
statistics is also common. The map phase enumerates the n-grams in each training instance
(typically a sentence or paragraph), and the reduce function counts the instances of each
n-gram. (This option has been investigated as part of Alex Rasmussen's Hadoop-related CS 262
project this semester.)
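A local simulation of this counting pattern, in plain Python rather than the Hadoop API, might look as follows (the corpus and the choice of bigrams are invented for the example):

from collections import Counter
from itertools import chain

corpus = ["the cat sat", "the cat ran", "a dog sat"]

def map_ngrams(sentence, n=2):
    # Map: emit (n-gram, 1) pairs for one training instance.
    words = sentence.split()
    return [(tuple(words[i:i + n]), 1) for i in range(len(words) - n + 1)]

def reduce_counts(pairs):
    # Reduce: sum the counts for each n-gram key.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

counts = reduce_counts(chain.from_iterable(map_ngrams(s) for s in corpus))
print(counts[("the", "cat")])  # -> 2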
Feature Extraction for Naive Bayes Classifiers: Estimating the parameters of a naive Bayes
classifier, or of any fully observed Bayes net, again requires counting occurrences in the
training data. In this case, however, feature extraction is often computation-intensive,
perhaps involving small search or optimization problems for each datum. The reduce task,
however, remains a summation over occurrences of each (feature, label) pair.
Syntactic Translation Modeling: Generating a syntactic model for machine translation is an
example of a research-level machine learning application that involves only a single pass through
a preprocessed training set. Each training datum consists of a pair of sentences in two languages,
an estimated alignment between the words in each, and an estimated syntactic parse tree for one
sentence. (Generating appropriate training data for this task itself involves several
applications of iterative learning algorithms, described in the following section.) The
per-datum feature extraction encapsulated in the map phase for this task involves search over
these coupled data structures.
2.1.3 Iterative Learning
The class of iterative ML algorithms – perhaps the most common within the machine learning
research community – can also be expressed within the framework of MapReduce by chaining
together multiple MapReduce tasks. While such algorithms vary widely in the type of operation
they perform on each datum (or pair of data) in a training set, they share the common
characteristic that a set of parameters is matched to the data set via iterative improvement.
The update to these parameters across iterations must again decompose into per-datum
contributions, as is the case for the example applications below. As with the examples
discussed in the previous section, the reduce function is considerably less compute-intensive
than the map tasks.
In the examples below, the contribution to the parameter updates from each datum (the map
function) depends in a meaningful way on the output of the previous iteration. For example, the
expectation computation of EM, or the inference computation in an SVM or perceptron classifier,
can reference a large portion or all of the parameters generated by the algorithm. Hence, these
parameters must remain available to the map tasks in a distributed environment. The information
necessary to compute the map step of each algorithm is described below; the complications that
arise because this information is vital to the computation are investigated later in the paper.
Expectation Maximization (EM): The well-known EM algorithm maximizes the likelihood of
a training set given a generative model with latent variables. The E-step of the algorithm
computes posterior distributions over the latent variables given current model parameters and the
observed data. The maximization step adjusts model parameters to maximize the likelihood of
the data assuming that latent variables take on their expected values. Projecting onto the
MapReduce framework, the map task computes posterior distributions over the latent variables
of a datum using current model parameters; the maximization step is performed as a single
reduction, which sums the sufficient statistics and normalizes to produce updated parameters.
We consider applications in machine translation and speech recognition. For multivariate
Gaussian mixture models (e.g., for speaker identification), these parameters are simply the
mean vectors and covariance matrices. For HMM-GMM models (e.g., speech recognition),
parameters are also needed to specify the state transition probabilities; the models,
efficiently stored in binary form, occupy tens of megabytes. For word alignment models (e.g.,
machine translation), these parameters include word-to-word translation probabilities; these
can number in the millions, even after pruning heuristics remove unnecessary parameters.
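The following sketch (our own simplified illustration with one-dimensional data, fixed unit variances, and equal mixing weights, rather than the speech or translation models above) shows how one EM iteration decomposes: each map call produces its chunk's sufficient statistics, and the reduce sums them and re-estimates the means.

import numpy as np

def e_step_map(chunk, means, var=1.0):
    # Map: per-datum posteriors over the two components, reduced to this
    # chunk's sufficient statistics (soft counts and weighted sums).
    d = -(chunk[:, None] - means[None, :]) ** 2 / (2 * var)
    r = np.exp(d)
    r /= r.sum(axis=1, keepdims=True)
    return r.sum(axis=0), (r * chunk[:, None]).sum(axis=0)

def m_step_reduce(partials):
    # Reduce: sum sufficient statistics across chunks, re-estimate the means.
    counts = sum(p[0] for p in partials)
    sums = sum(p[1] for p in partials)
    return sums / counts

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])
chunks = np.array_split(data, 4)   # simulate distribution over 4 map tasks

means = np.array([-1.0, 1.0])      # initial parameters, broadcast to every map task
for _ in range(10):                # each pass corresponds to one chained MapReduce job
    means = m_step_reduce([e_step_map(c, means) for c in chunks])
print(means)                       # approximately [-3, 3]

Each pass of the loop stands in for one MapReduce job, with the current means broadcast to every map task, which is exactly the state-distribution requirement discussed above.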
Discriminative Classification and Regression: When fitting model parameters via a
perceptron, boosting, or support vector machine algorithm for classification or regression, the
map stage of training will involve computing inference over each training example given the
current model parameters. Similar to the EM case, a subset of the parameters from the previous
iteration must be available for inference, while the reduce stage typically involves summing
over parameter changes.
Thus, all relevant model parameters must be broadcast to each map task. In a typical featurized
setting, which often extracts hundreds or thousands of features from each training example,
the relevant parameter space needed for inference can be quite large.
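As a rough sketch of this pattern (a batch perceptron variant chosen for brevity, not a production algorithm), one training iteration can map each data chunk to a summed parameter-update contribution and reduce by adding the contributions:

import numpy as np

def map_updates(chunk, w):
    # Map: sum of update contributions from misclassified examples in a chunk,
    # given the broadcast parameter vector w.
    X, y = chunk
    mask = y * (X @ w) <= 0
    return (y[mask, None] * X[mask]).sum(axis=0)

def reduce_updates(partials):
    # Reduce: total update across all chunks.
    return sum(partials)

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
y = np.sign(X @ np.array([2.0, -1.0, 0.5]))   # separable toy labels

chunks = [(X[i:i + 100], y[i:i + 100]) for i in range(0, 400, 100)]
w = np.zeros(3)
for _ in range(20):                            # one MapReduce job per iteration
    w += 0.1 * reduce_updates([map_updates(c, w) for c in chunks])
print((np.sign(X @ w) == y).mean())            # training accuracy, typically near 1.0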
2.1.4 Query-based Learning with Distance Metrics
Finally, we consider distance-based ML applications that directly reference the training set
during inference, such as the nearest-neighbor classifier. In this setting, the training data are the
parameters, and a query instance must be compared to each training datum.
While it is the case that multiple query instances can be processed simultaneously within a
MapReduce implementation of these techniques, the query set must be broadcast to all map
tasks. Again, we have a need for the distribution of state information. However, in this case, the
query information that must be distributed to all map tasks needn’t be processed concurrently – a
query set can be broken up and processed over multiple MapReduce operations. In the examples
below, each query instance tends to be of a manageable size.
K-nearest Neighbors Classifier: The nearest-neighbor classifier compares each element of a query
set to each element of a training set, and discovers examples with minimal distances from the
queries. The map stage computes distance metrics, while the reduce stage tracks k examples for
each label that have minimal distance to the query.
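A compact local simulation of this decomposition (with made-up points and labels) could be:

import heapq
import numpy as np

train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
labels = ["a", "a", "b", "b"]
query = np.array([5.5, 5.2])

def map_distances(offset, chunk):
    # Map: distance from the (broadcast) query to each training datum.
    return [(float(np.linalg.norm(x - query)), labels[offset + i])
            for i, x in enumerate(chunk)]

def reduce_knn(all_pairs, k=3):
    # Reduce: keep the k smallest distances overall.
    return heapq.nsmallest(k, all_pairs)

chunks = [(0, train[:2]), (2, train[2:])]   # two simulated map tasks
pairs = [p for offset, c in chunks for p in map_distances(offset, c)]
print(reduce_knn(pairs))                    # the nearest neighbors carry label 'b'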
Similarity-based Search: Finding the most similar instances to a given query has a similar
character, sifting through the training set to find the examples that minimize a distance
metric. Computing the distances is the map stage, while minimizing them is the reduce stage.
2.2 Performance and Implementation Issues
While the algorithms discussed above can all be implemented in parallel using the MapReduce
abstraction, our example applications from each category revealed a set of implementation
challenges.
We conducted all of our experiments on top of the Hadoop platform. In the discussion below, we
will address issues related both to the Hadoop implementation of MapReduce and to the
MapReduce framework itself.
2.2.1 Single-pass Learning
The single-pass learning algorithms described in the previous section are clearly amenable to the
MapReduce framework. We will focus here on the task of generating a syntactic translation
model from a set of sentence pairs, their word-level bilingual alignments, and their syntactic
structures.
Fig. 1: Running time in seconds versus training set size in thousands of sentence pairs, for local MapReduce, 3-node MapReduce, and the local reference implementation. The benefit of distributed computation quickly outweighs the overhead of a MapReduce implementation on a 3-node cluster.
Figure 1 shows the running times for various input sizes, demonstrating the overhead of running
MapReduce relative to the reference implementation. The cost of running Hadoop offsets some of
the benefit of parallelizing the code. Specifically, running on 3 machines gave a speed-up of
39% over the reference implementation, while the overhead of simulating a MapReduce computation
on a single machine was 51% of the compute cost of the reference implementation. Distributing
the task to a large cluster would clearly justify this overhead, but parallelizing to two
machines would give virtually no benefit for the largest data set size we tested.
A more promising metric shows that, as the size of the data scales, the distributed MapReduce
implementation maintains a low per-datum cost. We can isolate the variable-cost overhead of
each example by comparing the slopes of the curves in Figure 1, which are all near-linear. The
reference implementation shows a variable computation cost of 1.7 seconds per 1000 examples,
while the distributed implementation across three machines shows a cost of 0.5 seconds per 1000
examples, roughly a 3.4x per-datum speed-up. So the variable overhead of MapReduce is minimal
for this task, while the static overhead of distributing the code base and channeling the
processing through Hadoop's infrastructure is large. We would expect that substantially
increasing the size of the training set would accentuate the utility of distributing the
computation.
Thus far, we have assumed that the training data was already written to Hadoop's distributed
file system (DFS). The cost of distributing data is relevant in this setting, however, due to drawbacks
of Hadoop’s implementation of DFS. In the simple case of text processing, a training corpus
need only be distributed to Hadoop’s file system once, and can be processed by many different
applications.
On the other hand, this application references four different input sources, including sentences,
alignments and parse trees for each example. When copying these resources independently to the
DFS, the Hadoop implementation gives no control over how those files are mapped to remote
machines. Hence, no one machine necessarily contains all of the data necessary for a given
example.
Apache Mahout is a framework for implementing machine learning in the MapReduce paradigm.
3 Apache Mahout
Apache Mahout is a project of the Apache Software Foundation to produce free implementations
of distributed or otherwise scalable machine learning algorithms, focused primarily on the
areas of collaborative filtering, clustering, and classification. Many of the implementations
use the Apache Hadoop platform. Mahout also provides Java libraries for common math operations
(focused on linear algebra and statistics) and primitive Java collections. Mahout is a work in
progress; the number of implemented algorithms has grown quickly, but various algorithms are
still missing.
While Mahout's core algorithms for clustering, classification, and batch-based collaborative
filtering are implemented on top of Apache Hadoop using the map/reduce paradigm, the project
does not restrict contributions to Hadoop-based implementations. Contributions that run on a
single node or on a non-Hadoop cluster are also welcome. For example, the 'Taste'
collaborative-filtering recommender component of Mahout was originally a separate project and
can run stand-alone without Hadoop. Integration with Pregel-like graph-processing initiatives
is actively under discussion.
Mahout's algorithms include many new implementations built for speed on Mahout-Samsara. These
run on Spark, and some on H2O, which can mean as much as a 10x speed increase. You'll find
robust matrix decomposition algorithms as well as a naive Bayes classifier and collaborative
filtering. The new spark-itemsimilarity job enables the next generation of co-occurrence
recommenders, which can use entire user click streams and context in making recommendations.
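To give a flavor of what a co-occurrence recommender computes (a bare-bones illustration, not Mahout's actual spark-itemsimilarity implementation), one can count how often pairs of items appear together across users' interaction histories:

from collections import Counter
from itertools import combinations

user_items = {
    "u1": {"apple", "banana", "cherry"},
    "u2": {"apple", "banana"},
    "u3": {"banana", "cherry"},
}

cooccur = Counter()
for items in user_items.values():
    for a, b in combinations(sorted(items), 2):   # each unordered pair once per user
        cooccur[(a, b)] += 1

# Items most often seen together with 'banana':
print([(pair, n) for pair, n in cooccur.most_common() if "banana" in pair])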
3.1 Mahout overview
Mahout is scalable along several dimensions:
• Scalable to reasonably large data sets: the core algorithms for clustering, classification,
and batch-based collaborative filtering are implemented on top of Apache Hadoop using the
map/reduce paradigm. However, contributions are not restricted to Hadoop-based implementations;
contributions that run on a single node or on a non-Hadoop cluster are welcome as well, and the
core libraries are highly optimized to give good performance for non-distributed algorithms too.
• Scalable to support your business case: Mahout is distributed under the commercially friendly
Apache Software License.
• Scalable community: the goal of Mahout is to build a vibrant, responsive, diverse community
that facilitates discussion not only of the project itself but also of potential use cases.
Come to the mailing lists to find out more.
Currently Mahout supports mainly four use cases:
• Recommendation mining takes users' behavior and from that tries to find items users might like.
• Clustering takes, e.g., text documents and groups them into sets of topically related documents.
• Classification learns from existing categorized documents what documents of a specific
category look like, and is able to assign unlabeled documents to the (hopefully) correct category.
• Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart
content) and identifies which individual items usually appear together.
3.1.1 Apache Mahout Integration
The flexibility and power of Apache Mahout (http://mahout.apache.org/) in conjunction with
Hadoop is invaluable. Therefore, I have packaged the most recent stable release of Mahout (0.5),
and am very excited to work with the Mahout community, becoming much more involved with the
project as both Mahout and Hadoop continue to grow.
3.1.2 Why we are packaging Mahout with Hadoop
Machine learning is an entire field devoted to information retrieval, statistics, linear
algebra, analysis of algorithms, and many other subjects. This field allows us to examine
things such as recommendation engines involving new friends, love interests, and new products.
We can do incredibly advanced analysis around genetic sequencing and examination, distributed
search and frequent-pattern matching, as well as mathematical analysis with vectors, matrices,
and singular value decomposition (SVD). Apache Mahout is an open source project of the Apache
Software Foundation devoted to machine learning. Mahout can operate on top of Hadoop, which
allows the user to apply machine learning, via a selection of algorithms in Mahout, to
distributed computing on Hadoop. Mahout packages popular machine learning algorithms such as:
 Recommendation mining takes users' behavior and finds items that a given user might like.
 Clustering takes, e.g., text documents and groups them based on related document topics.
 Classification learns from existing categorized documents what documents of a specific
category look like, and is able to assign unlabeled documents to the appropriate category.
 Frequent itemset mining takes a set of item groups (e.g., terms in a query session, shopping
cart content) and identifies which individual items typically appear together.
We are very excited to be working with the Apache Mahout community and highly encourage
everyone who is currently using CDH to give Mahout a try! As always, we are open to any
guests who would like to blog about their experience using Mahout with CDH.
3.2 Installing Mahout
Mahout is a collection of highly scalable machine learning algorithms for very large data
sets. Although the real power of Mahout can be vouched for only on large HDFS data, Mahout
also supports running algorithms on local file system data, which can help you get a feel for
how to run Mahout algorithms. Before you can run any Mahout algorithm, you need a Mahout
installation ready on your Linux machine, which can be set up as described below.
Step 1:
Download mahout-distribution-0.x.tar.gz from the Apache download mirrors and extract the
contents of the package to a location of your choice; here we pick /usr/local. Make sure to
change the owner of all the files to the hduser user and hadoop group, for example:
$ cd /usr/local
$ sudo tar xzf mahout-distribution-0.x.tar.gz
$ sudo mv mahout-distribution-0.x mahout
$ sudo chown -R hduser:hadoop mahout
This should result in a folder named
/usr/local/mahout
Now you can run any of the algorithms using the script bin/mahout in the extracted
folder. To test your installation, you can also run
$ bin/mahout
without any other arguments.
Now we will set the path in the .bashrc file:
export MAHOUT_HOME=/usr/local/mahout
export PATH=$PATH:$MAHOUT_HOME/bin
Step 2:
Create a directory where you want to check out the Mahout code; here we call it /app/mahout:
$ sudo mkdir -p /app/mahout
$ sudo chown hduser:hadoop /app/mahout
# ...and if you want to tighten up security, chmod from 755 to 750:
$ sudo chmod 750 /app/mahout
Step 3:
Now we point Hadoop at the Mahout libraries in hadoop-env.sh, for example by extending the
Hadoop classpath:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/local/mahout/lib/*
Step 4:
Now we install Maven:
$ sudo tar xzf apache-maven-2.0.9-bin.tar.gz
$ sudo mv apache-maven-2.0.9 maven
$ sudo chown -R hduser:hadoop maven
Now we will set the path in the .bashrc file:
export M2_HOME=/usr/local/maven
export PATH=$PATH:$M2_HOME/bin
Now we can start Mahout with the bin/mahout command; with that, the Mahout installation is
complete, and we can use Mahout with our Hadoop configuration.
Fig. 2: Output showing the installed Maven version.
4 Advantages and Disadvantages
Advantages
The paper "Map-Reduce for Machine Learning on Multicore" shows 10 machine learning
algorithms, which can benefit from map reduce model. The key point is "any algorithm fitting
the Statistical Query Model may be written in a certain “summation form.”, and the algorithms
can be expressed as summation form can apply map reduce programming model.
Disadvantages
MapReduce does not work well when there are computational dependencies in the data. This
limitation makes it difficult to represent algorithms that operate on structured models.
As a consequence, when confronted with large-scale problems, we often abandon rich structured
models in favor of overly simplistic methods that are amenable to the MapReduce abstraction.
In the machine learning community, numerous algorithms iteratively transform parameters during
both learning and inference, e.g., belief propagation, expectation maximization, gradient
descent, and Gibbs sampling. These algorithms iteratively refine a set of parameters until some
termination criterion is met.
Invoking MapReduce in each iteration can still speed up the computation. The point, however, is
that we want a better abstraction framework that makes it possible to embrace the graphical
structure of the data, to express sophisticated scheduling, and to assess termination automatically.
5 Conclusions
By virtue of its simplicity and fault tolerance, MapReduce proved to be an admirable
gateway to parallelizing machine learning applications. The benefits of easy development
and robust computation did come at a price in terms of performance, however, and performance
varied substantially with how well the system was tuned.
Defining common settings for machine learning algorithms led us to discover the core
shortcomings of Hadoop's MapReduce implementation. We were able to address the most
significant one: the need to broadcast data to map tasks. We also greatly improved the
convenience of running MapReduce jobs from within existing Java applications. However,
the ability to tie together the distribution of parallel files on the DFS remains an
outstanding challenge.
All in all, MapReduce represents a promising direction for future machine learning
implementations. As we continue to develop Hadoop and the tools that surround it, we must
strive to minimize the compromises between convenience and performance, providing a
platform that allows for both efficient processing and rapid application development.
6 References
 http://mahout.apache.org/
 https://chameerawijebandara.wordpress.com/2014/01/03/install-mahout-in-ubuntu-for-beginners/
 http://nivirao.blogspot.com/2012/04/installing-apache-mahout-on-ubuntu.html
 https://help.ubuntu.com/community/Java
 http://mahout.apache.org/developers/buildingmahout.html
More Related Content

What's hot

Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...Mumbai Academisc
 
Meetup sthlm - introduction to Machine Learning with demo cases
Meetup sthlm - introduction to Machine Learning with demo casesMeetup sthlm - introduction to Machine Learning with demo cases
Meetup sthlm - introduction to Machine Learning with demo casesZenodia Charpy
 
RapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerRapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerDataminingTools Inc
 
Graph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media AnalyticsGraph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media AnalyticsNYC Predictive Analytics
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningRahul Jain
 
Machine learning
Machine learningMachine learning
Machine learningeonx_32
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learningshivani saluja
 
Primer to Machine Learning
Primer to Machine LearningPrimer to Machine Learning
Primer to Machine LearningJeff Tanner
 
New Directions in Mahout's Recommenders
New Directions in Mahout's RecommendersNew Directions in Mahout's Recommenders
New Directions in Mahout's Recommenderssscdotopen
 
Pareto depth for multiple-query image retrieval
Pareto depth for multiple-query image retrievalPareto depth for multiple-query image retrieval
Pareto depth for multiple-query image retrievaljpstudcorner
 
Interpretable machine learning
Interpretable machine learningInterpretable machine learning
Interpretable machine learningSri Ambati
 
Deep learning at nmc devin jones
Deep learning at nmc devin jones Deep learning at nmc devin jones
Deep learning at nmc devin jones Ido Shilon
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutTed Dunning
 
Journey to learn Machine Learning & Neural Network - Basics
Journey to learn Machine Learning & Neural Network - BasicsJourney to learn Machine Learning & Neural Network - Basics
Journey to learn Machine Learning & Neural Network - BasicsArocom IT Solutions Pvt. Ltd
 
Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21GĂźlden BilgĂźtay
 
Software defect estimation using machine learning algorithms
Software defect estimation using machine learning algorithmsSoftware defect estimation using machine learning algorithms
Software defect estimation using machine learning algorithmsVenkat Projects
 
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...Editor IJCATR
 

What's hot (20)

C3 w2
C3 w2C3 w2
C3 w2
 
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
 
Meetup sthlm - introduction to Machine Learning with demo cases
Meetup sthlm - introduction to Machine Learning with demo casesMeetup sthlm - introduction to Machine Learning with demo cases
Meetup sthlm - introduction to Machine Learning with demo cases
 
RapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerRapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid Miner
 
Graph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media AnalyticsGraph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media Analytics
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Machine learning
Machine learningMachine learning
Machine learning
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
C3 w3
C3 w3C3 w3
C3 w3
 
Primer to Machine Learning
Primer to Machine LearningPrimer to Machine Learning
Primer to Machine Learning
 
New Directions in Mahout's Recommenders
New Directions in Mahout's RecommendersNew Directions in Mahout's Recommenders
New Directions in Mahout's Recommenders
 
Pareto depth for multiple-query image retrieval
Pareto depth for multiple-query image retrievalPareto depth for multiple-query image retrieval
Pareto depth for multiple-query image retrieval
 
Interpretable machine learning
Interpretable machine learningInterpretable machine learning
Interpretable machine learning
 
Deep learning at nmc devin jones
Deep learning at nmc devin jones Deep learning at nmc devin jones
Deep learning at nmc devin jones
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache Mahout
 
E0322035037
E0322035037E0322035037
E0322035037
 
Journey to learn Machine Learning & Neural Network - Basics
Journey to learn Machine Learning & Neural Network - BasicsJourney to learn Machine Learning & Neural Network - Basics
Journey to learn Machine Learning & Neural Network - Basics
 
Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21
 
Software defect estimation using machine learning algorithms
Software defect estimation using machine learning algorithmsSoftware defect estimation using machine learning algorithms
Software defect estimation using machine learning algorithms
 
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
 

Viewers also liked

Hadoop in Data Warehousing
Hadoop in Data WarehousingHadoop in Data Warehousing
Hadoop in Data WarehousingAlexey Grigorev
 
Social Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data TechnologiesSocial Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data TechnologiesNicolas Morales
 
Predicting Consumer Behaviour via Hadoop
Predicting Consumer Behaviour via HadoopPredicting Consumer Behaviour via Hadoop
Predicting Consumer Behaviour via HadoopSkillspeed
 
Big Data & Analytics MapReduce/Hadoop – A programmer’s perspective
Big Data & Analytics MapReduce/Hadoop – A programmer’s perspectiveBig Data & Analytics MapReduce/Hadoop – A programmer’s perspective
Big Data & Analytics MapReduce/Hadoop – A programmer’s perspectiveEMC
 
Real-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionReal-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionRevolution Analytics
 
Design patterns in MapReduce
Design patterns in MapReduceDesign patterns in MapReduce
Design patterns in MapReduceAkhilesh Joshi
 
Daniel Abadi HadoopWorld 2010
Daniel Abadi HadoopWorld 2010Daniel Abadi HadoopWorld 2010
Daniel Abadi HadoopWorld 2010Daniel Abadi
 
Predictive Analytics on Big Data. DIY or BUY?
Predictive Analytics on Big Data. DIY or BUY?Predictive Analytics on Big Data. DIY or BUY?
Predictive Analytics on Big Data. DIY or BUY?Apigee | Google Cloud
 
Population Health Management, Predictive Analytics, Big Data and Text Analytics
Population Health Management, Predictive Analytics, Big Data and Text AnalyticsPopulation Health Management, Predictive Analytics, Big Data and Text Analytics
Population Health Management, Predictive Analytics, Big Data and Text AnalyticsFrank Wang
 
Evaluating Big Data Predictive Analytics Platforms
Evaluating Big Data Predictive Analytics PlatformsEvaluating Big Data Predictive Analytics Platforms
Evaluating Big Data Predictive Analytics PlatformsTeradata Aster
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RDatabricks
 
Big data and Predictive Analytics By : Professor Lili Saghafi
Big data and Predictive Analytics By : Professor Lili SaghafiBig data and Predictive Analytics By : Professor Lili Saghafi
Big data and Predictive Analytics By : Professor Lili SaghafiProfessor Lili Saghafi
 
Informatica Pentaho Etl Tools Comparison
Informatica Pentaho Etl Tools ComparisonInformatica Pentaho Etl Tools Comparison
Informatica Pentaho Etl Tools ComparisonRoberto Espinosa
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingMohammad Mustaqeem
 
Big Data & Artificial Intelligence
Big Data & Artificial IntelligenceBig Data & Artificial Intelligence
Big Data & Artificial IntelligenceZavain Dar
 
Predictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligencePredictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligenceManish Jain
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 

Viewers also liked (20)

Pig Experience
Pig ExperiencePig Experience
Pig Experience
 
Hadoop in Data Warehousing
Hadoop in Data WarehousingHadoop in Data Warehousing
Hadoop in Data Warehousing
 
Social Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data TechnologiesSocial Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data Technologies
 
Predicting Consumer Behaviour via Hadoop
Predicting Consumer Behaviour via HadoopPredicting Consumer Behaviour via Hadoop
Predicting Consumer Behaviour via Hadoop
 
Big Data & Analytics MapReduce/Hadoop – A programmer’s perspective
Big Data & Analytics MapReduce/Hadoop – A programmer’s perspectiveBig Data & Analytics MapReduce/Hadoop – A programmer’s perspective
Big Data & Analytics MapReduce/Hadoop – A programmer’s perspective
 
Real-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionReal-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to Production
 
Design patterns in MapReduce
Design patterns in MapReduceDesign patterns in MapReduce
Design patterns in MapReduce
 
Daniel Abadi HadoopWorld 2010
Daniel Abadi HadoopWorld 2010Daniel Abadi HadoopWorld 2010
Daniel Abadi HadoopWorld 2010
 
Predictive Analytics on Big Data. DIY or BUY?
Predictive Analytics on Big Data. DIY or BUY?Predictive Analytics on Big Data. DIY or BUY?
Predictive Analytics on Big Data. DIY or BUY?
 
Population Health Management, Predictive Analytics, Big Data and Text Analytics
Population Health Management, Predictive Analytics, Big Data and Text AnalyticsPopulation Health Management, Predictive Analytics, Big Data and Text Analytics
Population Health Management, Predictive Analytics, Big Data and Text Analytics
 
Decision trees in hadoop
Decision trees in hadoopDecision trees in hadoop
Decision trees in hadoop
 
Evaluating Big Data Predictive Analytics Platforms
Evaluating Big Data Predictive Analytics PlatformsEvaluating Big Data Predictive Analytics Platforms
Evaluating Big Data Predictive Analytics Platforms
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
Big data and Predictive Analytics By : Professor Lili Saghafi
Big data and Predictive Analytics By : Professor Lili SaghafiBig data and Predictive Analytics By : Professor Lili Saghafi
Big data and Predictive Analytics By : Professor Lili Saghafi
 
Informatica Pentaho Etl Tools Comparison
Informatica Pentaho Etl Tools ComparisonInformatica Pentaho Etl Tools Comparison
Informatica Pentaho Etl Tools Comparison
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud Computing
 
Big Data & Artificial Intelligence
Big Data & Artificial IntelligenceBig Data & Artificial Intelligence
Big Data & Artificial Intelligence
 
Predictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligencePredictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial Intelligence
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 

Similar to mapReduce for machine learning

Parallel Machine Learning
Parallel Machine LearningParallel Machine Learning
Parallel Machine LearningJanani C
 
Parallel Computing 2007: Overview
Parallel Computing 2007: OverviewParallel Computing 2007: Overview
Parallel Computing 2007: OverviewGeoffrey Fox
 
A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...butest
 
Machine learning with Big Data power point presentation
Machine learning with Big Data power point presentationMachine learning with Big Data power point presentation
Machine learning with Big Data power point presentationDavid Raj Kanthi
 
MultiObjective(11) - Copy
MultiObjective(11) - CopyMultiObjective(11) - Copy
MultiObjective(11) - CopyAMIT KUMAR
 
Everything you need to know about AutoML
Everything you need to know about AutoMLEverything you need to know about AutoML
Everything you need to know about AutoMLArpitha Gurumurthy
 
Proposal for google summe of code 2016
Proposal for google summe of code 2016 Proposal for google summe of code 2016
Proposal for google summe of code 2016 Mahesh Dananjaya
 
Operation's research models
Operation's research modelsOperation's research models
Operation's research modelsAbhinav Kp
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduceVarad Meru
 
Machine Learning Algorithm for Business Strategy.pdf
Machine Learning Algorithm for Business Strategy.pdfMachine Learning Algorithm for Business Strategy.pdf
Machine Learning Algorithm for Business Strategy.pdfPhD Assistance
 
251 - Alogarithms Lects.pdf
251 - Alogarithms Lects.pdf251 - Alogarithms Lects.pdf
251 - Alogarithms Lects.pdfAbdulkadir Jibril
 
Producer consumer-problems
Producer consumer-problemsProducer consumer-problems
Producer consumer-problemsRichard Ashworth
 
Hardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpHardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpeSAT Publishing House
 
IRJET- Machine Learning Techniques for Code Optimization
IRJET-  	  Machine Learning Techniques for Code OptimizationIRJET-  	  Machine Learning Techniques for Code Optimization
IRJET- Machine Learning Techniques for Code OptimizationIRJET Journal
 
Performance Comparison between Pytorch and Mindspore
Performance Comparison between Pytorch and MindsporePerformance Comparison between Pytorch and Mindspore
Performance Comparison between Pytorch and Mindsporeijdms
 
Concurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core ProcessorsConcurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core ProcessorsCSCJournals
 
Aggreagate awareness
Aggreagate awarenessAggreagate awareness
Aggreagate awarenessYogeeswar Reddy
 
Feature Extraction and Analysis of Natural Language Processing for Deep Learn...
Feature Extraction and Analysis of Natural Language Processing for Deep Learn...Feature Extraction and Analysis of Natural Language Processing for Deep Learn...
Feature Extraction and Analysis of Natural Language Processing for Deep Learn...Sharmila Sathish
 
Pretzel: optimized Machine Learning framework for low-latency and high throu...
Pretzel: optimized Machine Learning framework for  low-latency and high throu...Pretzel: optimized Machine Learning framework for  low-latency and high throu...
Pretzel: optimized Machine Learning framework for low-latency and high throu...NECST Lab @ Politecnico di Milano
 

Similar to mapReduce for machine learning (20)

Parallel Machine Learning
Parallel Machine LearningParallel Machine Learning
Parallel Machine Learning
 
Parallel Computing 2007: Overview
Parallel Computing 2007: OverviewParallel Computing 2007: Overview
Parallel Computing 2007: Overview
 
A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...
 
Machine learning with Big Data power point presentation
Machine learning with Big Data power point presentationMachine learning with Big Data power point presentation
Machine learning with Big Data power point presentation
 
MultiObjective(11) - Copy
MultiObjective(11) - CopyMultiObjective(11) - Copy
MultiObjective(11) - Copy
 
Everything you need to know about AutoML
Everything you need to know about AutoMLEverything you need to know about AutoML
Everything you need to know about AutoML
 
Proposal for google summe of code 2016
Proposal for google summe of code 2016 Proposal for google summe of code 2016
Proposal for google summe of code 2016
 
Operation's research models
Operation's research modelsOperation's research models
Operation's research models
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
Machine Learning Algorithm for Business Strategy.pdf
Machine Learning Algorithm for Business Strategy.pdfMachine Learning Algorithm for Business Strategy.pdf
Machine Learning Algorithm for Business Strategy.pdf
 
ML.pdf
ML.pdfML.pdf
ML.pdf
 
251 - Alogarithms Lects.pdf
251 - Alogarithms Lects.pdf251 - Alogarithms Lects.pdf
251 - Alogarithms Lects.pdf
 
Producer consumer-problems
Producer consumer-problemsProducer consumer-problems
Producer consumer-problems
 
Hardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpHardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmp
 
IRJET- Machine Learning Techniques for Code Optimization
IRJET-  	  Machine Learning Techniques for Code OptimizationIRJET-  	  Machine Learning Techniques for Code Optimization
IRJET- Machine Learning Techniques for Code Optimization
 
Performance Comparison between Pytorch and Mindspore
Performance Comparison between Pytorch and MindsporePerformance Comparison between Pytorch and Mindspore
Performance Comparison between Pytorch and Mindspore
 
Concurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core ProcessorsConcurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core Processors
 
Aggreagate awareness
Aggreagate awarenessAggreagate awareness
Aggreagate awareness
 
Feature Extraction and Analysis of Natural Language Processing for Deep Learn...
Feature Extraction and Analysis of Natural Language Processing for Deep Learn...Feature Extraction and Analysis of Natural Language Processing for Deep Learn...
Feature Extraction and Analysis of Natural Language Processing for Deep Learn...
 
Pretzel: optimized Machine Learning framework for low-latency and high throu...
Pretzel: optimized Machine Learning framework for  low-latency and high throu...Pretzel: optimized Machine Learning framework for  low-latency and high throu...
Pretzel: optimized Machine Learning framework for low-latency and high throu...
 

Recently uploaded

Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
Romantic Opera MUSIC FOR GRADE NINE pptx
Romantic Opera MUSIC FOR GRADE NINE pptxRomantic Opera MUSIC FOR GRADE NINE pptx
Romantic Opera MUSIC FOR GRADE NINE pptxsqpmdrvczh
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
 
ROOT CAUSE ANALYSIS PowerPoint Presentation
ROOT CAUSE ANALYSIS PowerPoint PresentationROOT CAUSE ANALYSIS PowerPoint Presentation
ROOT CAUSE ANALYSIS PowerPoint PresentationAadityaSharma884161
 
Introduction to AI in Higher Education_draft.pptx
Tom Mitchell defines ML as "improving performance in some task with experience". However, this definition is quite broad, so a more specific description is often quoted: ML deals with systems that can learn from data. ML works with data and processes it to discover patterns that can later be used to analyze new data.

ML usually relies on a specific representation of the data, a set of "features" that are understandable for a computer. For example, if we are talking about text, it should be represented through the words it contains, or through other characteristics such as the length of the text, the number of emotional words, and so on. This representation depends on the task you are dealing with and is typically referred to as "feature extraction".

Types of ML

All ML tasks can be classified into several categories, the main ones being:
• supervised ML;
• unsupervised ML;
• reinforcement learning.
Now let us explain, in simple terms, the kind of problem each category addresses. Supervised ML relies on data for which the true label or class is already known. This is easiest to explain with an example. Suppose we want to teach a computer to distinguish pictures of cats from pictures of dogs. We can ask some of our friends to send us pictures of cats and dogs, each tagged 'cat' or 'dog'. Labeling is usually done by human annotators to ensure a high quality of data. Knowing the true labels of the pictures, we can use this data to "supervise" our algorithm in learning the right way to classify images. Once our algorithm learns how to classify images, we can use it on new data and predict labels ('cat' or 'dog' in our case) for previously unseen images.

2.1 Applications to Machine Learning

Many standard machine learning algorithms follow one of a few canonical data processing patterns, which we outline below. A large subset of these can be phrased as MapReduce tasks, illuminating the benefits that the MapReduce framework offers to the machine learning community. In this section, we investigate the performance trade-offs of using MapReduce from an algorithm-centric perspective, considering in turn three classes of ML algorithms and the issues of adapting each to a MapReduce framework. The resulting performance depends intimately on the design choices underlying the MapReduce implementation, and on how well those choices support the data processing pattern of the ML algorithm. We conclude this section with a discussion of changes and extensions to the Hadoop MapReduce implementation that would benefit the machine learning community.

2.1.1 A Taxonomy of Standard Machine Learning Algorithms

While ML algorithms can be classified along many dimensions, the one of primary interest here is procedural character: the data processing pattern of the algorithm. We consider single-pass, iterative, and query-based learning techniques, along with several example algorithms and applications.
2.1.2 Single-pass Learning

Many ML applications make only one pass through a data set, extracting relevant statistics for later use during inference. This relatively simple learning setting arises often in natural language processing, from machine translation to information extraction to spam filtering. These applications often fit perfectly into the MapReduce abstraction, encapsulating the extraction of local contributions in the map task, then combining those contributions to compute relevant statistics about the dataset as a whole. Consider the following examples, which illustrate common decompositions of these statistics.

Estimating a Language Model Multinomial: Extracting language models from a large corpus amounts to little more than counting n-grams, though some parameter smoothing over the statistics is also common. The map phase enumerates the n-grams in each training instance (typically a sentence or paragraph), and the reduce function counts the instances of each n-gram. (This option has been investigated as part of Alex Rasmussen's Hadoop-related CS 262 project this semester.) A minimal code sketch of this pattern appears at the end of this subsection.

Feature Extraction for Naive Bayes Classifiers: Estimating parameters for a naive Bayes classifier, or any fully observed Bayes net, again requires counting occurrences in the training data. In this case, however, feature extraction is often computation-intensive, perhaps involving small search or optimization problems for each datum. The reduce task, however, remains a summation over each (feature, label) pair.

Syntactic Translation Modeling: Generating a syntactic model for machine translation is an example of a research-level machine learning application that involves only a single pass through a preprocessed training set. Each training datum consists of a pair of sentences in two languages, an estimated alignment between the words in each, and an estimated syntactic parse tree for one sentence. (Generating appropriate training data for this task itself involves several applications of iterative learning algorithms, described in the following section.) The per-datum feature extraction encapsulated in the map phase for this task involves search over these coupled data structures.
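To make the n-gram counting pattern concrete, here is a minimal bigram-counting job written against Hadoop's Java MapReduce API. This is a sketch rather than code from the report: the class name, the whitespace tokenization, and the input/output paths are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BigramCount {

  // Map: emit every bigram in the input line with a count of one.
  public static class BigramMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text bigram = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] tokens = value.toString().trim().split("\\s+");
      for (int i = 0; i + 1 < tokens.length; i++) {
        bigram.set(tokens[i] + " " + tokens[i + 1]);
        context.write(bigram, ONE);
      }
    }
  }

  // Reduce: sum the counts emitted for each distinct bigram.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "bigram count");
    job.setJarByClass(BigramCount.class);
    job.setMapperClass(BigramMapper.class);
    job.setCombinerClass(SumReducer.class);   // local pre-aggregation before the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Submitted with two arguments (an input directory of sentences and an output directory), the job emits one line per bigram with its corpus count; reusing the reducer as a combiner shrinks the map output before the shuffle.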
2.1.3 Iterative Learning

The class of iterative ML algorithms (perhaps the most common within the machine learning research community) can also be expressed within the framework of MapReduce by chaining together multiple MapReduce tasks. While such algorithms vary widely in the type of operation they perform on each datum (or pair of data) in a training set, they share the common characteristic that a set of parameters is matched to the data set via iterative improvement. The update to these parameters across iterations must again decompose into per-datum contributions, as is the case in the example applications below. As with the examples discussed in the previous section, the reduce function is considerably less compute-intensive than the map tasks.

In the examples below, the contribution to the parameter updates from each datum (the map function) depends in a meaningful way on the output of the previous iteration. For example, the expectation computation of EM, or the inference computation in an SVM or perceptron classifier, can reference a large portion or all of the parameters generated by the algorithm. Hence, these parameters must remain available to the map tasks in a distributed environment. The information necessary to compute the map step of each algorithm is described below; the complications that arise because this information is vital to the computation are investigated later in the paper.

Expectation Maximization (EM): The well-known EM algorithm maximizes the likelihood of a training set given a generative model with latent variables. The E-step of the algorithm computes posterior distributions over the latent variables given current model parameters and the observed data. The M-step adjusts the model parameters to maximize the likelihood of the data, assuming that the latent variables take on their expected values. Projected onto the MapReduce framework, the map task computes posterior distributions over the latent variables of a datum using the current model parameters, and the maximization step is performed as a single reduction, which sums the sufficient statistics and normalizes to produce updated parameters. We consider applications in machine translation and speech recognition.
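The chaining pattern amounts to running one MapReduce job per EM iteration and feeding each iteration's output parameters back to the next job's map tasks. The following minimal sketch fits a toy one-dimensional, two-component Gaussian mixture (unit variances, equal priors), broadcasting the current means to the mappers through the job Configuration. It assumes the Hadoop 2.x Java API and one numeric datum per input line; the property names, paths, and fixed iteration cap are illustrative.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EmDriver {

  // E-step: each map task reads the current means from the Configuration and
  // emits per-datum sufficient statistics "weight,weight*x" for each component.
  public static class EStepMapper extends Mapper<Object, Text, Text, Text> {
    private double mu0, mu1;

    protected void setup(Context ctx) {
      mu0 = ctx.getConfiguration().getDouble("em.mu0", -1.0);
      mu1 = ctx.getConfiguration().getDouble("em.mu1", 1.0);
    }

    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      double x = Double.parseDouble(value.toString().trim());
      double p0 = Math.exp(-0.5 * (x - mu0) * (x - mu0));
      double p1 = Math.exp(-0.5 * (x - mu1) * (x - mu1));
      double r1 = p1 / (p0 + p1);   // responsibility of component 1 for this datum
      ctx.write(new Text("mu0"), new Text((1 - r1) + "," + (1 - r1) * x));
      ctx.write(new Text("mu1"), new Text(r1 + "," + r1 * x));
    }
  }

  // M-step: a single reduction per component sums the statistics and
  // normalizes to produce the updated mean.
  public static class MStepReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double w = 0, wx = 0;
      for (Text v : values) {
        String[] parts = v.toString().split(",");
        w += Double.parseDouble(parts[0]);
        wx += Double.parseDouble(parts[1]);
      }
      ctx.write(key, new DoubleWritable(wx / w));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    double mu0 = -1.0, mu1 = 1.0;                 // initial parameters
    for (int i = 0; i < 10; i++) {                // fixed iteration cap for brevity
      conf.setDouble("em.mu0", mu0);
      conf.setDouble("em.mu1", mu1);
      Job job = Job.getInstance(conf, "em iteration " + i);
      job.setJarByClass(EmDriver.class);
      job.setMapperClass(EStepMapper.class);
      job.setReducerClass(MStepReducer.class);
      job.setMapOutputKeyClass(Text.class);
      job.setMapOutputValueClass(Text.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(DoubleWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      Path out = new Path(args[1] + "-iter" + i);
      FileOutputFormat.setOutputPath(job, out);
      if (!job.waitForCompletion(true)) System.exit(1);
      // Read the updated means back so they can be re-broadcast next iteration.
      try (BufferedReader r = new BufferedReader(new InputStreamReader(
          FileSystem.get(conf).open(new Path(out, "part-r-00000"))))) {
        for (String line; (line = r.readLine()) != null; ) {
          String[] kv = line.split("\t");
          if (kv[0].equals("mu0")) mu0 = Double.parseDouble(kv[1]);
          else mu1 = Double.parseDouble(kv[1]);
        }
      }
    }
    System.out.println("mu0=" + mu0 + " mu1=" + mu1);
  }
}

For realistic models the parameters are far too large to pass through the Configuration, which is why the broadcast and parameter-distribution issues discussed below matter.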
For multivariate Gaussian mixture models (e.g., for speaker identification), these parameters are simply the mean vector and a covariance matrix. For HMM-GMM models (e.g., speech recognition), parameters are also needed to specify the state transition probabilities; the models, efficiently stored in binary form, occupy tens of megabytes. For word alignment models (e.g., machine translation), these parameters include word-to-word translation probabilities; these can number in the millions, even after pruning heuristics remove the unnecessary parameters.

Discriminative Classification and Regression: When fitting model parameters via a perceptron, boosting, or support vector machine algorithm for classification or regression, the map stage of training involves computing inference over a training example given the current model parameters. As in the EM case, a subset of the parameters from the previous iteration must be available for inference, while the reduce stage typically involves summing over parameter changes. Thus, all relevant model parameters must be broadcast to each map task. In a typical featurized setting, where hundreds or thousands of features are extracted from each training example, the parameter space needed for inference can be quite large.

2.1.4 Query-based Learning with Distance Metrics

Finally, we consider distance-based ML applications that directly reference the training set during inference, such as the nearest-neighbor classifier. In this setting, the training data are the parameters, and a query instance must be compared to each training datum. While multiple query instances can be processed simultaneously within a MapReduce implementation of these techniques, the query set must be broadcast to all map tasks, so again we have a need for the distribution of state information. In this case, however, the query information distributed to all map tasks needn't be processed concurrently; a query set can be broken up and processed over multiple MapReduce operations. In the examples below, each query instance tends to be of a manageable size.

K-nearest Neighbors Classifier: The nearest-neighbor classifier compares each element of a query set to each element of a training set, and discovers the examples with minimal distance from the queries. The map stage computes distance metrics, while the reduce stage tracks, for each label, the k examples with minimal distance to the query.
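A minimal single-query sketch of this pattern follows. It assumes the query vector is small enough to broadcast through the job Configuration and that training records are CSV lines of the form "label,f1,f2,..."; the property name, record format, and the value of k are illustrative.

import java.io.IOException;
import java.util.Collections;
import java.util.PriorityQueue;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KnnJob {

  // Map: compute the distance from each training record to the broadcast query.
  public static class DistanceMapper extends Mapper<Object, Text, Text, DoubleWritable> {
    private double[] query;

    protected void setup(Context ctx) {
      String[] parts = ctx.getConfiguration().get("knn.query").split(",");
      query = new double[parts.length];
      for (int i = 0; i < parts.length; i++) query[i] = Double.parseDouble(parts[i]);
    }

    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split(",");   // "label,f1,f2,..."
      double d = 0;
      for (int i = 0; i < query.length; i++) {
        double diff = Double.parseDouble(parts[i + 1]) - query[i];
        d += diff * diff;                             // squared Euclidean distance
      }
      ctx.write(new Text(parts[0]), new DoubleWritable(d));
    }
  }

  // Reduce: for each label, keep only the k smallest distances (max-heap of size k).
  public static class TopKReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private static final int K = 5;

    protected void reduce(Text label, Iterable<DoubleWritable> dists, Context ctx)
        throws IOException, InterruptedException {
      PriorityQueue<Double> heap = new PriorityQueue<>(K, Collections.reverseOrder());
      for (DoubleWritable d : dists) {
        heap.add(d.get());
        if (heap.size() > K) heap.poll();   // evict the current largest distance
      }
      for (double d : heap) ctx.write(label, new DoubleWritable(d));  // unordered output
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("knn.query", args[2]);          // e.g. "0.3,1.7,2.0"
    Job job = Job.getInstance(conf, "knn distances");
    job.setJarByClass(KnnJob.class);
    job.setMapperClass(DistanceMapper.class);
    job.setReducerClass(TopKReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}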
Similarity-based Search: Finding the most similar instances to a given query has a similar character, sifting through the training set to find examples that minimize a distance metric. Computing the distance is the map stage, while minimizing it is the reduce stage.

2.2 Performance and Implementation Issues

While the algorithms discussed above can all be implemented in parallel using the MapReduce abstraction, our example applications from each category revealed a set of implementation challenges. We conducted all of our experiments on top of the Hadoop platform. In the discussion below, we address issues related both to the Hadoop implementation of MapReduce and to the MapReduce framework itself.

2.2.1 Single-pass Learning

The single-pass learning algorithms described in the previous section are clearly amenable to the MapReduce framework. We focus here on the task of generating a syntactic translation model from a set of sentence pairs, their word-level bilingual alignments, and their syntactic structures.

Fig 1: running time in seconds against training set size (thousands of sentence pairs) for a local reference implementation, a local single-machine MapReduce simulation, and a 3-node MapReduce cluster; the benefit of distributed computation quickly outweighs the overhead of a MapReduce implementation on a 3-node cluster.

Figure 1 shows the running times for various input sizes, demonstrating the overhead of running
MapReduce relative to the reference implementation. The cost of running Hadoop cancels out some of the benefit of parallelizing the code. Specifically, running on 3 machines gave a speed-up of 39% over the reference implementation, while the overhead of simulating a MapReduce computation on a single machine was 51% of the compute cost of the reference implementation. Distributing the task to a large cluster would clearly justify this overhead, but parallelizing to two machines would give virtually no benefit for the largest data set size we tested.

A more promising metric shows that as the size of the data scales, the distributed MapReduce implementation maintains a low per-datum cost. We can isolate the variable-cost overhead of each example by comparing the slopes of the curves in Figure 1, which are all near-linear. The reference implementation shows a variable computation cost of 1.7 seconds per 1000 examples, while the distributed implementation across three machines shows a cost of 0.5 seconds per 1000 examples. So the variable overhead of MapReduce is minimal for this task, while the static overhead of distributing the code base and channeling the processing through Hadoop's infrastructure is large. We would expect that substantially increasing the size of the training set would accentuate the utility of distributing the computation.

Thus far, we have assumed that the training data was already written to Hadoop's distributed file system (DFS). The cost of distributing data is relevant in this setting, however, due to drawbacks of Hadoop's DFS implementation. In the simple case of text processing, a training corpus need only be distributed to Hadoop's file system once, and can then be processed by many different applications. This application, on the other hand, references four different input sources, including sentences, alignments and parse trees for each example. When these resources are copied independently to the DFS, the Hadoop implementation gives no control over how the files are mapped to remote machines; hence, no one machine necessarily contains all of the data necessary for a given example.

Apache Mahout is a framework for implementing machine learning in the MapReduce paradigm.

3 Apache Mahout

Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily in the areas of
collaborative filtering, clustering and classification. Many of the implementations use the Apache Hadoop platform. Mahout also provides Java libraries for common math operations (focused on linear algebra and statistics) and primitive Java collections. Mahout is a work in progress; the number of implemented algorithms has grown quickly, but various algorithms are still missing.

While Mahout's core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm, the project does not restrict contributions to Hadoop-based implementations. Contributions that run on a single node or on a non-Hadoop cluster are also welcomed. For example, the 'Taste' collaborative-filtering recommender component of Mahout was originally a separate project and can run stand-alone without Hadoop. Integration with initiatives such as Pregel-like graph frameworks is actively under discussion.

Mahout's algorithms also include many new implementations built for speed on Mahout-Samsara. These run on Spark, and some on H2O, which can mean as much as a 10x speed increase. You will find robust matrix decomposition algorithms as well as a Naive Bayes classifier and collaborative filtering. The new spark-itemsimilarity job enables the next generation of co-occurrence recommenders, which can use entire user click streams and context in making recommendations.

3.1 Mahout scalability and use cases

Mahout aims to be scalable to reasonably large data sets: its core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm, while the core libraries are highly optimized to give good performance for non-distributed algorithms as well. Mahout is also scalable to your business case, being distributed under the commercially friendly Apache Software License, and aims for a scalable community: the goal of Mahout is to build a vibrant, responsive, diverse community that facilitates discussion not only of the project itself but also of potential use cases. Come to the mailing lists to find out more. Currently Mahout supports mainly four use cases: recommendation mining, clustering, classification and frequent itemset mining. Each is described in more detail in Section 3.1.2 below.
3.1.1 Apache Mahout Integration

The flexibility and power of Apache Mahout (http://mahout.apache.org/) in conjunction with Hadoop is invaluable. We have therefore packaged the most recent stable release of Mahout (0.5), and we are very excited to work with the Mahout community and to become much more involved with the project as both Mahout and Hadoop continue to grow.

3.1.2 Why we are packaging Mahout with Hadoop

Machine learning is an entire field devoted to information retrieval, statistics, linear algebra, analysis of algorithms, and many other subjects. It allows us to examine things such as recommendation engines involving new friends, love interests, and new products. We can do incredibly advanced analysis around genetic sequencing and examination, distributed search and frequent-pattern matching, as well as mathematical analysis with vectors, matrices, and singular value decomposition (SVD). Apache Mahout is an open source project of the Apache Software Foundation devoted to machine learning. Mahout can operate on top of Hadoop, which allows the user to apply machine learning, via a selection of algorithms in Mahout, to distributed computing via Hadoop. Mahout packages popular machine learning algorithms such as:

• Recommendation mining, which takes users' behavior and finds items a specified user might like (a stand-alone sketch follows at the end of this subsection).
• Clustering, which takes e.g. text documents and groups them based on related document topics.
• Classification, which learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabeled documents to the appropriate category.
• Frequent itemset mining, which takes a set of item groups (e.g. terms in a query session, shopping cart contents) and identifies which individual items typically appear together.

We are very excited to be working with the Apache Mahout community and highly encourage everyone currently using CDH to give Mahout a try! As always, we are open to guests who would like to blog about their experience using Mahout with CDH.
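To illustrate the recommendation-mining use case, the stand-alone Taste component can produce user-based recommendations in a few lines of Java. This is a minimal sketch, assuming a ratings file in Taste's "userID,itemID,rating" CSV format; the file name, neighborhood size, and user ID are illustrative.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteExample {
  public static void main(String[] args) throws Exception {
    // Ratings file: one "userID,itemID,rating" triple per line (illustrative name).
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Neighborhood of the 10 most similar users.
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top-3 recommendations for user 1 (user ID is illustrative).
    List<RecommendedItem> recommendations = recommender.recommend(1, 3);
    for (RecommendedItem item : recommendations) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}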
3.2 Installing Mahout

Mahout is a collection of highly scalable machine learning algorithms for very large data sets. Although the real power of Mahout shows only on large HDFS data, Mahout also supports running algorithms on local file system data, which can help you get a feel for how to run Mahout algorithms. Before you can run any Mahout algorithm you need a Mahout installation ready on your Linux machine, which can be carried out as described below.

Step 1: Download mahout-distribution-0.x.tar.gz from the Apache download mirrors and extract the contents of the Mahout package to a location of your choice; here we pick /usr/local. Make sure to change the owner of all the files to the hduser user and hadoop group, for example:

$ cd /usr/local
$ sudo tar xzf mahout-distribution-0.x.tar.gz
$ sudo mv mahout-distribution-0.x mahout
$ sudo chown -R hduser:hadoop mahout

This should result in a folder named /usr/local/mahout.
Now you can run any of the algorithms using the script bin/mahout in the extracted folder. To test your installation, you can also run bin/mahout without any other arguments. Next, set the path in the .bashrc file:

export MAHOUT_HOME=/usr/local/mahout
export PATH=$PATH:$MAHOUT_HOME/bin

Step 2: Create a directory where you want to check out the Mahout code; we will call it MAHOUT_HOME here:

$ sudo mkdir -p /app/mahout
$ sudo chown hduser:hadoop /app/mahout
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/mahout

Step 3: Set the Hadoop configuration path in hadoop-env.sh so that Hadoop picks up the Mahout libraries, for example by adding:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/local/mahout/lib/*

Step 4: Install Maven:

$ sudo tar xzf apache-maven-2.0.9-bin.tar.gz
$ sudo mv apache-maven-2.0.9 maven
$ sudo chown -R hduser:hadoop maven

Now set the Maven path in the .bashrc file:

export M2_HOME=/usr/local/maven
export PATH=$PATH:$M2_HOME/bin

You can now start Mahout with the bin/mahout command. The Mahout installation is complete, and Mahout can be used with the Hadoop configuration.

Fig 2: showing the installed Maven version.
4 Advantages and Disadvantages

Advantages

The paper "Map-Reduce for Machine Learning on Multicore" shows 10 machine learning algorithms which can benefit from the MapReduce model. The key point is that any algorithm fitting the Statistical Query Model may be written in a certain "summation form", and any algorithm that can be expressed in summation form can use the MapReduce programming model (a small illustration of this form follows at the end of this section).

Disadvantages

MapReduce does not work well when there are computational dependencies in the data. This limitation makes it difficult to represent algorithms that operate on structured models. As a consequence, when confronted with large-scale problems, we often abandon rich structured models in favor of overly simplistic methods that are amenable to the MapReduce abstraction. In the machine-learning community, numerous algorithms iteratively transform parameters during both learning and inference, e.g., belief propagation, expectation maximization, gradient descent and Gibbs sampling. These algorithms iteratively refine a set of parameters until some termination criterion is met. Invoking MapReduce in each iteration can still speed up the computation, but the point is that we want a better abstraction framework, one that makes it possible to embrace the graphical structure of the data, to express sophisticated scheduling, or to assess termination automatically.
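To make the summation form concrete: the batch gradient of least-squares linear regression decomposes into per-datum contributions that can be computed independently (the map) and then added together (the reduce). The sketch below is illustrative and not taken from the paper; it expresses the decomposition on a single multicore machine with Java parallel streams, and the toy data, learning rate, and iteration count are arbitrary.

import java.util.Arrays;
import java.util.stream.IntStream;

public class SummationForm {

  // Gradient of the squared error for linear regression, in summation form:
  // grad = sum_i (w . x_i - y_i) * x_i. Each datum's contribution is computed
  // independently (the "map") and the partial gradients are added (the "reduce").
  static double[] gradient(double[][] x, double[] y, double[] w) {
    return IntStream.range(0, x.length).parallel()
        .mapToObj(i -> {
          double err = dot(w, x[i]) - y[i];          // per-datum residual
          double[] g = new double[w.length];
          for (int j = 0; j < w.length; j++) g[j] = err * x[i][j];
          return g;                                  // per-datum gradient ("map")
        })
        .reduce(new double[w.length], SummationForm::add);  // summation ("reduce")
  }

  static double dot(double[] a, double[] b) {
    double s = 0;
    for (int j = 0; j < a.length; j++) s += a[j] * b[j];
    return s;
  }

  static double[] add(double[] a, double[] b) {
    double[] c = new double[a.length];
    for (int j = 0; j < a.length; j++) c[j] = a[j] + b[j];
    return c;
  }

  public static void main(String[] args) {
    double[][] x = {{1, 1}, {1, 2}, {1, 3}};  // toy data: bias feature plus one input
    double[] y = {1, 2, 3};
    double[] w = {0, 0};
    for (int t = 0; t < 1000; t++) {          // plain batch gradient descent
      double[] g = gradient(x, y, w);
      for (int j = 0; j < w.length; j++) w[j] -= 0.05 * g[j];
    }
    System.out.println(Arrays.toString(w));   // converges toward [0, 1]
  }
}

The same decomposition is exactly what allows the work to be spread over mappers on disjoint subsets of the data, with a reducer summing the partial gradients.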
5 Conclusions

By virtue of its simplicity and fault tolerance, MapReduce proved to be an admirable gateway to parallelizing machine learning applications. The benefits of easy development and robust computation did come at a price in terms of performance, however, and proper tuning of the system led to substantial variance in performance. Defining common settings for machine learning algorithms led us to discover the core shortcomings of Hadoop's MapReduce implementation. We were able to address the most significant of these, the need to broadcast data to map tasks, and we also greatly improved the convenience of running MapReduce jobs from within existing Java applications. However, the ability to tie together the distribution of related files on the DFS remains an outstanding challenge. All in all, MapReduce represents a promising direction for future machine learning implementations. In continuing to develop Hadoop and the tools that surround it, we must strive to minimize the compromises between convenience and performance, providing a platform that allows for both efficient processing and rapid application development.
6 References

• http://mahout.apache.org/
• https://chameerawijebandara.wordpress.com/2014/01/03/install-mahout-in-ubuntu-for-beginners/
• http://nivirao.blogspot.com/2012/04/installing-apache-mahout-on-ubuntu.html
• https://help.ubuntu.com/community/Java
• http://mahout.apache.org/developers/buildingmahout.html