MapReduce for Machine Learning
Seminar Report 2015, Dept. of MCA, LBSCEK
1 Introduction
Frequency scaling on silicon, the ability to drive chips at ever higher clock rates, is beginning
to hit a power limit: leakage grows as device geometries shrink, and CMOS consumes power
every time it changes state. Moore's law, the doubling of circuit density every generation, is
still projected to hold for another 10 to 20 years for silicon-based circuits; by doubling the
number of processing cores on a chip rather than the clock rate, one can keep power consumption
low while doubling the speed of many applications. This has forced an industry-wide shift to multicore.
We thus approach an era of increasing numbers of cores per chip, but there is as yet no good
framework for machine learning to take advantage of massive numbers of cores. There are many
parallel programming languages, such as Orca, Occam, ABCL, SNOW, MPI, and PARLOG, but
none of them makes it obvious how to parallelize a particular algorithm. There is a vast
literature on distributed learning and data mining, but very little of it focuses on our goal:
a general means of programming machine learning on multicore. Much of this literature reflects
a long and distinguished tradition of developing (often ingenious) ways to speed up or
parallelize individual learning algorithms, for instance cascaded parallelization techniques
for machine learning; more pragmatically, however, such specialized implementations of popular
algorithms rarely lead to widespread use. Some examples of more general papers are: Caragea et
al. give some general data-distribution conditions for parallelizing machine learning, but
restrict their focus to decision trees; Jin and Agrawal give a general machine learning
programming approach, but only for shared-memory machines. This does not fit the architecture
of cellular or grid-type multiprocessors, where each core has a local cache, even if it can be
dynamically reallocated.
In this paper, we focus on developing a general and exact technique for parallel
programming of a large class of machine learning algorithms on multicore processors. The
central idea of this approach is to allow a future programmer or user to speed up machine
learning applications by "throwing more cores" at the problem rather than searching for
specialized optimizations. This paper's contributions are: (i) We show that any algorithm
fitting the Statistical Query Model may be written in a certain "summation form." This form
does not change the underlying algorithm and so is not an approximation, but is instead an
exact implementation. (ii) The summation form does not depend on, but can be easily expressed
in, a map-reduce framework which is easy to program in. (iii) This technique achieves basically
linear speed-up with the number of cores.
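To make the summation form concrete, consider the following minimal sketch (our own illustration, not code from the paper). Least-squares linear regression fits this mold: the normal equations A w = b are built from per-datum terms A = sum_i x_i x_i^T and b = sum_i x_i y_i, so map tasks can compute partial sums over data chunks and a single reduce adds them up.

import numpy as np

def map_partial_sums(chunk):
    # Map task: this chunk's contribution to A = sum x x^T and b = sum x y.
    X, y = chunk
    return X.T @ X, X.T @ y

def reduce_sums(partials):
    # Reduce task: add up the per-chunk contributions.
    partials = list(partials)
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return A, b

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=1000)
chunks = [(X[i:i + 250], y[i:i + 250]) for i in range(0, 1000, 250)]  # 4 simulated cores

A, b = reduce_sums(map(map_partial_sums, chunks))
w = np.linalg.solve(A, b)  # identical to the single-machine least-squares solution

Solving the summed system gives exactly the single-machine solution, which is why the summation form is an exact implementation rather than an approximation.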
2 Machine learning
"Machine learning" sounds mysterious to most people. Indeed, only a small fraction of
professionals really know what it stands for, and there is a good reason for this: the field is
rather technical and difficult to explain to a layman. However, we would like to bridge this gap
and explain a bit about what machine learning (ML) is and how it can be used in everyday
life or business.
So what is this mysterious ML?
Machine learning can refer to:
• a branch of artificial intelligence;
• the methods used in this field (there are a variety of different approaches).
Overall, speaking of the latter, Tom Mitchell, author of the well-known book "Machine
Learning," defines ML as "improving performance in some task with experience." However, this
definition is quite broad, so we can quote another, more specific description stating that ML
deals with systems that can learn from data.
ML works with data, processing it to discover patterns that can later be used to analyze new
data. ML usually relies on a specific representation of the data: a set of "features" that are
understandable to a computer. For example, a text might be represented through the words it
contains, or through other characteristics such as the length of the text, the number of
emotional words, and so on. This representation depends on the task you are dealing with, and
producing it is typically referred to as "feature extraction".
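As a small illustration (the feature names and the word list below are invented for this sketch), a feature extractor for text might look like this:

def extract_features(text):
    # Hypothetical features: raw length, token count, emotional-word count.
    emotional_words = {"love", "hate", "great", "awful"}
    tokens = text.lower().split()
    return {
        "length": len(text),
        "num_tokens": len(tokens),
        "num_emotional": sum(t in emotional_words for t in tokens),
    }

print(extract_features("I love this great seminar"))
# -> {'length': 25, 'num_tokens': 5, 'num_emotional': 2}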
Types of ML
All ML tasks can be classified into several categories; the main ones are:
• supervised ML;
• unsupervised ML;
• reinforcement learning.
Now let us explain in simple words the kind of problems dealt with by each category.
Supervised ML relies on data for which the true label/class is indicated. This is easier to explain
using an example. Imagine that we want to teach a computer to distinguish pictures of cats
and dogs. We can ask some of our friends to send us pictures of cats and dogs, adding a tag 'cat'
or 'dog'. Labeling is usually done by human annotators to ensure high-quality data. Now that
we know the true labels of the pictures, we can use this data to "supervise" our algorithm in
learning the right way to classify images. Once our algorithm learns how to classify images, we
can use it on new data and predict labels ('cat' or 'dog' in our case) for previously unseen
images.
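A toy sketch of this workflow (with made-up numeric features standing in for real image features, and a simple nearest-centroid rule standing in for a real classifier) might look like this:

import numpy as np

# Labeled training data: (features, label) pairs supplied by human annotators.
train_X = np.array([[4.0, 1.0], [4.5, 0.8], [9.0, 3.0], [8.5, 3.2]])
train_y = np.array(["cat", "cat", "dog", "dog"])

# "Learning": compute one centroid per class.
centroids = {label: train_X[train_y == label].mean(axis=0)
             for label in np.unique(train_y)}

def predict(x):
    # Assign the label of the nearest class centroid.
    return min(centroids, key=lambda label: np.linalg.norm(x - centroids[label]))

print(predict(np.array([4.2, 0.9])))  # -> cat
print(predict(np.array([8.8, 3.1])))  # -> dog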
2.1 Applications to Machine Learning
Many standard machine learning algorithms follow one of a few canonical data processing
patterns, which we outline below. A large subset of these can be phrased as MapReduce tasks,
illuminating the benefits that the MapReduce framework offers to the machine learning
community.
In this section, we investigate the performance trade-offs of using MapReduce from an
algorithm-centric perspective, considering in turn three classes of ML algorithms and the issues of adapting
each to a MapReduce framework. The performance that results depends intimately on the design
choices underlying the MapReduce implementation, and how well those choices support the data
processing pattern of the ML algorithm. We conclude this section with a discussion of changes
and extensions to the Hadoop MapReduce implementation that would benefit the machine
learning community.
2.1.1 A Taxonomy of Standard Machine Learning Algorithms
While ML algorithms can be classified on many dimensions, the one we take primary interest in
here is that of procedural character: the data processing pattern of the algorithm. Here, we
consider single-pass, iterative and query-based learning techniques, along with several example
algorithms and applications.
2.1.2 Single-pass Learning
Many ML applications make only one pass through a data set, extracting relevant statistics for
later use during inference. This relatively simple learning setting arises often in natural language
processing, from machine translation to information extraction to spam filtering. These
applications often fit perfectly into the MapReduce abstraction, assigning the extraction of
local contributions to the map task and then combining those contributions to compute relevant
statistics about the dataset as a whole. Consider the following examples, illustrating common
decompositions of these statistics.
Estimating Language Model Multinomials: Extracting language models from a large corpus
amounts to little more than counting n-grams, though some parameter smoothing over the
statistics is also common. The map phase enumerates the n-grams in each training instance
(typically a sentence or paragraph), and the reduce function counts the instances of each
n-gram. (This option has been investigated as part of Alex Rasmussen's Hadoop-related CS 262
project this semester.)
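A local simulation of this counting pattern, in plain Python rather than the Hadoop API, might look as follows (the corpus and the choice of bigrams are invented for the example):

from collections import Counter
from itertools import chain

corpus = ["the cat sat", "the cat ran", "a dog sat"]

def map_ngrams(sentence, n=2):
    # Map: emit (n-gram, 1) pairs for one training instance.
    words = sentence.split()
    return [(tuple(words[i:i + n]), 1) for i in range(len(words) - n + 1)]

def reduce_counts(pairs):
    # Reduce: sum the counts for each n-gram key.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

counts = reduce_counts(chain.from_iterable(map_ngrams(s) for s in corpus))
print(counts[("the", "cat")])  # -> 2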
Feature Extraction for Naive Bayes Classifiers: Estimating the parameters of a naive Bayes
classifier, or of any fully observed Bayes net, again requires counting occurrences in the
training data. In this case, however, feature extraction is often computation-intensive,
perhaps involving small search or optimization problems for each datum. The reduce task,
however, remains a summation over occurrences of each (feature, label) pair.
Syntactic Translation Modeling: Generating a syntactic model for machine translation is an
example of a research-level machine learning application that involves only a single pass through
a preprocessed training set. Each training datum consists of a pair of sentences in two languages,
an estimated alignment between the words in each, and an estimated syntactic parse tree for one
sentence. (Generating appropriate training data for this task itself involves several
applications of iterative learning algorithms, described in the following section.) The
per-datum feature extraction encapsulated in the map phase for this task involves search over
these coupled data structures.
2.1.3 Iterative Learning
The class of iterative ML algorithms – perhaps the most common within the machine learning
research community – can also be expressed within the framework of MapReduce by chaining
together multiple MapReduce tasks. While such algorithms vary widely in the type of operation
they perform on each datum (or pair of data) in a training set, they share the common
characteristic that a set of parameters is matched to the data set via iterative improvement.
The update to these parameters across iterations must again decompose into per-datum
contributions, as is the case for the example applications below. As with the examples
discussed in the previous section, the reduce function is considerably less compute-intensive
than the map tasks.
In the examples below, the contribution to the parameter updates from each datum (the map
function) depends in a meaningful way on the output of the previous iteration. For example, the
expectation computation of EM, or the inference computation in an SVM or perceptron classifier,
can reference a large portion or all of the parameters generated by the algorithm. Hence, these
parameters must remain available to the map tasks in a distributed environment. The information
necessary to compute the map step of each algorithm is described below; the complications that
arise because this information is vital to the computation are investigated later in the paper.
Expectation Maximization (EM): The well-known EM algorithm maximizes the likelihood of
a training set given a generative model with latent variables. The E-step of the algorithm
computes posterior distributions over the latent variables given current model parameters and the
observed data. The maximization step adjusts model parameters to maximize the likelihood of
the data assuming that latent variables take on their expected values. Projecting onto the
MapReduce framework, the map task computes posterior distributions over the latent variables
of a datum using current model parameters; the maximization step is performed as a single
reduction, which sums the sufficient statistics and normalizes to produce updated parameters.
We consider applications in machine translation and speech recognition. For multivariate
Gaussian mixture models (e.g., for speaker identification), these parameters are simply the
mean vectors and covariance matrices. For HMM-GMM models (e.g., speech recognition),
parameters are also needed to specify the state transition probabilities; the models,
efficiently stored in binary form, occupy tens of megabytes. For word alignment models (e.g.,
machine translation), these parameters include word-to-word translation probabilities; these
can number in the millions, even after pruning heuristics remove unnecessary parameters.
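The following sketch (our own simplified illustration with one-dimensional data, fixed unit variances, and equal mixing weights, rather than the speech or translation models above) shows how one EM iteration decomposes: each map call produces its chunk's sufficient statistics, and the reduce sums them and re-estimates the means.

import numpy as np

def e_step_map(chunk, means, var=1.0):
    # Map: per-datum posteriors over the two components, reduced to this
    # chunk's sufficient statistics (soft counts and weighted sums).
    d = -(chunk[:, None] - means[None, :]) ** 2 / (2 * var)
    r = np.exp(d)
    r /= r.sum(axis=1, keepdims=True)
    return r.sum(axis=0), (r * chunk[:, None]).sum(axis=0)

def m_step_reduce(partials):
    # Reduce: sum sufficient statistics across chunks, re-estimate the means.
    counts = sum(p[0] for p in partials)
    sums = sum(p[1] for p in partials)
    return sums / counts

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])
chunks = np.array_split(data, 4)   # simulate distribution over 4 map tasks

means = np.array([-1.0, 1.0])      # initial parameters, broadcast to every map task
for _ in range(10):                # each pass corresponds to one chained MapReduce job
    means = m_step_reduce([e_step_map(c, means) for c in chunks])
print(means)                       # approximately [-3, 3]

Each pass of the loop stands in for one MapReduce job, with the current means broadcast to every map task, which is exactly the state-distribution requirement discussed above.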
Discriminative Classification and Regression: When fitting model parameters via a
perceptron, boosting, or support vector machine algorithm for classification or regression, the
map stage of training will involve computing inference over each training example given the
current model parameters. Similar to the EM case, a subset of the parameters from the previous
iteration must be available for inference, while the reduce stage typically involves summing
over parameter changes.
Thus, all relevant model parameters must be broadcast to each map task. In a typical featurized
setting, which often extracts hundreds or thousands of features from each training example,
the relevant parameter space needed for inference can be quite large.
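As a rough sketch of this pattern (a batch perceptron variant chosen for brevity, not a production algorithm), one training iteration can map each data chunk to a summed parameter-update contribution and reduce by adding the contributions:

import numpy as np

def map_updates(chunk, w):
    # Map: sum of update contributions from misclassified examples in a chunk,
    # given the broadcast parameter vector w.
    X, y = chunk
    mask = y * (X @ w) <= 0
    return (y[mask, None] * X[mask]).sum(axis=0)

def reduce_updates(partials):
    # Reduce: total update across all chunks.
    return sum(partials)

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
y = np.sign(X @ np.array([2.0, -1.0, 0.5]))   # separable toy labels

chunks = [(X[i:i + 100], y[i:i + 100]) for i in range(0, 400, 100)]
w = np.zeros(3)
for _ in range(20):                            # one MapReduce job per iteration
    w += 0.1 * reduce_updates([map_updates(c, w) for c in chunks])
print((np.sign(X @ w) == y).mean())            # training accuracy, typically near 1.0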
2.1.4 Query-based Learning with Distance Metrics
Finally, we consider distance-based ML applications that directly reference the training set
during inference, such as the nearest-neighbor classifier. In this setting, the training data are the
parameters, and a query instance must be compared to each training datum.
While it is the case that multiple query instances can be processed simultaneously within a
MapReduce implementation of these techniques, the query set must be broadcast to all map
tasks. Again, we have a need for the distribution of state information. However, in this case, the
query information that must be distributed to all map tasks needn’t be processed concurrently – a
query set can be broken up and processed over multiple MapReduce operations. In the examples
below, each query instance tends to be of a manageable size.
K-nearest Neighbors Classifier: The nearest-neighbor classifier compares each element of a query
set to each element of a training set, and discovers examples with minimal distances from the
queries. The map stage computes distance metrics, while the reduce stage tracks k examples for
each label that have minimal distance to the query.
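A compact local simulation of this decomposition (with made-up points and labels) could be:

import heapq
import numpy as np

train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
labels = ["a", "a", "b", "b"]
query = np.array([5.5, 5.2])

def map_distances(offset, chunk):
    # Map: distance from the (broadcast) query to each training datum.
    return [(float(np.linalg.norm(x - query)), labels[offset + i])
            for i, x in enumerate(chunk)]

def reduce_knn(all_pairs, k=3):
    # Reduce: keep the k smallest distances overall.
    return heapq.nsmallest(k, all_pairs)

chunks = [(0, train[:2]), (2, train[2:])]   # two simulated map tasks
pairs = [p for offset, c in chunks for p in map_distances(offset, c)]
print(reduce_knn(pairs))                    # the nearest neighbors carry label 'b'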
Similarity-based Search: Finding the most similar instances to a given query has a similar
character, sifting through the training set to find the examples that minimize a distance
metric. Computing the distances is the map stage, while minimizing them is the reduce stage.
2.2 Performance and Implementation Issues
While the algorithms discussed above can all be implemented in parallel using the MapReduce
abstraction, our example applications from each category revealed a set of implementation
challenges.
We conducted all of our experiments on top of the Hadoop platform. In the discussion below, we
will address issues related both to the Hadoop implementation of MapReduce and to the
MapReduce framework itself.
2.2.1 Single-pass Learning
The single-pass learning algorithms described in the previous section are clearly amenable to the
MapReduce framework. We will focus here on the task of generating a syntactic translation
model from a set of sentence pairs, their word-level bilingual alignments, and their syntactic
structures.
Fig. 1: Running time in seconds versus training set size in thousands of sentence pairs, for local MapReduce, 3-node MapReduce, and the local reference implementation. The benefit of distributed computation quickly outweighs the overhead of a MapReduce implementation on a 3-node cluster.
Figure 1 shows the running times for various input sizes, demonstrating the overhead of running
MapReduce relative to the reference implementation. The cost of running Hadoop offsets some of
the benefit of parallelizing the code. Specifically, running on 3 machines gave a speed-up of
39% over the reference implementation, while the overhead of simulating a MapReduce computation
on a single machine was 51% of the compute cost of the reference implementation. Distributing
the task to a large cluster would clearly justify this overhead, but parallelizing to two
machines would give virtually no benefit for the largest data set size we tested.
A more promising metric shows that, as the size of the data scales, the distributed MapReduce
implementation maintains a low per-datum cost. We can isolate the variable-cost overhead of
each example by comparing the slopes of the curves in Figure 1, which are all near-linear. The
reference implementation shows a variable computation cost of 1.7 seconds per 1000 examples,
while the distributed implementation across three machines shows a cost of 0.5 seconds per 1000
examples, roughly a 3.4x per-datum speed-up. So the variable overhead of MapReduce is minimal
for this task, while the static overhead of distributing the code base and channeling the
processing through Hadoop's infrastructure is large. We would expect that substantially
increasing the size of the training set would accentuate the utility of distributing the
computation.
Thus far, we have assumed that the training data was already written to Hadoop's distributed
file system (DFS). The cost of distributing data is relevant in this setting, however, due to drawbacks
of Hadoop’s implementation of DFS. In the simple case of text processing, a training corpus
need only be distributed to Hadoop’s file system once, and can be processed by many different
applications.
On the other hand, this application references four different input sources, including sentences,
alignments and parse trees for each example. When copying these resources independently to the
DFS, the Hadoop implementation gives no control over how those files are mapped to remote
machines. Hence, no one machine necessarily contains all of the data necessary for a given
example.
Apache Mahout is a framework for implementing machine learning in the MapReduce paradigm.
3 Apache Mahout
Apache Mahout is a project of the Apache Software Foundation to produce free implementations
of distributed or otherwise scalable machine learning algorithms, focused primarily on the
areas of collaborative filtering, clustering, and classification. Many of the implementations
use the Apache Hadoop platform. Mahout also provides Java libraries for common math operations
(focused on linear algebra and statistics) and primitive Java collections. Mahout is a work in
progress; the number of implemented algorithms has grown quickly, but various algorithms are
still missing.
While Mahout's core algorithms for clustering, classification, and batch-based collaborative
filtering are implemented on top of Apache Hadoop using the map/reduce paradigm, the project
does not restrict contributions to Hadoop-based implementations. Contributions that run on a
single node or on a non-Hadoop cluster are also welcome. For example, the 'Taste'
collaborative-filtering recommender component of Mahout was originally a separate project and
can run stand-alone without Hadoop. Integration with Pregel-like graph-processing initiatives
is actively under discussion.
Mahout's algorithms include many new implementations built for speed on Mahout-Samsara. These
run on Spark, and some on H2O, which can mean as much as a 10x speed increase. You'll find
robust matrix decomposition algorithms as well as a naive Bayes classifier and collaborative
filtering. The new spark-itemsimilarity job enables the next generation of co-occurrence
recommenders, which can use entire user click streams and context in making recommendations.
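To give a flavor of what a co-occurrence recommender computes (a bare-bones illustration, not Mahout's actual spark-itemsimilarity implementation), one can count how often pairs of items appear together across users' interaction histories:

from collections import Counter
from itertools import combinations

user_items = {
    "u1": {"apple", "banana", "cherry"},
    "u2": {"apple", "banana"},
    "u3": {"banana", "cherry"},
}

cooccur = Counter()
for items in user_items.values():
    for a, b in combinations(sorted(items), 2):   # each unordered pair once per user
        cooccur[(a, b)] += 1

# Items most often seen together with 'banana':
print([(pair, n) for pair, n in cooccur.most_common() if "banana" in pair])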
3.1 Mahout overview
Mahout is scalable along several dimensions:
• Scalable to reasonably large data sets: the core algorithms for clustering, classification,
and batch-based collaborative filtering are implemented on top of Apache Hadoop using the
map/reduce paradigm. However, contributions are not restricted to Hadoop-based implementations;
contributions that run on a single node or on a non-Hadoop cluster are welcome as well, and the
core libraries are highly optimized to give good performance for non-distributed algorithms too.
• Scalable to support your business case: Mahout is distributed under the commercially friendly
Apache Software License.
• Scalable community: the goal of Mahout is to build a vibrant, responsive, diverse community
that facilitates discussion not only of the project itself but also of potential use cases.
Come to the mailing lists to find out more.
Currently Mahout supports mainly four use cases:
• Recommendation mining takes users' behavior and from that tries to find items users might like.
• Clustering takes, e.g., text documents and groups them into sets of topically related documents.
• Classification learns from existing categorized documents what documents of a specific
category look like, and is able to assign unlabeled documents to the (hopefully) correct category.
• Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart
content) and identifies which individual items usually appear together.
3.1.1 Apache Mahout Integration
The flexibility and power of Apache Mahout (http://mahout.apache.org/) in conjunction with
Hadoop is invaluable. Therefore, I have packaged the most recent stable release of Mahout (0.5),
and am very excited to work with the Mahout community, becoming much more involved with the
project as both Mahout and Hadoop continue to grow.
3.1.2 Why we are packaging Mahout with Hadoop
Machine learning is an entire field devoted to information retrieval, statistics, linear
algebra, analysis of algorithms, and many other subjects. This field allows us to examine
things such as recommendation engines involving new friends, love interests, and new products.
We can do incredibly advanced analysis around genetic sequencing and examination, distributed
search and frequent-pattern matching, as well as mathematical analysis with vectors, matrices,
and singular value decomposition (SVD). Apache Mahout is an open source project of the Apache
Software Foundation devoted to machine learning. Mahout can operate on top of Hadoop, which
allows the user to apply machine learning, via a selection of algorithms in Mahout, to
distributed computing on Hadoop. Mahout packages popular machine learning algorithms such as:
 Recommendation mining takes users' behavior and finds items that a given user might like.
 Clustering takes, e.g., text documents and groups them based on related document topics.
 Classification learns from existing categorized documents what documents of a specific
category look like, and is able to assign unlabeled documents to the appropriate category.
 Frequent itemset mining takes a set of item groups (e.g., terms in a query session, shopping
cart content) and identifies which individual items typically appear together.
We are very excited to be working with the Apache Mahout community and highly encourage
everyone who is currently using CDH to give Mahout a try! As always, we are open to any
guests who would like to blog about their experience using Mahout with CDH.
3.2 Installing Mahout
Mahout is a collection of highly scalable machine learning algorithms for very large data
sets. Although the real power of Mahout can be vouched for only on large HDFS data, Mahout
also supports running algorithms on local file system data, which can help you get a feel for
how to run Mahout algorithms. Before you can run any Mahout algorithm, you need a Mahout
installation ready on your Linux machine, which can be set up as described below.
Step 1:
Download mahout-distribution-0.x.tar.gz from the Apache download mirrors and extract the
contents of the package to a location of your choice; here we pick /usr/local. Make sure to
change the owner of all the files to the hduser user and hadoop group, for example:
$ cd /usr/local
$ sudo tar xzf mahout-distribution-0.x.tar.gz
$ sudo mv mahout-distribution-0.x mahout
$ sudo chown -R hduser:hadoop mahout
This should result in a folder named
/usr/local/mahout
Now you can run any of the algorithms using the script bin/mahout in the extracted
folder. To test your installation, you can also run
$ bin/mahout
without any other arguments.
Now we will set the path in the .bashrc file:
export MAHOUT_HOME=/usr/local/mahout
export PATH=$PATH:$MAHOUT_HOME/bin
Step 2:
Create a directory where you want to check out the Mahout code; here we call it /app/mahout:
$ sudo mkdir -p /app/mahout
$ sudo chown hduser:hadoop /app/mahout
# ...and if you want to tighten up security, chmod from 755 to 750:
$ sudo chmod 750 /app/mahout
Step 3:
Now we point Hadoop at the Mahout libraries in hadoop-env.sh, for example by extending the
Hadoop classpath:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/local/mahout/lib/*
Step 4:
Now we install Maven:
$ sudo tar xzf apache-maven-2.0.9-bin.tar.gz
$ sudo mv apache-maven-2.0.9 maven
$ sudo chown -R hduser:hadoop maven
Now we will set the path in the .bashrc file:
export M2_HOME=/usr/local/maven
export PATH=$PATH:$M2_HOME/bin
Now we can start Mahout with the bin/mahout command; with that, the Mahout installation is
complete, and we can use Mahout with our Hadoop configuration.
Fig. 2: Output showing the installed Maven version.
4 Advantages and Disadvantages
Advantages
The paper "Map-Reduce for Machine Learning on Multicore" shows 10 machine learning
algorithms, which can benefit from map reduce model. The key point is "any algorithm fitting
the Statistical Query Model may be written in a certain “summation form.”, and the algorithms
can be expressed as summation form can apply map reduce programming model.
Disadvantages
MapReduce does not work well when there are computational dependencies in the data. This
limitation makes it difficult to represent algorithms that operate on structured models.
As a consequence, when confronted with large-scale problems, we often abandon rich structured
models in favor of overly simplistic methods that are amenable to the MapReduce abstraction.
In the machine learning community, numerous algorithms iteratively transform parameters during
both learning and inference, e.g., belief propagation, expectation maximization, gradient
descent, and Gibbs sampling. These algorithms iteratively refine a set of parameters until some
termination criterion is met.
Invoking MapReduce in each iteration can still speed up the computation. The point, however, is
that we want a better abstraction framework that makes it possible to embrace the graphical
structure of the data, to express sophisticated scheduling, and to assess termination automatically.
5 Conclusions
By virtue of its simplicity and fault tolerance, MapReduce proved to be an admirable
gateway to parallelizing machine learning applications. The benefits of easy development
and robust computation did come at a price in terms of performance, however, and performance
varied substantially with how well the system was tuned.
Defining common settings for machine learning algorithms led us to discover the core
shortcomings of Hadoop's MapReduce implementation. We were able to address the most
significant one: the need to broadcast data to map tasks. We also greatly improved the
convenience of running MapReduce jobs from within existing Java applications. However,
the ability to tie together the distribution of parallel files on the DFS remains an
outstanding challenge.
All in all, MapReduce represents a promising direction for future machine learning
implementations. As we continue to develop Hadoop and the tools that surround it, we must
strive to minimize the compromises between convenience and performance, providing a
platform that allows for both efficient processing and rapid application development.
6 References
 http://mahout.apache.org/
 https://chameerawijebandara.wordpress.com/2014/01/03/install-mahout-in-ubuntu-for-beginners/
 http://nivirao.blogspot.com/2012/04/installing-apache-mahout-on-ubuntu.html
 https://help.ubuntu.com/community/Java
 http://mahout.apache.org/developers/buildingmahout.html
More Related Content

What's hot

Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...Mumbai Academisc
 
Meetup sthlm - introduction to Machine Learning with demo cases
Meetup sthlm - introduction to Machine Learning with demo casesMeetup sthlm - introduction to Machine Learning with demo cases
Meetup sthlm - introduction to Machine Learning with demo casesZenodia Charpy
 
RapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerRapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerDataminingTools Inc
 
Graph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media AnalyticsGraph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media AnalyticsNYC Predictive Analytics
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningRahul Jain
 
Machine learning
Machine learningMachine learning
Machine learningeonx_32
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learningshivani saluja
 
Primer to Machine Learning
Primer to Machine LearningPrimer to Machine Learning
Primer to Machine LearningJeff Tanner
 
New Directions in Mahout's Recommenders
New Directions in Mahout's RecommendersNew Directions in Mahout's Recommenders
New Directions in Mahout's Recommenderssscdotopen
 
Pareto depth for multiple-query image retrieval
Pareto depth for multiple-query image retrievalPareto depth for multiple-query image retrieval
Pareto depth for multiple-query image retrievaljpstudcorner
 
Interpretable machine learning
Interpretable machine learningInterpretable machine learning
Interpretable machine learningSri Ambati
 
Deep learning at nmc devin jones
Deep learning at nmc devin jones Deep learning at nmc devin jones
Deep learning at nmc devin jones Ido Shilon
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutTed Dunning
 
Journey to learn Machine Learning & Neural Network - Basics
Journey to learn Machine Learning & Neural Network - BasicsJourney to learn Machine Learning & Neural Network - Basics
Journey to learn Machine Learning & Neural Network - BasicsArocom IT Solutions Pvt. Ltd
 
Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21GĂźlden BilgĂźtay
 
Software defect estimation using machine learning algorithms
Software defect estimation using machine learning algorithmsSoftware defect estimation using machine learning algorithms
Software defect estimation using machine learning algorithmsVenkat Projects
 
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...Editor IJCATR
 

What's hot (20)

C3 w2
C3 w2C3 w2
C3 w2
 
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...
 
Meetup sthlm - introduction to Machine Learning with demo cases
Meetup sthlm - introduction to Machine Learning with demo casesMeetup sthlm - introduction to Machine Learning with demo cases
Meetup sthlm - introduction to Machine Learning with demo cases
 
RapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid MinerRapidMiner: Data Mining And Rapid Miner
RapidMiner: Data Mining And Rapid Miner
 
Graph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media AnalyticsGraph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media Analytics
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Machine learning
Machine learningMachine learning
Machine learning
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
C3 w3
C3 w3C3 w3
C3 w3
 
Primer to Machine Learning
Primer to Machine LearningPrimer to Machine Learning
Primer to Machine Learning
 
New Directions in Mahout's Recommenders
New Directions in Mahout's RecommendersNew Directions in Mahout's Recommenders
New Directions in Mahout's Recommenders
 
Pareto depth for multiple-query image retrieval
Pareto depth for multiple-query image retrievalPareto depth for multiple-query image retrieval
Pareto depth for multiple-query image retrieval
 
Interpretable machine learning
Interpretable machine learningInterpretable machine learning
Interpretable machine learning
 
Deep learning at nmc devin jones
Deep learning at nmc devin jones Deep learning at nmc devin jones
Deep learning at nmc devin jones
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache Mahout
 
E0322035037
E0322035037E0322035037
E0322035037
 
Journey to learn Machine Learning & Neural Network - Basics
Journey to learn Machine Learning & Neural Network - BasicsJourney to learn Machine Learning & Neural Network - Basics
Journey to learn Machine Learning & Neural Network - Basics
 
Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21
 
Software defect estimation using machine learning algorithms
Software defect estimation using machine learning algorithmsSoftware defect estimation using machine learning algorithms
Software defect estimation using machine learning algorithms
 
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay...
 

Viewers also liked

Hadoop in Data Warehousing
Hadoop in Data WarehousingHadoop in Data Warehousing
Hadoop in Data WarehousingAlexey Grigorev
 
Social Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data TechnologiesSocial Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data TechnologiesNicolas Morales
 
Predicting Consumer Behaviour via Hadoop
Predicting Consumer Behaviour via HadoopPredicting Consumer Behaviour via Hadoop
Predicting Consumer Behaviour via HadoopSkillspeed
 
Big Data & Analytics MapReduce/Hadoop – A programmer’s perspective
Big Data & Analytics MapReduce/Hadoop – A programmer’s perspectiveBig Data & Analytics MapReduce/Hadoop – A programmer’s perspective
Big Data & Analytics MapReduce/Hadoop – A programmer’s perspectiveEMC
 
Real-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionReal-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionRevolution Analytics
 
Design patterns in MapReduce
Design patterns in MapReduceDesign patterns in MapReduce
Design patterns in MapReduceAkhilesh Joshi
 
Daniel Abadi HadoopWorld 2010
Daniel Abadi HadoopWorld 2010Daniel Abadi HadoopWorld 2010
Daniel Abadi HadoopWorld 2010Daniel Abadi
 
Predictive Analytics on Big Data. DIY or BUY?
Predictive Analytics on Big Data. DIY or BUY?Predictive Analytics on Big Data. DIY or BUY?
Predictive Analytics on Big Data. DIY or BUY?Apigee | Google Cloud
 
Population Health Management, Predictive Analytics, Big Data and Text Analytics
Population Health Management, Predictive Analytics, Big Data and Text AnalyticsPopulation Health Management, Predictive Analytics, Big Data and Text Analytics
Population Health Management, Predictive Analytics, Big Data and Text AnalyticsFrank Wang
 
Evaluating Big Data Predictive Analytics Platforms
Evaluating Big Data Predictive Analytics PlatformsEvaluating Big Data Predictive Analytics Platforms
Evaluating Big Data Predictive Analytics PlatformsTeradata Aster
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RDatabricks
 
Big data and Predictive Analytics By : Professor Lili Saghafi
Big data and Predictive Analytics By : Professor Lili SaghafiBig data and Predictive Analytics By : Professor Lili Saghafi
Big data and Predictive Analytics By : Professor Lili SaghafiProfessor Lili Saghafi
 
Informatica Pentaho Etl Tools Comparison
Informatica Pentaho Etl Tools ComparisonInformatica Pentaho Etl Tools Comparison
Informatica Pentaho Etl Tools ComparisonRoberto Espinosa
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingMohammad Mustaqeem
 
Big Data & Artificial Intelligence
Big Data & Artificial IntelligenceBig Data & Artificial Intelligence
Big Data & Artificial IntelligenceZavain Dar
 
Predictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligencePredictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligenceManish Jain
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 

Viewers also liked (20)

Pig Experience
Pig ExperiencePig Experience
Pig Experience
 
Hadoop in Data Warehousing
Hadoop in Data WarehousingHadoop in Data Warehousing
Hadoop in Data Warehousing
 
Social Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data TechnologiesSocial Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data Technologies
 
Predicting Consumer Behaviour via Hadoop
Predicting Consumer Behaviour via HadoopPredicting Consumer Behaviour via Hadoop
Predicting Consumer Behaviour via Hadoop
 
Big Data & Analytics MapReduce/Hadoop – A programmer’s perspective
Big Data & Analytics MapReduce/Hadoop – A programmer’s perspectiveBig Data & Analytics MapReduce/Hadoop – A programmer’s perspective
Big Data & Analytics MapReduce/Hadoop – A programmer’s perspective
 
Real-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionReal-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to Production
 
Design patterns in MapReduce
Design patterns in MapReduceDesign patterns in MapReduce
Design patterns in MapReduce
 
Daniel Abadi HadoopWorld 2010
Daniel Abadi HadoopWorld 2010Daniel Abadi HadoopWorld 2010
Daniel Abadi HadoopWorld 2010
 
Predictive Analytics on Big Data. DIY or BUY?
Predictive Analytics on Big Data. DIY or BUY?Predictive Analytics on Big Data. DIY or BUY?
Predictive Analytics on Big Data. DIY or BUY?
 
Population Health Management, Predictive Analytics, Big Data and Text Analytics
Population Health Management, Predictive Analytics, Big Data and Text AnalyticsPopulation Health Management, Predictive Analytics, Big Data and Text Analytics
Population Health Management, Predictive Analytics, Big Data and Text Analytics
 
Decision trees in hadoop
Decision trees in hadoopDecision trees in hadoop
Decision trees in hadoop
 
Evaluating Big Data Predictive Analytics Platforms
Evaluating Big Data Predictive Analytics PlatformsEvaluating Big Data Predictive Analytics Platforms
Evaluating Big Data Predictive Analytics Platforms
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
Big data and Predictive Analytics By : Professor Lili Saghafi
Big data and Predictive Analytics By : Professor Lili SaghafiBig data and Predictive Analytics By : Professor Lili Saghafi
Big data and Predictive Analytics By : Professor Lili Saghafi
 
Informatica Pentaho Etl Tools Comparison
Informatica Pentaho Etl Tools ComparisonInformatica Pentaho Etl Tools Comparison
Informatica Pentaho Etl Tools Comparison
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud Computing
 
Big Data & Artificial Intelligence
Big Data & Artificial IntelligenceBig Data & Artificial Intelligence
Big Data & Artificial Intelligence
 
Predictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligencePredictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial Intelligence
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 

Similar to mapReduce for machine learning

Parallel Machine Learning
Parallel Machine LearningParallel Machine Learning
Parallel Machine LearningJanani C
 
Parallel Computing 2007: Overview
Parallel Computing 2007: OverviewParallel Computing 2007: Overview
Parallel Computing 2007: OverviewGeoffrey Fox
 
A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...butest
 
Machine learning with Big Data power point presentation
Machine learning with Big Data power point presentationMachine learning with Big Data power point presentation
Machine learning with Big Data power point presentationDavid Raj Kanthi
 
MultiObjective(11) - Copy
MultiObjective(11) - CopyMultiObjective(11) - Copy
MultiObjective(11) - CopyAMIT KUMAR
 
Everything you need to know about AutoML
Everything you need to know about AutoMLEverything you need to know about AutoML
Everything you need to know about AutoMLArpitha Gurumurthy
 
Proposal for google summe of code 2016
Proposal for google summe of code 2016 Proposal for google summe of code 2016
Proposal for google summe of code 2016 Mahesh Dananjaya
 
Operation's research models
Operation's research modelsOperation's research models
Operation's research modelsAbhinav Kp
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduceVarad Meru
 
Machine Learning Algorithm for Business Strategy.pdf
Machine Learning Algorithm for Business Strategy.pdfMachine Learning Algorithm for Business Strategy.pdf
Machine Learning Algorithm for Business Strategy.pdfPhD Assistance
 
251 - Alogarithms Lects.pdf
251 - Alogarithms Lects.pdf251 - Alogarithms Lects.pdf
251 - Alogarithms Lects.pdfAbdulkadir Jibril
 
Producer consumer-problems
Producer consumer-problemsProducer consumer-problems
Producer consumer-problemsRichard Ashworth
 
Hardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpHardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpeSAT Publishing House
 
IRJET- Machine Learning Techniques for Code Optimization
IRJET-  	  Machine Learning Techniques for Code OptimizationIRJET-  	  Machine Learning Techniques for Code Optimization
IRJET- Machine Learning Techniques for Code OptimizationIRJET Journal
 
Performance Comparison between Pytorch and Mindspore
Performance Comparison between Pytorch and MindsporePerformance Comparison between Pytorch and Mindspore
Performance Comparison between Pytorch and Mindsporeijdms
 
Concurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core ProcessorsConcurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core ProcessorsCSCJournals
 
Aggreagate awareness
Aggreagate awarenessAggreagate awareness
Aggreagate awarenessYogeeswar Reddy
 
Feature Extraction and Analysis of Natural Language Processing for Deep Learn...
Feature Extraction and Analysis of Natural Language Processing for Deep Learn...Feature Extraction and Analysis of Natural Language Processing for Deep Learn...
Feature Extraction and Analysis of Natural Language Processing for Deep Learn...Sharmila Sathish
 
Pretzel: optimized Machine Learning framework for low-latency and high throu...
Pretzel: optimized Machine Learning framework for  low-latency and high throu...Pretzel: optimized Machine Learning framework for  low-latency and high throu...
Pretzel: optimized Machine Learning framework for low-latency and high throu...NECST Lab @ Politecnico di Milano
 

Similar to mapReduce for machine learning (20)

Parallel Machine Learning
Parallel Machine LearningParallel Machine Learning
Parallel Machine Learning
 
Parallel Computing 2007: Overview
Parallel Computing 2007: OverviewParallel Computing 2007: Overview
Parallel Computing 2007: Overview
 
A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...
 
Machine learning with Big Data power point presentation
Machine learning with Big Data power point presentationMachine learning with Big Data power point presentation
Machine learning with Big Data power point presentation
 
MultiObjective(11) - Copy
MultiObjective(11) - CopyMultiObjective(11) - Copy
MultiObjective(11) - Copy
 
Everything you need to know about AutoML
Everything you need to know about AutoMLEverything you need to know about AutoML
Everything you need to know about AutoML
 
Proposal for google summe of code 2016
Proposal for google summe of code 2016 Proposal for google summe of code 2016
Proposal for google summe of code 2016
 
Operation's research models
Operation's research modelsOperation's research models
Operation's research models
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
Machine Learning Algorithm for Business Strategy.pdf
Machine Learning Algorithm for Business Strategy.pdfMachine Learning Algorithm for Business Strategy.pdf
Machine Learning Algorithm for Business Strategy.pdf
 
ML.pdf
ML.pdfML.pdf
ML.pdf
 
251 - Alogarithms Lects.pdf
251 - Alogarithms Lects.pdf251 - Alogarithms Lects.pdf
251 - Alogarithms Lects.pdf
 
Producer consumer-problems
Producer consumer-problemsProducer consumer-problems
Producer consumer-problems
 
Hardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpHardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmp
 
IRJET- Machine Learning Techniques for Code Optimization
IRJET-  	  Machine Learning Techniques for Code OptimizationIRJET-  	  Machine Learning Techniques for Code Optimization
IRJET- Machine Learning Techniques for Code Optimization
 
Performance Comparison between Pytorch and Mindspore
Performance Comparison between Pytorch and MindsporePerformance Comparison between Pytorch and Mindspore
Performance Comparison between Pytorch and Mindspore
 
Concurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core ProcessorsConcurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core Processors
 
Aggreagate awareness
Aggreagate awarenessAggreagate awareness
Aggreagate awareness
 
Feature Extraction and Analysis of Natural Language Processing for Deep Learn...
Feature Extraction and Analysis of Natural Language Processing for Deep Learn...Feature Extraction and Analysis of Natural Language Processing for Deep Learn...
Feature Extraction and Analysis of Natural Language Processing for Deep Learn...
 
Pretzel: optimized Machine Learning framework for low-latency and high throu...
Pretzel: optimized Machine Learning framework for  low-latency and high throu...Pretzel: optimized Machine Learning framework for  low-latency and high throu...
Pretzel: optimized Machine Learning framework for low-latency and high throu...
 

Recently uploaded

Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
Romantic Opera MUSIC FOR GRADE NINE pptx
Romantic Opera MUSIC FOR GRADE NINE pptxRomantic Opera MUSIC FOR GRADE NINE pptx
Romantic Opera MUSIC FOR GRADE NINE pptxsqpmdrvczh
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
 
ROOT CAUSE ANALYSIS PowerPoint Presentation
ROOT CAUSE ANALYSIS PowerPoint PresentationROOT CAUSE ANALYSIS PowerPoint Presentation
ROOT CAUSE ANALYSIS PowerPoint PresentationAadityaSharma884161
 
Introduction to AI in Higher Education_draft.pptx
Tom Mitchell defines ML as "improving performance in some task with experience". However, this definition is quite broad, so a more specific description is often quoted: ML deals with systems that can learn from data. ML works with data and processes it to discover patterns that can later be used to analyze new data.

ML usually relies on a specific representation of the data, a set of "features" that are understandable for a computer. For example, if we are talking about text, it should be represented through the words it contains, or through other characteristics such as the length of the text, the number of emotional words, and so on. This representation depends on the task you are dealing with and is typically referred to as "feature extraction".

Types of ML

All ML tasks can be classified into several categories, the main ones being:
• supervised ML;
• unsupervised ML;
• reinforcement learning.
Now let us explain, in simple terms, the kind of problem each category addresses. Supervised ML relies on data for which the true label or class is already known. This is easiest to explain with an example. Suppose we want to teach a computer to distinguish pictures of cats from pictures of dogs. We can ask some of our friends to send us pictures of cats and dogs, each tagged 'cat' or 'dog'. Labeling is usually done by human annotators to ensure a high quality of data. Knowing the true labels of the pictures, we can use this data to "supervise" our algorithm in learning the right way to classify images. Once our algorithm learns how to classify images, we can use it on new data and predict labels ('cat' or 'dog' in our case) for previously unseen images.

2.1 Applications to Machine Learning

Many standard machine learning algorithms follow one of a few canonical data processing patterns, which we outline below. A large subset of these can be phrased as MapReduce tasks, illuminating the benefits that the MapReduce framework offers to the machine learning community. In this section, we investigate the performance trade-offs of using MapReduce from an algorithm-centric perspective, considering in turn three classes of ML algorithms and the issues of adapting each to a MapReduce framework. The resulting performance depends intimately on the design choices underlying the MapReduce implementation, and on how well those choices support the data processing pattern of the ML algorithm. We conclude this section with a discussion of changes and extensions to the Hadoop MapReduce implementation that would benefit the machine learning community.

2.1.1 A Taxonomy of Standard Machine Learning Algorithms

While ML algorithms can be classified along many dimensions, the one of primary interest here is procedural character: the data processing pattern of the algorithm. We consider single-pass, iterative, and query-based learning techniques, along with several example algorithms and applications.
2.1.2 Single-pass Learning

Many ML applications make only one pass through a data set, extracting relevant statistics for later use during inference. This relatively simple learning setting arises often in natural language processing, from machine translation to information extraction to spam filtering. These applications often fit perfectly into the MapReduce abstraction, encapsulating the extraction of local contributions in the map task, then combining those contributions to compute relevant statistics about the dataset as a whole. Consider the following examples, which illustrate common decompositions of these statistics.

Estimating a Language Model Multinomial: Extracting language models from a large corpus amounts to little more than counting n-grams, though some parameter smoothing over the statistics is also common. The map phase enumerates the n-grams in each training instance (typically a sentence or paragraph), and the reduce function counts the instances of each n-gram. (This option has been investigated as part of Alex Rasmussen's Hadoop-related CS 262 project this semester.) A minimal code sketch of this pattern appears at the end of this subsection.

Feature Extraction for Naive Bayes Classifiers: Estimating parameters for a naive Bayes classifier, or any fully observed Bayes net, again requires counting occurrences in the training data. In this case, however, feature extraction is often computation-intensive, perhaps involving small search or optimization problems for each datum. The reduce task, however, remains a summation over each (feature, label) pair.

Syntactic Translation Modeling: Generating a syntactic model for machine translation is an example of a research-level machine learning application that involves only a single pass through a preprocessed training set. Each training datum consists of a pair of sentences in two languages, an estimated alignment between the words in each, and an estimated syntactic parse tree for one sentence. (Generating appropriate training data for this task itself involves several applications of iterative learning algorithms, described in the following section.) The per-datum feature extraction encapsulated in the map phase for this task involves search over these coupled data structures.
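To make the n-gram counting pattern concrete, here is a minimal bigram-counting job written against Hadoop's Java MapReduce API. This is a sketch rather than code from the report: the class name, the whitespace tokenization, and the input/output paths are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BigramCount {

  // Map: emit every bigram in the input line with a count of one.
  public static class BigramMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text bigram = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] tokens = value.toString().trim().split("\\s+");
      for (int i = 0; i + 1 < tokens.length; i++) {
        bigram.set(tokens[i] + " " + tokens[i + 1]);
        context.write(bigram, ONE);
      }
    }
  }

  // Reduce: sum the counts emitted for each distinct bigram.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "bigram count");
    job.setJarByClass(BigramCount.class);
    job.setMapperClass(BigramMapper.class);
    job.setCombinerClass(SumReducer.class);   // local pre-aggregation before the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Submitted with two arguments (an input directory of sentences and an output directory), the job emits one line per bigram with its corpus count; reusing the reducer as a combiner shrinks the map output before the shuffle.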
2.1.3 Iterative Learning

The class of iterative ML algorithms (perhaps the most common within the machine learning research community) can also be expressed within the framework of MapReduce by chaining together multiple MapReduce tasks. While such algorithms vary widely in the type of operation they perform on each datum (or pair of data) in a training set, they share the common characteristic that a set of parameters is matched to the data set via iterative improvement. The update to these parameters across iterations must again decompose into per-datum contributions, as is the case in the example applications below. As with the examples discussed in the previous section, the reduce function is considerably less compute-intensive than the map tasks.

In the examples below, the contribution to the parameter updates from each datum (the map function) depends in a meaningful way on the output of the previous iteration. For example, the expectation computation of EM, or the inference computation in an SVM or perceptron classifier, can reference a large portion or all of the parameters generated by the algorithm. Hence, these parameters must remain available to the map tasks in a distributed environment. The information necessary to compute the map step of each algorithm is described below; the complications that arise because this information is vital to the computation are investigated later in the paper.

Expectation Maximization (EM): The well-known EM algorithm maximizes the likelihood of a training set given a generative model with latent variables. The E-step of the algorithm computes posterior distributions over the latent variables given current model parameters and the observed data. The M-step adjusts the model parameters to maximize the likelihood of the data, assuming that the latent variables take on their expected values. Projected onto the MapReduce framework, the map task computes posterior distributions over the latent variables of a datum using the current model parameters, and the maximization step is performed as a single reduction, which sums the sufficient statistics and normalizes to produce updated parameters. We consider applications in machine translation and speech recognition.
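The chaining pattern amounts to running one MapReduce job per EM iteration and feeding each iteration's output parameters back to the next job's map tasks. The following minimal sketch fits a toy one-dimensional, two-component Gaussian mixture (unit variances, equal priors), broadcasting the current means to the mappers through the job Configuration. It assumes the Hadoop 2.x Java API and one numeric datum per input line; the property names, paths, and fixed iteration cap are illustrative.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EmDriver {

  // E-step: each map task reads the current means from the Configuration and
  // emits per-datum sufficient statistics "weight,weight*x" for each component.
  public static class EStepMapper extends Mapper<Object, Text, Text, Text> {
    private double mu0, mu1;

    protected void setup(Context ctx) {
      mu0 = ctx.getConfiguration().getDouble("em.mu0", -1.0);
      mu1 = ctx.getConfiguration().getDouble("em.mu1", 1.0);
    }

    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      double x = Double.parseDouble(value.toString().trim());
      double p0 = Math.exp(-0.5 * (x - mu0) * (x - mu0));
      double p1 = Math.exp(-0.5 * (x - mu1) * (x - mu1));
      double r1 = p1 / (p0 + p1);   // responsibility of component 1 for this datum
      ctx.write(new Text("mu0"), new Text((1 - r1) + "," + (1 - r1) * x));
      ctx.write(new Text("mu1"), new Text(r1 + "," + r1 * x));
    }
  }

  // M-step: a single reduction per component sums the statistics and
  // normalizes to produce the updated mean.
  public static class MStepReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double w = 0, wx = 0;
      for (Text v : values) {
        String[] parts = v.toString().split(",");
        w += Double.parseDouble(parts[0]);
        wx += Double.parseDouble(parts[1]);
      }
      ctx.write(key, new DoubleWritable(wx / w));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    double mu0 = -1.0, mu1 = 1.0;                 // initial parameters
    for (int i = 0; i < 10; i++) {                // fixed iteration cap for brevity
      conf.setDouble("em.mu0", mu0);
      conf.setDouble("em.mu1", mu1);
      Job job = Job.getInstance(conf, "em iteration " + i);
      job.setJarByClass(EmDriver.class);
      job.setMapperClass(EStepMapper.class);
      job.setReducerClass(MStepReducer.class);
      job.setMapOutputKeyClass(Text.class);
      job.setMapOutputValueClass(Text.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(DoubleWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      Path out = new Path(args[1] + "-iter" + i);
      FileOutputFormat.setOutputPath(job, out);
      if (!job.waitForCompletion(true)) System.exit(1);
      // Read the updated means back so they can be re-broadcast next iteration.
      try (BufferedReader r = new BufferedReader(new InputStreamReader(
          FileSystem.get(conf).open(new Path(out, "part-r-00000"))))) {
        for (String line; (line = r.readLine()) != null; ) {
          String[] kv = line.split("\t");
          if (kv[0].equals("mu0")) mu0 = Double.parseDouble(kv[1]);
          else mu1 = Double.parseDouble(kv[1]);
        }
      }
    }
    System.out.println("mu0=" + mu0 + " mu1=" + mu1);
  }
}

For realistic models the parameters are far too large to pass through the Configuration, which is why the broadcast and parameter-distribution issues discussed below matter.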
For multivariate Gaussian mixture models (e.g., for speaker identification), these parameters are simply the mean vector and a covariance matrix. For HMM-GMM models (e.g., speech recognition), parameters are also needed to specify the state transition probabilities; the models, efficiently stored in binary form, occupy tens of megabytes. For word alignment models (e.g., machine translation), these parameters include word-to-word translation probabilities; these can number in the millions, even after pruning heuristics remove the unnecessary parameters.

Discriminative Classification and Regression: When fitting model parameters via a perceptron, boosting, or support vector machine algorithm for classification or regression, the map stage of training involves computing inference over a training example given the current model parameters. As in the EM case, a subset of the parameters from the previous iteration must be available for inference, while the reduce stage typically involves summing over parameter changes. Thus, all relevant model parameters must be broadcast to each map task. In a typical featurized setting, where hundreds or thousands of features are extracted from each training example, the parameter space needed for inference can be quite large.

2.1.4 Query-based Learning with Distance Metrics

Finally, we consider distance-based ML applications that directly reference the training set during inference, such as the nearest-neighbor classifier. In this setting, the training data are the parameters, and a query instance must be compared to each training datum. While multiple query instances can be processed simultaneously within a MapReduce implementation of these techniques, the query set must be broadcast to all map tasks, so again we have a need for the distribution of state information. In this case, however, the query information distributed to all map tasks needn't be processed concurrently; a query set can be broken up and processed over multiple MapReduce operations. In the examples below, each query instance tends to be of a manageable size.

K-nearest Neighbors Classifier: The nearest-neighbor classifier compares each element of a query set to each element of a training set, and discovers the examples with minimal distance from the queries. The map stage computes distance metrics, while the reduce stage tracks, for each label, the k examples with minimal distance to the query.
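A minimal single-query sketch of this pattern follows. It assumes the query vector is small enough to broadcast through the job Configuration and that training records are CSV lines of the form "label,f1,f2,..."; the property name, record format, and the value of k are illustrative.

import java.io.IOException;
import java.util.Collections;
import java.util.PriorityQueue;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KnnJob {

  // Map: compute the distance from each training record to the broadcast query.
  public static class DistanceMapper extends Mapper<Object, Text, Text, DoubleWritable> {
    private double[] query;

    protected void setup(Context ctx) {
      String[] parts = ctx.getConfiguration().get("knn.query").split(",");
      query = new double[parts.length];
      for (int i = 0; i < parts.length; i++) query[i] = Double.parseDouble(parts[i]);
    }

    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split(",");   // "label,f1,f2,..."
      double d = 0;
      for (int i = 0; i < query.length; i++) {
        double diff = Double.parseDouble(parts[i + 1]) - query[i];
        d += diff * diff;                             // squared Euclidean distance
      }
      ctx.write(new Text(parts[0]), new DoubleWritable(d));
    }
  }

  // Reduce: for each label, keep only the k smallest distances (max-heap of size k).
  public static class TopKReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private static final int K = 5;

    protected void reduce(Text label, Iterable<DoubleWritable> dists, Context ctx)
        throws IOException, InterruptedException {
      PriorityQueue<Double> heap = new PriorityQueue<>(K, Collections.reverseOrder());
      for (DoubleWritable d : dists) {
        heap.add(d.get());
        if (heap.size() > K) heap.poll();   // evict the current largest distance
      }
      for (double d : heap) ctx.write(label, new DoubleWritable(d));  // unordered output
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("knn.query", args[2]);          // e.g. "0.3,1.7,2.0"
    Job job = Job.getInstance(conf, "knn distances");
    job.setJarByClass(KnnJob.class);
    job.setMapperClass(DistanceMapper.class);
    job.setReducerClass(TopKReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}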
Similarity-based Search: Finding the most similar instances to a given query has a similar character, sifting through the training set to find examples that minimize a distance metric. Computing the distance is the map stage, while minimizing it is the reduce stage.

2.2 Performance and Implementation Issues

While the algorithms discussed above can all be implemented in parallel using the MapReduce abstraction, our example applications from each category revealed a set of implementation challenges. We conducted all of our experiments on top of the Hadoop platform. In the discussion below, we address issues related both to the Hadoop implementation of MapReduce and to the MapReduce framework itself.

2.2.1 Single-pass Learning

The single-pass learning algorithms described in the previous section are clearly amenable to the MapReduce framework. We focus here on the task of generating a syntactic translation model from a set of sentence pairs, their word-level bilingual alignments, and their syntactic structures.

Fig 1: running time in seconds against training set size (thousands of sentence pairs) for a local reference implementation, a local single-machine MapReduce simulation, and a 3-node MapReduce cluster; the benefit of distributed computation quickly outweighs the overhead of a MapReduce implementation on a 3-node cluster.

Figure 1 shows the running times for various input sizes, demonstrating the overhead of running
MapReduce relative to the reference implementation. The cost of running Hadoop cancels out some of the benefit of parallelizing the code. Specifically, running on 3 machines gave a speed-up of 39% over the reference implementation, while the overhead of simulating a MapReduce computation on a single machine was 51% of the compute cost of the reference implementation. Distributing the task to a large cluster would clearly justify this overhead, but parallelizing to two machines would give virtually no benefit for the largest data set size we tested.

A more promising metric shows that as the size of the data scales, the distributed MapReduce implementation maintains a low per-datum cost. We can isolate the variable-cost overhead of each example by comparing the slopes of the curves in Figure 1, which are all near-linear. The reference implementation shows a variable computation cost of 1.7 seconds per 1000 examples, while the distributed implementation across three machines shows a cost of 0.5 seconds per 1000 examples. So the variable overhead of MapReduce is minimal for this task, while the static overhead of distributing the code base and channeling the processing through Hadoop's infrastructure is large. We would expect that substantially increasing the size of the training set would accentuate the utility of distributing the computation.

Thus far, we have assumed that the training data was already written to Hadoop's distributed file system (DFS). The cost of distributing data is relevant in this setting, however, due to drawbacks of Hadoop's DFS implementation. In the simple case of text processing, a training corpus need only be distributed to Hadoop's file system once, and can then be processed by many different applications. This application, on the other hand, references four different input sources, including sentences, alignments and parse trees for each example. When these resources are copied independently to the DFS, the Hadoop implementation gives no control over how the files are mapped to remote machines; hence, no one machine necessarily contains all of the data necessary for a given example.

Apache Mahout is a framework for implementing machine learning in the MapReduce paradigm.

3 Apache Mahout

Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily in the areas of
collaborative filtering, clustering and classification. Many of the implementations use the Apache Hadoop platform. Mahout also provides Java libraries for common math operations (focused on linear algebra and statistics) and primitive Java collections. Mahout is a work in progress; the number of implemented algorithms has grown quickly, but various algorithms are still missing.

While Mahout's core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm, the project does not restrict contributions to Hadoop-based implementations. Contributions that run on a single node or on a non-Hadoop cluster are also welcomed. For example, the 'Taste' collaborative-filtering recommender component of Mahout was originally a separate project and can run stand-alone without Hadoop. Integration with initiatives such as Pregel-like graph frameworks is actively under discussion.

Mahout's algorithms also include many new implementations built for speed on Mahout-Samsara. These run on Spark, and some on H2O, which can mean as much as a 10x speed increase. You will find robust matrix decomposition algorithms as well as a Naive Bayes classifier and collaborative filtering. The new spark-itemsimilarity job enables the next generation of co-occurrence recommenders, which can use entire user click streams and context in making recommendations.

3.1 Mahout scalability and use cases

Mahout aims to be scalable to reasonably large data sets: its core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm, while the core libraries are highly optimized to give good performance for non-distributed algorithms as well. Mahout is also scalable to your business case, being distributed under the commercially friendly Apache Software License, and aims for a scalable community: the goal of Mahout is to build a vibrant, responsive, diverse community that facilitates discussion not only of the project itself but also of potential use cases. Come to the mailing lists to find out more. Currently Mahout supports mainly four use cases: recommendation mining, clustering, classification and frequent itemset mining. Each is described in more detail in Section 3.1.2 below.
3.1.1 Apache Mahout Integration

The flexibility and power of Apache Mahout (http://mahout.apache.org/) in conjunction with Hadoop is invaluable. We have therefore packaged the most recent stable release of Mahout (0.5), and we are very excited to work with the Mahout community and to become much more involved with the project as both Mahout and Hadoop continue to grow.

3.1.2 Why we are packaging Mahout with Hadoop

Machine learning is an entire field devoted to information retrieval, statistics, linear algebra, analysis of algorithms, and many other subjects. It allows us to examine things such as recommendation engines involving new friends, love interests, and new products. We can do incredibly advanced analysis around genetic sequencing and examination, distributed search and frequent-pattern matching, as well as mathematical analysis with vectors, matrices, and singular value decomposition (SVD). Apache Mahout is an open source project of the Apache Software Foundation devoted to machine learning. Mahout can operate on top of Hadoop, which allows the user to apply machine learning, via a selection of algorithms in Mahout, to distributed computing via Hadoop. Mahout packages popular machine learning algorithms such as:

• Recommendation mining, which takes users' behavior and finds items a specified user might like (a stand-alone sketch follows at the end of this subsection).
• Clustering, which takes e.g. text documents and groups them based on related document topics.
• Classification, which learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabeled documents to the appropriate category.
• Frequent itemset mining, which takes a set of item groups (e.g. terms in a query session, shopping cart contents) and identifies which individual items typically appear together.

We are very excited to be working with the Apache Mahout community and highly encourage everyone currently using CDH to give Mahout a try! As always, we are open to guests who would like to blog about their experience using Mahout with CDH.
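To illustrate the recommendation-mining use case, the stand-alone Taste component can produce user-based recommendations in a few lines of Java. This is a minimal sketch, assuming a ratings file in Taste's "userID,itemID,rating" CSV format; the file name, neighborhood size, and user ID are illustrative.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteExample {
  public static void main(String[] args) throws Exception {
    // Ratings file: one "userID,itemID,rating" triple per line (illustrative name).
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Neighborhood of the 10 most similar users.
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top-3 recommendations for user 1 (user ID is illustrative).
    List<RecommendedItem> recommendations = recommender.recommend(1, 3);
    for (RecommendedItem item : recommendations) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}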
3.2 Installing Mahout

Mahout is a collection of highly scalable machine learning algorithms for very large data sets. Although the real power of Mahout shows only on large HDFS data, Mahout also supports running algorithms on local file system data, which can help you get a feel for how to run Mahout algorithms. Before you can run any Mahout algorithm you need a Mahout installation ready on your Linux machine, which can be carried out as described below.

Step 1: Download mahout-distribution-0.x.tar.gz from the Apache download mirrors and extract the contents of the Mahout package to a location of your choice; here we pick /usr/local. Make sure to change the owner of all the files to the hduser user and hadoop group, for example:

$ cd /usr/local
$ sudo tar xzf mahout-distribution-0.x.tar.gz
$ sudo mv mahout-distribution-0.x mahout
$ sudo chown -R hduser:hadoop mahout

This should result in a folder named /usr/local/mahout.
Now you can run any of the algorithms using the script bin/mahout in the extracted folder. To test your installation, you can also run bin/mahout without any other arguments. Next, set the path in the .bashrc file:

export MAHOUT_HOME=/usr/local/mahout
export PATH=$PATH:$MAHOUT_HOME/bin

Step 2: Create a directory where you want to check out the Mahout code; we will call it MAHOUT_HOME here:

$ sudo mkdir -p /app/mahout
$ sudo chown hduser:hadoop /app/mahout
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/mahout

Step 3: Set the Hadoop configuration path in hadoop-env.sh so that Hadoop picks up the Mahout libraries, for example by adding:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/local/mahout/lib/*

Step 4: Install Maven:

$ sudo tar xzf apache-maven-2.0.9-bin.tar.gz
$ sudo mv apache-maven-2.0.9 maven
$ sudo chown -R hduser:hadoop maven

Now set the Maven path in the .bashrc file:

export M2_HOME=/usr/local/maven
export PATH=$PATH:$M2_HOME/bin

You can now start Mahout with the bin/mahout command. The Mahout installation is complete, and Mahout can be used with the Hadoop configuration.

Fig 2: showing the installed Maven version.
4 Advantages and Disadvantages

Advantages

The paper "Map-Reduce for Machine Learning on Multicore" shows 10 machine learning algorithms which can benefit from the MapReduce model. The key point is that any algorithm fitting the Statistical Query Model may be written in a certain "summation form", and any algorithm that can be expressed in summation form can use the MapReduce programming model (a small illustration of this form follows at the end of this section).

Disadvantages

MapReduce does not work well when there are computational dependencies in the data. This limitation makes it difficult to represent algorithms that operate on structured models. As a consequence, when confronted with large-scale problems, we often abandon rich structured models in favor of overly simplistic methods that are amenable to the MapReduce abstraction. In the machine-learning community, numerous algorithms iteratively transform parameters during both learning and inference, e.g., belief propagation, expectation maximization, gradient descent and Gibbs sampling. These algorithms iteratively refine a set of parameters until some termination criterion is met. Invoking MapReduce in each iteration can still speed up the computation, but the point is that we want a better abstraction framework, one that makes it possible to embrace the graphical structure of the data, to express sophisticated scheduling, or to assess termination automatically.
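To make the summation form concrete: the batch gradient of least-squares linear regression decomposes into per-datum contributions that can be computed independently (the map) and then added together (the reduce). The sketch below is illustrative and not taken from the paper; it expresses the decomposition on a single multicore machine with Java parallel streams, and the toy data, learning rate, and iteration count are arbitrary.

import java.util.Arrays;
import java.util.stream.IntStream;

public class SummationForm {

  // Gradient of the squared error for linear regression, in summation form:
  // grad = sum_i (w . x_i - y_i) * x_i. Each datum's contribution is computed
  // independently (the "map") and the partial gradients are added (the "reduce").
  static double[] gradient(double[][] x, double[] y, double[] w) {
    return IntStream.range(0, x.length).parallel()
        .mapToObj(i -> {
          double err = dot(w, x[i]) - y[i];          // per-datum residual
          double[] g = new double[w.length];
          for (int j = 0; j < w.length; j++) g[j] = err * x[i][j];
          return g;                                  // per-datum gradient ("map")
        })
        .reduce(new double[w.length], SummationForm::add);  // summation ("reduce")
  }

  static double dot(double[] a, double[] b) {
    double s = 0;
    for (int j = 0; j < a.length; j++) s += a[j] * b[j];
    return s;
  }

  static double[] add(double[] a, double[] b) {
    double[] c = new double[a.length];
    for (int j = 0; j < a.length; j++) c[j] = a[j] + b[j];
    return c;
  }

  public static void main(String[] args) {
    double[][] x = {{1, 1}, {1, 2}, {1, 3}};  // toy data: bias feature plus one input
    double[] y = {1, 2, 3};
    double[] w = {0, 0};
    for (int t = 0; t < 1000; t++) {          // plain batch gradient descent
      double[] g = gradient(x, y, w);
      for (int j = 0; j < w.length; j++) w[j] -= 0.05 * g[j];
    }
    System.out.println(Arrays.toString(w));   // converges toward [0, 1]
  }
}

The same decomposition is exactly what allows the work to be spread over mappers on disjoint subsets of the data, with a reducer summing the partial gradients.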
5 Conclusions

By virtue of its simplicity and fault tolerance, MapReduce proved to be an admirable gateway to parallelizing machine learning applications. The benefits of easy development and robust computation did come at a price in terms of performance, however, and proper tuning of the system led to substantial variance in performance. Defining common settings for machine learning algorithms led us to discover the core shortcomings of Hadoop's MapReduce implementation. We were able to address the most significant of these, the need to broadcast data to map tasks, and we also greatly improved the convenience of running MapReduce jobs from within existing Java applications. However, the ability to tie together the distribution of related files on the DFS remains an outstanding challenge. All in all, MapReduce represents a promising direction for future machine learning implementations. In continuing to develop Hadoop and the tools that surround it, we must strive to minimize the compromises between convenience and performance, providing a platform that allows for both efficient processing and rapid application development.
6 References

• http://mahout.apache.org/
• https://chameerawijebandara.wordpress.com/2014/01/03/install-mahout-in-ubuntu-for-beginners/
• http://nivirao.blogspot.com/2012/04/installing-apache-mahout-on-ubuntu.html
• https://help.ubuntu.com/community/Java
• http://mahout.apache.org/developers/buildingmahout.html