1. Course Structure
Slide 2 www.edureka.in/mahout
 Module 1:
Introduction to Machine Learning
and Apache Mahout
 Module 2:
Mahout and Hadoop
 Module 3:
 Module 4:
Implementing a Recommender and
 Module 5:
 Module 6:
 Module 7:
Mahout and Amazon EMR
 Module 8:
2. How it Works?
Slide 3 www.edureka.in/mahout
 Live Classes
 Class Recordings
 Module wise Quizzes, Coding Assignments
 24x7 on-demand Technical Support
 Sample Application and Live Project
 Online Certification Exam
 Lifetime access to the Learning Management System
3. Module 1
Slide 4 www.edureka.in/mahout
 Mahout Overview
 ML Common Use Cases
 Algorithms in Mahout
 Mahout Commercial Use
 Mahout Summary
 Supervised and Unsupervised Learning
 Introduction of Clustering and Classification
 Similarity Metrics
 Similarity by correlation
 Similarity by distance
 Distance Measure Types
4. Mahout Overview
 Mahout began life in 2008 as a subproject of Apache’s Lucene project, which provides the well-known
open source search engine of the same name.
 Lucene provides advanced implementations of search, text mining, and information-retrieval
 In the universe of computer science, these concepts are adjacent to machine learning techniques like
clustering and, to an extent, classification.
 As a result, some of the work of the Lucene
committers that fell more into these machine
learning areas was spun off into its own subproject.
 Soon after, Mahout absorbed the Taste open source
collaborative filtering project.
Slide 5 www.edureka.in/mahout
5. Mahout Overview
Apache Mahout and its related projects within the Apache Software Foundation
Slide 6 www.edureka.in/mahout
6. What are we going to learn today?
What is machine learning
Can a Machine learn
How to do it
Slide 7 www.edureka.in/mahout
7. Mahout : Scalable Machine learning Library
Machine Learning is a Programming Computers to optimize a Performance Criterion
using Example Data or Past experience
 Machine learning – what does it mean?
 A branch of artificial intelligence
 Systems that learn from data
 Classify data after learning
 Learn on test data sets
 Generalisation – the ability to classify unseen data sets
Slide 8 www.edureka.in/mahout
8. What is Mahout?
Slide 9 www.edureka.in/mahout
 Apache Mahout is an Apache project to produce open source
implementations of distributed or otherwise scalable machine learning
algorithms focused primarily in the areas of collaborative filtering,
clustering and classification, often leveraging, but not limited to, the
 The Apache Mahout project aims to make building intelligent
applications easier and faster. Mahout co-founder Grant Ingersoll
introduces the basic concepts of machine learning and then
demonstrates how to use Mahout to cluster documents, make
recommendations, and organize content.
Three specific machine-learning tasks that Mahout currently
 Collaborative Filtering
9.  Machine Learning is a class of algorithms which is data-driven, i.e.
unlike "normal" algorithms it is the data that "tells" what the "good
An hypothetical non-machine learning algorithm for face recognition in
images would try to define what a face is (round skin-like-colored disk,
with dark area where you expect the eyes etc).
A machine learning algorithm would not have such coded definition,
"learn-by-examples": you'll show several images of faces and not-
faces and a good algorithm will eventually learn and be able to predict
whether or not an unseen image is a face.
What is Mahout (Cont’d).
Slide 10 www.edureka.in/mahout
10. Mahout – How does it work?
Supports four Use
Has many supplied
Recommendation mining Clustering Classification Fixing Item Set Mining
Slide 11 www.edureka.in/mahout
12. Mahout Overview
ML is all over the web today
Mahout has functionality for many
of today’s common machine
MapReduce magic in action
Mahout is about scalable
Slide 13 www.edureka.in/mahout
13. Hello There!!
My name is Annie.
I love quizzes and
puzzles and I am here to
make you guys think and
answer my questions.
Slide 14 www.edureka.in/mahout
14. Annie’s Question
Q: What is Machine Learning?
Slide 15 www.edureka.in/mahout
15. Annie’s answer
A: Machine Learning is a branch of Artificial Intelligence.
Training the machine through data in such a way that the
machines can simulate human like decisions - TRUE
Slide 16 www.edureka.in/mahout
16. Q: Is Statistical Modeling equivalent to Machine Learning?
Slide 17 www.edureka.in/mahout
17. Answer: NO
Reason:- Statistical modeling is a way to identify the
relationships between variables through mathematical
equations. In statistical modeling the relation between the
variables is NOT deterministic rather Stochastic.
Whereas Machine learning uses Statistical modeling to train
the system and generate the model/functions
Slide 18 www.edureka.in/mahout
18. utilizes recommendation
systems to bring videos to a user
that it believes the user will be
They are designed to:
 Increase the numbers of
videos the user will watch
 Increase the length of time he
spends on the site, and
 Maximize the enjoyment of his
Machine Learning Use Cases – You Tube
Slide 19 www.edureka.in/mahout
19. User Activity:
In order to obtain personalized recommendations, YouTube's recommendation system combines the
related videos association rules with the user's personal activity on the site.
This includes several factors:
 There are the videos that were watched - along with a certain threshold, say by a certain date.
After all, you don't want to count videos watched from 2 years ago if the user has watched enough
videos, most likely.
 Also, YouTube factors in with emphasis any videos that were explicitly "liked", added to favourites,
given a rating, added to a playlist. The union of these videos is known as the seed set.
 Then, to compute the candidate recommendations for a seed set, YouTube expands it along the
Use Case – You Tube (Contd.)
Slide 20 www.edureka.in/mahout
20. Use Case – Wine Recommendation
Slide 21 www.edureka.in/mahout
21. Use Case – Wine Recommendation (Contd)
What wine will I enjoy? More than 2 million consumers turn
to the Internet for the answer to this question every day
• Mysterious ratings and adjective-based reviews do little
to help consumers decide which wine to buy
• They can't even agree amongst themselves
• Next Glass solves this problem by removing subjectivity
and applying science to deliver recommendations
based on your previous ratings
Slide 22 www.edureka.in/mahout
22. Use Case - Biometrics
Biometrics : The Science of establishing the identity
of an individual based on the physical, chemical or
behavioral attributes of the person.
Why is it Important?
 Identify Individual credentials
 Identify and prevent banking fraud
 Enforcement of law and security
Slide 23 www.edureka.in/mahout
23. How Does a Fingerprint Optical Scanner Work?
A fingerprint scanner system has two basic jobs
 Get an image of your finger
 Determine whether the pattern of ridges and valleys in this image matches the
pattern of ridges and valleys in pre-scanned images
 Only specific characteristics, which are unique to every fingerprint, are filtered are
saved as an encrypted biometric key or mathematical representation.
 No image of a fingerprint is ever saved, only a series of numbers (a binary code),
which is used for verification. The algorithm cannot be reconverted to an image, so
no one can duplicate your fingerprints
Slide 24 www.edureka.in/mahout
24. Use Case – Aadhaar
India is reportedly creating a biometric database to hold the fingerprints and face images for each
of 1.2 Billion citizens as part of its Unique Identification Project.
Slide 25 www.edureka.in/mahout
25. Use Case – Paycheck Secure System
All Trust Network Paycheck Secure System has enrolled over 6 Million users and over 70 Million
Slide 26 www.edureka.in/mahout
26. Computer Vision
Stimulate reality: Generate
complex, physically realistic
maintaining precise control over
Apply rigorous computational
principles to develop theories of
human visual perception
Create perceptually inspired “short
cuts” to increase efficiency, or
achieve advanced effects
Analysis for synthesis:
Application of segmentation,
learning, etc. to rendering and
Imitate design principles of
biological systems to solve under-
constrained vision problems
Test vision algorithms on
computer generated images for
which all scene parameters are
Slide 27 www.edureka.in/mahout
27. Learning Techniques
Attain knowledge by study, experience, or by being taught.
Types of Learning
Slide 28 www.edureka.in/mahout
28. Supervised Learning
Slide 29 www.edureka.in/mahout
Supervised learning : Training data includes both the input and the desired results.
 For some examples, the correct results (targets) are known and are given in input to the model
during the learning process.
 The construction of a proper training, validation and test set (Bok) is crucial.
 These methods are usually fast and accurate.
 Have to be able to generalize: give the correct results when new data are given in input without
knowing a priori the target.
29. Example – Supervised Learning Model
Slide 30 www.edureka.in/mahout
31. Unsupervised learning
 The model is not provided with the correct results during the training.
 Can be used to cluster the input data in classes on the basis of their statistical properties only
Cluster significance and labeling.
 The labeling can be carried out even if the labels are only available for a small number of objects
representative of the desired classes.
Slide 32 www.edureka.in/mahout
32. Example – Unsupervised Learning Model
Cluster ID or
Slide 33 www.edureka.in/mahout
34. Annie’s Question
Q: What is true about Supervised learning and Unsupervised Learning
A)Supervised learning is creating model/function through labeled training data
B)Unsupervised learning is a way to find unknown groups in un-labeled
C)In Supervised learning the input observations contain the input vector and
the target variable (also called as label)
D) Unsupervised learning the input observations contain only input vector
E) All of the above
Slide 35 www.edureka.in/mahout
35. Annie’s Answer
A: All of the above
Slide 36 www.edureka.in/mahout
36. Q: K-means clustering algorithm fall under supervised
learning or un-supervised learning techniques
Slide 37 www.edureka.in/mahout
37. A: Unsupervised learning technique, as the input dataset
will NOT have the labels (target variable) and allow users to
infer hidden groups with in the input datasets
Slide 38 www.edureka.in/mahout
38. Mahout Use Cases
Slide 39 www.edureka.in/mahout
39. Top-level packages define the Mahout interfaces to these key abstractions:
Slide 40 www.edureka.in/mahout
A vector is a quantity or phenomenon that has two indepen
properties: magnitude and direction.
The term also denotes the mathematical or geometrical
representation of such a quantity.
Vector R A
R2= A2 + B2
tan ϴ = A/B
Slide 41 www.edureka.in/mahout
41. Visualizing Vectors
In two dimensions, vectors are represented as
an ordered list of values, one for eachdimension,
like (4, 3). Both representations are illustrated in
We often name the first dimension x and the
second y when dealing with two dimensions,
but this won’t matter for our purposes in
As far as we’re concerned, a vector
can have 2, 3, or 10,000 dimensions. The first is
dimension 0, the next is dimension 1,
and so on.
Slide 42 www.edureka.in/mahout
42. Vectors implementation in Mahout
Dense Vector Sequential Access
It can be thought of as an
array of doubles, whose
size is the number of
features in the data.
Because all the entries in
the array are preallocated
regardless of whether the
value is 0 or not, we call it
It is implemented as a
HashMap between an
integer and a double,
where only nonzero valued
features are allocated.
Hence, they’re called as
It is implemented as two
parallel arrays, one of
integers and the other of
doubles. Only nonzero valued
entries are kept in it.
se Vector, which is optimized
for random access , this one
is optimized for linear
Slide 43 www.edureka.in/mahout
43. Similarity Measurement
Similarity measurement definition
Similarity by Correlation Similarity by Distance
Slide 44 www.edureka.in/mahout
45. Euclidean distance measure
The Euclidean distance is the
simplest of all distance
It’s the most intuitive and
matches our normal idea of
For example, given two
points on a plane, the
Euclidean distance measure
could be calculated by using
a ruler to measure the
distance between them.
distance between two n-
dimensional vectors (a1, a2,
... , an) and (b1,b2, ... , bn)
The Mahout class that
implements this measure is
Euclidean Distance Measure.
Slide 46 www.edureka.in/mahout
46. Squared Euclidean distance measure
Slide 47 www.edureka.in/mahout
 Just as the name suggests, this distance measure’s value is the square of the value
 Returned by the Euclidean distance measure.
 For n-dimensional vectors (a1, a2, ... , an) and (b1, b2, ... ,bn) the distance becomes
d = (a1 – b1)2 + (a2 – b2)2 + ... + (an – bn)2
 The Mahout class that implements this measure is Squared Euclidean Distance Measure
47. Manhatten distance measure
Slide 48 www.edureka.in/mahout
 The distance between any two points is the sum of the absolute differences of their coordinates
 Mathematically, the Manhattan distance between two n-dimensional vectors (a1, a2, ... , an)
and (b1, b2, ... , bn) is
d = |a1 – b1| + |a2 – b2| + ... + |an – bn|
 The Mahout class that implements this measure is ManhattanDistanceMeasure.
48. Difference between Euclidean and Manhattan
From this image we can say that, The Euclidean distance measure gives 5.65 as the distance
between (2, 2) and (6, 6) whereas the Manhattan distance is 8.0
Slide 49 www.edureka.in/mahout
49. Cosine distance measure
 The cosine distance measure requires us to again think of points as vectors from the origin to
 These vectors form an angle, θ, between them, When this angle is small, the vectors must be
pointing in somewhat the same direction, and so in some sense the points are close.
 The cosine of this angle is near 1 when the angle is small, and decreases as it gets larger. The
cosine distance equation subtracts the cosine value from 1 in order to give a proper distance,
which is 0 when close and larger otherwise.
 The formula for the cosine distance between n-dimensional vectors (a1, a2, ... , an) and
(b1, b2, ... ,bn) is
Slide 50 www.edureka.in/mahout
50. Cosine distance measure
Cosine angle between the
vectors (2, 3) and (4, 1), as
calculated from the origin
Slide 51 www.edureka.in/mahout
 This measure of distance doesn’t account for the
length of the two vectors; all that matters is that
the points are in the same direction from the
 The cosine distance measure ranges from 0.0
(two vectors along the same direction) to 2.0
(two vectors in opposite directions).
 The Mahout class that implements this measure
 The cosine distance measure disregards the
lengths of the vectors. This may work well for
some data sets, but it’ll lead to poor clustering in
others where the relative lengths
of the vectors contain valuable information.
51. Cosine distance measure
 The Tanimoto distance measure, also known as Jaccard’s distance measure, captures the
information about the angle and the relative distance between the points.
 The formula for the Tanimoto distance between two n-dimensional vectors (a1, a2, ... ,
an) and (b1, b2, ... , bn) is
Slide 52 www.edureka.in/mahout
52. Annie’s Question
Q: What is a Vector?
A) A vector has both magnitude and direction
B) A vector has only magnitude but not direction
C) A vector will NOT have magnitude but has only direction
D) None of the above
Slide 53 www.edureka.in/mahout
53. Annie’s Answer
A: A vector has both magnitude and direction
Slide 54 www.edureka.in/mahout
54. Q:Which one of the following vectors need more storage
A) Dense Vector
B) Sparse Vector (Random Access Sparse Vector/
Sequential Access Sparse Vector)
Slide 55 www.edureka.in/mahout
55. A: Dense Vector as regardless of presence of value of a
variable, the variables will get pre-allocated in the array
Slide 56 www.edureka.in/mahout
56. Q: What are the valid distance measures in the following
org.apache.mahout.common.distance.DistanceMeasureInterface in Mahout
B) Tanimoto Distance Measure
C) Manhatten Distance Measure
D) Cosine Distance Measure
E) Euclidean Distance Measure
F) Squared Eucliden Distance Measure
G) All of the above
Slide 57 www.edureka.in/mahout
57. A: All of the above
Slide 58 www.edureka.in/mahout
Its Your task
1. Install and setup Hadoop in
the Cloudera VM
2. Go through Java Essentials
3. Install and setup Myrrix
4. Install and setup Spark
Slide 59 www.edureka.in/mahout
Slide 60 www.edureka.in/mahout
Mahout API : https://builds.apache.org/job/Mahout-Quality/javadoc/
Apache Mahout : http://mahout.apache.org/
Mahout in Action : http://www.flipkart.com/mahout-in-action/p/itmdyndgwrbr7krj
Slide 61 www.edureka.in/mahout
1)Review Hadoop configuration files
d) Masters and slaves
2)Understand the differences between Mahout and Spark
3) Prepare the basics of Spark and MLib
4)Go through the basics of Myrrix Recommender Engine,