More Related Content
Similar to Big Data Analytics is not something which was just invented yesterday! (20)
More from bhargavi804095 (16)
Big Data Analytics is not something which was just invented yesterday!
- 1. Lesson 17
Mahout
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
1
- 2. Mahout
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
2
• Fast and Efficient Processing of Big
Data
• Processes very large datasets at the
cluster of machines with high efficiency
for above 10 M data points in shared-
nothing environment.
• Big Data analysis: datasets with over 1
million data points
- 3. Mahout
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
3
• Can run on multiple machines as well as
multi-core processing units
• Compute tasks fast when distributed
over the cloud, tasks runs parallel, and
run in shared nothing-computational
environment of multiple machines and
multi-core units
- 4. Mahout in sequential shared
environment
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
4
• Higher time efficiency for less than 1 M
data points.
- 5. Mahout Eclipse and Maven
Installing
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
5
• Download Maven and Eclipse
https://www.eclipse.org/downloads/
Add -Name: m2eclipse-Location: http://
download.eclipse.org/technology/m2e/rele
ases (Google “install m2eclipse”)
• Ubuntu: sudo apt-get install git
sudo apt-get install maven
- 6. Mahout Installing
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
6
• http://mahout.apache.org/general/
downloads.html
Latest Release: 0.13.0 - mahout-
distribution-0.13.tar.gz
• Apache Preparing for Mahout version
0.14.0 (June 2018)
- 7. Git Hub
• GitHub: Mahout Mirror
https://github.com/apache/mahout
• git://git.apache.org/mahout.git mahout-
trunk
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
7
- 8. Git Hub Maven under mahout-trunk
folder:
• <dependency>
• <groupId>org.apache.mahout</groupId>
• <artifactId>mahout-core</artifactId>
• <version>${mahout.version}</version>
• </dependency>
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
8
- 9. Mahout Environment Setting
• export
MAHOUT_HOME=/path/to/mahout
• export MAHOUT_LOCAL=true # for
running standalone on your dev
machine,
• # unset MAHOUT_LOCAL for running
on a cluster
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
9
- 10. Feature in Mahout 0.13.0 (2017)
• Enables easier implementations of most
modern machine learning and deep
learning algorithms
• Open-source distributed scalable linear
algebra library ViennaCL, the Java
wrapper library interface JavaCPP
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
10
- 11. Feature in Mahout 0.13.0 (2017)
• The graphics processor manufacturer,
NVIDIA CUDA bindings directly into
Mahout
• Easier to run matrix mathematics on
graphics cards (used in computers for
fast graphic computations)
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
11
- 12. Java Implementations
• Java libraries for common mathematics
and statistical operations (focused on
linear algebra) and primitive Java
Collection Interfaces
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
12
- 13. Figure 6.21 Mahout architecture
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
13
- 14. Features of Mahout
• Designed on top of Apache Hadoop
• Supports algebraic platforms like fast
computing Apache Spark paradigm and
MapReduce
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
14
- 15. Features of Mahout
• A scalable generalized tensor and linear
algebra solving engine, designed on top
of Apache Hadoop.
• Supports algebraic platforms like fast
computing Apache Spark paradigm and
MapReduce
• Distributed row matrix (DRM), scalable
libraries for matrix and vectors
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
15
- 16. Random Access Feature
• Random access means accessing vectors
using key, index or hash followed by
values, in any order
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
16
- 17. Clustering Implementations
• Contains several Spark and MapReduce
enabled, such as k-means, fuzzy k-
means, Canopy, Dirichlet and mean
shift, Latent Dirichlet Allocation,
• Spectral Clustering, MultiHash
Clustering
• Hierarchical clustering
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
17
- 18. Classification Implementations
• Single Machine Sequential: Stochastic
Gradient Descent (SGD), Logistic
Regression trained via SGD, Hidden
Markov, Multi-layer Perceptron
• Parallel Distributed (Map Reduce):
• Naïve Bayes, complementary Naïve
Bayes and Random Forest
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
18
- 19. Collaborative Filtering
• Enables making automatic predictions
about items and item sets of interests
• Similar item-sets mining
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
19
- 20. SVD++ and SGD
• Weighted matrix factorization SVD++,
and parallel SGD (in sequentially in
shared-data environment)
• SGD [an iterative learning algorithm in
which each training example is used to
pull the model M program slightly to
reach more closer to correct answer]
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
20
- 21. Recommender Implementation
• Collaborative filtering based
recommender,
• SVD recommender
• KNN-item-based recommender (linear
interpolation item based recommender)
• Cluster-based recommender
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
21
- 22. Implementation APIs
• APIs for distributed and in-core first
and second moment routines
• Distributed and local principal
component analysis (DSPCA and
SPCA) and stochastic singular value
decomposition (DSSVD and SSVD),
singular value decomposition (SVD),
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
22
- 23. Implementation APIs
• Distributed Cholesky QR (thinQR)
• Distributed regularized Alternating
Least Squares (DALS)
• Matrix factorization with alternating
least squares
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
23
- 24. Principal Component Analysis (PCA)
• Means a linear transformation method
• Finds the directions of maximum
variance in high-dimensional data
• Projects those data for transformation
onto a smaller dimensional subspace
while retaining most of the information
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
24
- 25. Principal Component Analysis (PCA)
• Identifies patterns in data sets
• Detect the correlation among the
variables
• Strong correlation: Try for reducing the
dimensionality
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
25
- 26. PCA Applications
• PCA applies in a number of use cases,
such as stock market predictions, and
the analysis of gene expression data.
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
26
- 27. 2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
27
Summary
We learnt:
• Mahout for Fast and Efficient
Processing of Big Data
• A scalable generalized tensor and linear
algebra solving engine, designed on top
of Apache Hadoop.
• Mahout implementations of Machine
Learning Algorithms
•
- 28. 2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
28
Summary
We learnt:
• Mahout for Similar item sets, Clustering,
Classification, Recommender
• PCA, SVD++, SGD
• Collaborative Filtering
•
- 29. 2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
29
End of Lesson 17 on
Mahout