SlideShare a Scribd company logo
1 of 29
Download to read offline
Lesson 17
Mahout
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
1
Mahout
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
2
• Fast and Efficient Processing of Big
Data
• Processes very large datasets at the
cluster of machines with high efficiency
for above 10 M data points in shared-
nothing environment.
• Big Data analysis: datasets with over 1
million data points
Mahout
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
3
• Can run on multiple machines as well as
multi-core processing units
• Compute tasks fast when distributed
over the cloud, tasks runs parallel, and
run in shared nothing-computational
environment of multiple machines and
multi-core units
Mahout in sequential shared
environment
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
4
• Higher time efficiency for less than 1 M
data points.
Mahout Eclipse and Maven
Installing
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
5
• Download Maven and Eclipse
https://www.eclipse.org/downloads/
Add -Name: m2eclipse-Location: http://
download.eclipse.org/technology/m2e/rele
ases (Google “install m2eclipse”)
• Ubuntu: sudo apt-get install git
sudo apt-get install maven
Mahout Installing
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
6
• http://mahout.apache.org/general/
downloads.html
Latest Release: 0.13.0 - mahout-
distribution-0.13.tar.gz
• Apache Preparing for Mahout version
0.14.0 (June 2018)
Git Hub
• GitHub: Mahout Mirror
https://github.com/apache/mahout
• git://git.apache.org/mahout.git mahout-
trunk
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
7
Git Hub Maven under mahout-trunk
folder:
• <dependency>
• <groupId>org.apache.mahout</groupId>
• <artifactId>mahout-core</artifactId>
• <version>${mahout.version}</version>
• </dependency>
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
8
Mahout Environment Setting
• export
MAHOUT_HOME=/path/to/mahout
• export MAHOUT_LOCAL=true # for
running standalone on your dev
machine,
• # unset MAHOUT_LOCAL for running
on a cluster
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
9
Feature in Mahout 0.13.0 (2017)
• Enables easier implementations of most
modern machine learning and deep
learning algorithms
• Open-source distributed scalable linear
algebra library ViennaCL, the Java
wrapper library interface JavaCPP
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
10
Feature in Mahout 0.13.0 (2017)
• The graphics processor manufacturer,
NVIDIA CUDA bindings directly into
Mahout
• Easier to run matrix mathematics on
graphics cards (used in computers for
fast graphic computations)
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
11
Java Implementations
• Java libraries for common mathematics
and statistical operations (focused on
linear algebra) and primitive Java
Collection Interfaces
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
12
Figure 6.21 Mahout architecture
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
13
Features of Mahout
• Designed on top of Apache Hadoop
• Supports algebraic platforms like fast
computing Apache Spark paradigm and
MapReduce
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
14
Features of Mahout
• A scalable generalized tensor and linear
algebra solving engine, designed on top
of Apache Hadoop.
• Supports algebraic platforms like fast
computing Apache Spark paradigm and
MapReduce
• Distributed row matrix (DRM), scalable
libraries for matrix and vectors
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
15
Random Access Feature
• Random access means accessing vectors
using key, index or hash followed by
values, in any order
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
16
Clustering Implementations
• Contains several Spark and MapReduce
enabled, such as k-means, fuzzy k-
means, Canopy, Dirichlet and mean
shift, Latent Dirichlet Allocation,
• Spectral Clustering, MultiHash
Clustering
• Hierarchical clustering
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
17
Classification Implementations
• Single Machine Sequential: Stochastic
Gradient Descent (SGD), Logistic
Regression trained via SGD, Hidden
Markov, Multi-layer Perceptron
• Parallel Distributed (Map Reduce):
• Naïve Bayes, complementary Naïve
Bayes and Random Forest
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
18
Collaborative Filtering
• Enables making automatic predictions
about items and item sets of interests
• Similar item-sets mining
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
19
SVD++ and SGD
• Weighted matrix factorization SVD++,
and parallel SGD (in sequentially in
shared-data environment)
• SGD [an iterative learning algorithm in
which each training example is used to
pull the model M program slightly to
reach more closer to correct answer]
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
20
Recommender Implementation
• Collaborative filtering based
recommender,
• SVD recommender
• KNN-item-based recommender (linear
interpolation item based recommender)
• Cluster-based recommender
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
21
Implementation APIs
• APIs for distributed and in-core first
and second moment routines
• Distributed and local principal
component analysis (DSPCA and
SPCA) and stochastic singular value
decomposition (DSSVD and SSVD),
singular value decomposition (SVD),
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
22
Implementation APIs
• Distributed Cholesky QR (thinQR)
• Distributed regularized Alternating
Least Squares (DALS)
• Matrix factorization with alternating
least squares
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
23
Principal Component Analysis (PCA)
• Means a linear transformation method
• Finds the directions of maximum
variance in high-dimensional data
• Projects those data for transformation
onto a smaller dimensional subspace
while retaining most of the information
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
24
Principal Component Analysis (PCA)
• Identifies patterns in data sets
• Detect the correlation among the
variables
• Strong correlation: Try for reducing the
dimensionality
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
25
PCA Applications
• PCA applies in a number of use cases,
such as stock market predictions, and
the analysis of gene expression data.
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
26
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
27
Summary
We learnt:
• Mahout for Fast and Efficient
Processing of Big Data
• A scalable generalized tensor and linear
algebra solving engine, designed on top
of Apache Hadoop.
• Mahout implementations of Machine
Learning Algorithms
•
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
28
Summary
We learnt:
• Mahout for Similar item sets, Clustering,
Classification, Recommender
• PCA, SVD++, SGD
• Collaborative Filtering
•
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
29
End of Lesson 17 on
Mahout

More Related Content

Similar to Big Data Analytics is not something which was just invented yesterday!

Scaling Up Presentation
Scaling Up PresentationScaling Up Presentation
Scaling Up Presentation
Jiaqi Xie
 
Real time machine learning
Real time machine learningReal time machine learning
Real time machine learning
Vinoth Kannan
 

Similar to Big Data Analytics is not something which was just invented yesterday! (20)

An introduction to Machine Learning with scikit-learn (October 2018)
An introduction to Machine Learning with scikit-learn (October 2018)An introduction to Machine Learning with scikit-learn (October 2018)
An introduction to Machine Learning with scikit-learn (October 2018)
 
big data
big databig data
big data
 
BigData Analytics
BigData AnalyticsBigData Analytics
BigData Analytics
 
Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...
Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...
Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...
 
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - TrivadisTechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
 
Machinelearning: The next step in manufacturing performance
Machinelearning: The next step in manufacturing performance Machinelearning: The next step in manufacturing performance
Machinelearning: The next step in manufacturing performance
 
Smart Data Module 6 d drive the future
Smart Data Module 6 d drive the futureSmart Data Module 6 d drive the future
Smart Data Module 6 d drive the future
 
Scaling Up Presentation
Scaling Up PresentationScaling Up Presentation
Scaling Up Presentation
 
QuSandbox+NVIDIA Rapids
QuSandbox+NVIDIA RapidsQuSandbox+NVIDIA Rapids
QuSandbox+NVIDIA Rapids
 
Introducing Amazon SageMaker - AWS Online Tech Talks
Introducing Amazon SageMaker - AWS Online Tech TalksIntroducing Amazon SageMaker - AWS Online Tech Talks
Introducing Amazon SageMaker - AWS Online Tech Talks
 
Synthetic data in finance
Synthetic data in financeSynthetic data in finance
Synthetic data in finance
 
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
 
Real time machine learning
Real time machine learningReal time machine learning
Real time machine learning
 
Using dask for large systems of financial models
Using dask for large systems of financial modelsUsing dask for large systems of financial models
Using dask for large systems of financial models
 
Automated machine learning - Global AI night 2019
Automated machine learning - Global AI night 2019Automated machine learning - Global AI night 2019
Automated machine learning - Global AI night 2019
 
Ai in finance
Ai in financeAi in finance
Ai in finance
 
Certified Big Data Science Analyst (CBDSA)
Certified Big Data Science Analyst (CBDSA)Certified Big Data Science Analyst (CBDSA)
Certified Big Data Science Analyst (CBDSA)
 
An Analytics Platform for Connected Vehicles
An Analytics Platform for Connected VehiclesAn Analytics Platform for Connected Vehicles
An Analytics Platform for Connected Vehicles
 
Synthetic data in finance
Synthetic data in financeSynthetic data in finance
Synthetic data in finance
 
The Rise of Engineering-Driven Analytics by Loren Shure
The Rise of Engineering-Driven Analytics by Loren ShureThe Rise of Engineering-Driven Analytics by Loren Shure
The Rise of Engineering-Driven Analytics by Loren Shure
 

More from bhargavi804095

Chapter24.pptx big data systems power point ppt
Chapter24.pptx big data systems power point pptChapter24.pptx big data systems power point ppt
Chapter24.pptx big data systems power point ppt
bhargavi804095
 
power point presentation on pig -hadoop framework
power point presentation on pig -hadoop frameworkpower point presentation on pig -hadoop framework
power point presentation on pig -hadoop framework
bhargavi804095
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
bhargavi804095
 

More from bhargavi804095 (16)

Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
 
C++ was developed by Bjarne Stroustrup, as an extension to the C language. cp...
C++ was developed by Bjarne Stroustrup, as an extension to the C language. cp...C++ was developed by Bjarne Stroustrup, as an extension to the C language. cp...
C++ was developed by Bjarne Stroustrup, as an extension to the C language. cp...
 
A File is a collection of data stored in the secondary memory. So far data wa...
A File is a collection of data stored in the secondary memory. So far data wa...A File is a collection of data stored in the secondary memory. So far data wa...
A File is a collection of data stored in the secondary memory. So far data wa...
 
C++ helps you to format the I/O operations like determining the number of dig...
C++ helps you to format the I/O operations like determining the number of dig...C++ helps you to format the I/O operations like determining the number of dig...
C++ helps you to format the I/O operations like determining the number of dig...
 
While writing program in any language, you need to use various variables to s...
While writing program in any language, you need to use various variables to s...While writing program in any language, you need to use various variables to s...
While writing program in any language, you need to use various variables to s...
 
Python is a high-level, general-purpose programming language. Its design phil...
Python is a high-level, general-purpose programming language. Its design phil...Python is a high-level, general-purpose programming language. Its design phil...
Python is a high-level, general-purpose programming language. Its design phil...
 
cpp-streams.ppt,C++ is the top choice of many programmers for creating powerf...
cpp-streams.ppt,C++ is the top choice of many programmers for creating powerf...cpp-streams.ppt,C++ is the top choice of many programmers for creating powerf...
cpp-streams.ppt,C++ is the top choice of many programmers for creating powerf...
 
Graphs in data structures are non-linear data structures made up of a finite ...
Graphs in data structures are non-linear data structures made up of a finite ...Graphs in data structures are non-linear data structures made up of a finite ...
Graphs in data structures are non-linear data structures made up of a finite ...
 
power point presentation to show oops with python.pptx
power point presentation to show oops with python.pptxpower point presentation to show oops with python.pptx
power point presentation to show oops with python.pptx
 
Lecture4_Method_overloading power point presentaion
Lecture4_Method_overloading power point presentaionLecture4_Method_overloading power point presentaion
Lecture4_Method_overloading power point presentaion
 
Lecture5_Method_overloading_Final power point presentation
Lecture5_Method_overloading_Final power point presentationLecture5_Method_overloading_Final power point presentation
Lecture5_Method_overloading_Final power point presentation
 
power point presentation on object oriented programming functions concepts
power point presentation on object oriented programming functions conceptspower point presentation on object oriented programming functions concepts
power point presentation on object oriented programming functions concepts
 
THE C PROGRAMMING LANGUAGE PPT CONTAINS THE BASICS OF C
THE C PROGRAMMING LANGUAGE PPT CONTAINS THE BASICS OF CTHE C PROGRAMMING LANGUAGE PPT CONTAINS THE BASICS OF C
THE C PROGRAMMING LANGUAGE PPT CONTAINS THE BASICS OF C
 
Chapter24.pptx big data systems power point ppt
Chapter24.pptx big data systems power point pptChapter24.pptx big data systems power point ppt
Chapter24.pptx big data systems power point ppt
 
power point presentation on pig -hadoop framework
power point presentation on pig -hadoop frameworkpower point presentation on pig -hadoop framework
power point presentation on pig -hadoop framework
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
 

Recently uploaded

Performance enhancement of machine learning algorithm for breast cancer diagn...
Performance enhancement of machine learning algorithm for breast cancer diagn...Performance enhancement of machine learning algorithm for breast cancer diagn...
Performance enhancement of machine learning algorithm for breast cancer diagn...
IJECEIAES
 
Final DBMS Manual (2).pdf final lab manual
Final DBMS Manual (2).pdf final lab manualFinal DBMS Manual (2).pdf final lab manual
Final DBMS Manual (2).pdf final lab manual
BalamuruganV28
 

Recently uploaded (20)

Autodesk Construction Cloud (Autodesk Build).pptx
Autodesk Construction Cloud (Autodesk Build).pptxAutodesk Construction Cloud (Autodesk Build).pptx
Autodesk Construction Cloud (Autodesk Build).pptx
 
analog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptxanalog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptx
 
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdflitvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
 
Geometric constructions Engineering Drawing.pdf
Geometric constructions Engineering Drawing.pdfGeometric constructions Engineering Drawing.pdf
Geometric constructions Engineering Drawing.pdf
 
Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1
 
Low Altitude Air Defense (LAAD) Gunner’s Handbook
Low Altitude Air Defense (LAAD) Gunner’s HandbookLow Altitude Air Defense (LAAD) Gunner’s Handbook
Low Altitude Air Defense (LAAD) Gunner’s Handbook
 
Filters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsFilters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility Applications
 
UNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptxUNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptx
 
Diploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdfDiploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdf
 
NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024
NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024
NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024
 
Performance enhancement of machine learning algorithm for breast cancer diagn...
Performance enhancement of machine learning algorithm for breast cancer diagn...Performance enhancement of machine learning algorithm for breast cancer diagn...
Performance enhancement of machine learning algorithm for breast cancer diagn...
 
8th International Conference on Soft Computing, Mathematics and Control (SMC ...
8th International Conference on Soft Computing, Mathematics and Control (SMC ...8th International Conference on Soft Computing, Mathematics and Control (SMC ...
8th International Conference on Soft Computing, Mathematics and Control (SMC ...
 
Augmented Reality (AR) with Augin Software.pptx
Augmented Reality (AR) with Augin Software.pptxAugmented Reality (AR) with Augin Software.pptx
Augmented Reality (AR) with Augin Software.pptx
 
Software Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdfSoftware Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdf
 
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdfInstruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
 
Final DBMS Manual (2).pdf final lab manual
Final DBMS Manual (2).pdf final lab manualFinal DBMS Manual (2).pdf final lab manual
Final DBMS Manual (2).pdf final lab manual
 
5G and 6G refer to generations of mobile network technology, each representin...
5G and 6G refer to generations of mobile network technology, each representin...5G and 6G refer to generations of mobile network technology, each representin...
5G and 6G refer to generations of mobile network technology, each representin...
 
SLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptxSLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptx
 
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdfInvolute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
 
Worksharing and 3D Modeling with Revit.pptx
Worksharing and 3D Modeling with Revit.pptxWorksharing and 3D Modeling with Revit.pptx
Worksharing and 3D Modeling with Revit.pptx
 

Big Data Analytics is not something which was just invented yesterday!

  • 1. Lesson 17 Mahout 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 1
  • 2. Mahout 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 2 • Fast and Efficient Processing of Big Data • Processes very large datasets at the cluster of machines with high efficiency for above 10 M data points in shared- nothing environment. • Big Data analysis: datasets with over 1 million data points
  • 3. Mahout 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 3 • Can run on multiple machines as well as multi-core processing units • Compute tasks fast when distributed over the cloud, tasks runs parallel, and run in shared nothing-computational environment of multiple machines and multi-core units
  • 4. Mahout in sequential shared environment 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 4 • Higher time efficiency for less than 1 M data points.
  • 5. Mahout Eclipse and Maven Installing 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 5 • Download Maven and Eclipse https://www.eclipse.org/downloads/ Add -Name: m2eclipse-Location: http:// download.eclipse.org/technology/m2e/rele ases (Google “install m2eclipse”) • Ubuntu: sudo apt-get install git sudo apt-get install maven
  • 6. Mahout Installing 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 6 • http://mahout.apache.org/general/ downloads.html Latest Release: 0.13.0 - mahout- distribution-0.13.tar.gz • Apache Preparing for Mahout version 0.14.0 (June 2018)
  • 7. Git Hub • GitHub: Mahout Mirror https://github.com/apache/mahout • git://git.apache.org/mahout.git mahout- trunk 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 7
  • 8. Git Hub Maven under mahout-trunk folder: • <dependency> • <groupId>org.apache.mahout</groupId> • <artifactId>mahout-core</artifactId> • <version>${mahout.version}</version> • </dependency> 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 8
  • 9. Mahout Environment Setting • export MAHOUT_HOME=/path/to/mahout • export MAHOUT_LOCAL=true # for running standalone on your dev machine, • # unset MAHOUT_LOCAL for running on a cluster 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 9
  • 10. Feature in Mahout 0.13.0 (2017) • Enables easier implementations of most modern machine learning and deep learning algorithms • Open-source distributed scalable linear algebra library ViennaCL, the Java wrapper library interface JavaCPP 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 10
  • 11. Feature in Mahout 0.13.0 (2017) • The graphics processor manufacturer, NVIDIA CUDA bindings directly into Mahout • Easier to run matrix mathematics on graphics cards (used in computers for fast graphic computations) 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 11
  • 12. Java Implementations • Java libraries for common mathematics and statistical operations (focused on linear algebra) and primitive Java Collection Interfaces 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 12
  • 13. Figure 6.21 Mahout architecture 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 13
  • 14. Features of Mahout • Designed on top of Apache Hadoop • Supports algebraic platforms like fast computing Apache Spark paradigm and MapReduce 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 14
  • 15. Features of Mahout • A scalable generalized tensor and linear algebra solving engine, designed on top of Apache Hadoop. • Supports algebraic platforms like fast computing Apache Spark paradigm and MapReduce • Distributed row matrix (DRM), scalable libraries for matrix and vectors 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 15
  • 16. Random Access Feature • Random access means accessing vectors using key, index or hash followed by values, in any order 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 16
  • 17. Clustering Implementations • Contains several Spark and MapReduce enabled, such as k-means, fuzzy k- means, Canopy, Dirichlet and mean shift, Latent Dirichlet Allocation, • Spectral Clustering, MultiHash Clustering • Hierarchical clustering 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 17
  • 18. Classification Implementations • Single Machine Sequential: Stochastic Gradient Descent (SGD), Logistic Regression trained via SGD, Hidden Markov, Multi-layer Perceptron • Parallel Distributed (Map Reduce): • Naïve Bayes, complementary Naïve Bayes and Random Forest 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 18
  • 19. Collaborative Filtering • Enables making automatic predictions about items and item sets of interests • Similar item-sets mining 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 19
  • 20. SVD++ and SGD • Weighted matrix factorization SVD++, and parallel SGD (in sequentially in shared-data environment) • SGD [an iterative learning algorithm in which each training example is used to pull the model M program slightly to reach more closer to correct answer] 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 20
  • 21. Recommender Implementation • Collaborative filtering based recommender, • SVD recommender • KNN-item-based recommender (linear interpolation item based recommender) • Cluster-based recommender 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 21
  • 22. Implementation APIs • APIs for distributed and in-core first and second moment routines • Distributed and local principal component analysis (DSPCA and SPCA) and stochastic singular value decomposition (DSSVD and SSVD), singular value decomposition (SVD), 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 22
  • 23. Implementation APIs • Distributed Cholesky QR (thinQR) • Distributed regularized Alternating Least Squares (DALS) • Matrix factorization with alternating least squares 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 23
  • 24. Principal Component Analysis (PCA) • Means a linear transformation method • Finds the directions of maximum variance in high-dimensional data • Projects those data for transformation onto a smaller dimensional subspace while retaining most of the information 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 24
  • 25. Principal Component Analysis (PCA) • Identifies patterns in data sets • Detect the correlation among the variables • Strong correlation: Try for reducing the dimensionality 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 25
  • 26. PCA Applications • PCA applies in a number of use cases, such as stock market predictions, and the analysis of gene expression data. 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 26
  • 27. 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 27 Summary We learnt: • Mahout for Fast and Efficient Processing of Big Data • A scalable generalized tensor and linear algebra solving engine, designed on top of Apache Hadoop. • Mahout implementations of Machine Learning Algorithms •
  • 28. 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 28 Summary We learnt: • Mahout for Similar item sets, Clustering, Classification, Recommender • PCA, SVD++, SGD • Collaborative Filtering •
  • 29. 2019 “Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics, Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India 29 End of Lesson 17 on Mahout