1) The document outlines the tasks, tools, and topics explored by Vipul Divyanshu during a summer internship at India Innovation Labs, including data analytics on a medium-sized database and building a recommender engine.
2) Key tools explored include Mahout for machine learning algorithms, Hadoop for distributed processing, and Rush Analyzer (with KNIME) for data visualization and analytics.
3) Vipul implemented recommendation engines including user-based, item-based, and SlopeOne recommenders and evaluated performance using recommender evaluators.
Data Analytics
Project Documentation
Vipul Divyanshu
IIL/2012/14
Summer Internship
Mentor: Saish Kamat
India Innovation Labs
Tasks at hand:
*Data analytics on a medium-sized database
*Building a recommender engine for products
Tools and topics explored:
Mahout
Root
Hadoop
Data Rush
Rush Analyser (with KNIME)
Google Analytics engine
Analysis of the tools and what was explored:
Mahout: Mahout is an open-source machine learning library from Apache. The
algorithms it implements fall under the broad umbrella of machine learning, or
collective intelligence.
Mahout currently has:
Collaborative Filtering
User and Item based recommenders
K-Means, Fuzzy K-Means clustering
Mean Shift clustering
Dirichlet process clustering
Latent Dirichlet Allocation
Singular value decomposition
Parallel frequent pattern mining
Complementary Naive Bayes classifier
Random forest decision tree based classifier
High performance java collections (previously colt collections)
With this many features, sub-tools, and libraries to work with, Mahout is well
suited for self-designed data analytics programs.
Mahout's core libraries are also highly optimized, giving good performance even
for the non-distributed algorithms.
NOTE: For a good understanding of Mahout, the book 'Mahout in Action' is suggested.
ROOT: ROOT is an object-oriented framework aimed at solving the
data analysis challenges of high-energy physics.
Below, you can find a quick overview of the ROOT framework:
Save data. You can save your data (and any C++ object) in a compressed
binary form in a ROOT file. The object format is also saved in the same file.
ROOT provides a data structure that is extremely powerful for fast access to
huge amounts of data - orders of magnitude faster than any database.
Access data. Data saved into one or several ROOT files can be accessed
from your PC, from the web, and from large-scale file delivery systems used
e.g. in the GRID. ROOT trees spread over several files can be chained and
accessed as a single object, allowing for loops over huge amounts of data.
Process data. Powerful mathematical and statistical tools are provided to
operate on your data. The full power of a C++ application and of parallel
processing is available for any kind of data manipulation. Data can also
be generated following any statistical distribution, making it possible to
simulate complex systems.
Show results. Results are best shown with histograms, scatter plots,
fitting functions, etc. ROOT graphics may be adjusted in real time with a few
mouse clicks. High-quality plots can be saved in PDF or other formats.
Interactive or built application. You can use the CINT C++ interpreter or
Python for your interactive sessions and to write macros, or compile your
program to run at full speed. In both cases, you can also create a GUI.
Link to learn more about ROOT: http://root.cern.ch/drupal/
Link for the ROOT user's guide: http://root.cern.ch/download/doc/ROOTUsersGuide.pdf
Constraints of ROOT:
What was found is that ROOT concentrates more on displaying and graphically
presenting the collected data, and on representing computed (processed)
results in the form of canvases, histograms, and TGraphs. This can be used at a
later point in time to present the processed data in a well-defined and interactive manner.
Screenshot:
HADOOP:
Hadoop is an open source framework for writing and running distributed
applications that process large amounts of data across networks of machines.
Key distinctions of Hadoop are:
Accessible—Hadoop runs on large clusters of commodity machines or on cloud
computing services such as Amazon's Elastic Compute Cloud (EC2).
Robust—because it is intended to run on commodity hardware, Hadoop is architected
with the assumption of frequent hardware malfunctions. It can gracefully handle most
such failures.
Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the
cluster.
Simple—Hadoop allows users to quickly write efficient parallel code.
Link to explore more of Hadoop: http://hadoop.apache.org/
NOTE: For a good understanding of Hadoop, the book 'Hadoop in Action' is suggested.
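As an illustration of the "simple parallel code" point above, here is the classic word-count example written against Hadoop's MapReduce API, in the style of the tutorials of that era. This is a hedged sketch, not part of this project: the input/output paths are placeholders, and it needs the Hadoop jars on the classpath to compile and run.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The mapper emits (word, 1) for every token of its input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // The reducer adds up the 1s gathered for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("input"));   // placeholder path
    FileOutputFormat.setOutputPath(job, new Path("output")); // placeholder path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```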
Setting Up the Mahout development environment in Eclipse:
NOTE: The following explanation is for the Ubuntu (Linux) OS; it can also be
carried out on any other OS, such as Windows.
PREREQUISITES:
1. Java SDK 6u23 x64
2. Maven 3.0.2
3. Any updated Mahout library
4. An IDE (Eclipse was used)
5. Cygwin (in case of a Windows OS)
Running your first sample code:
Once all the above requirements are met we are ready to execute our
first sample code.
Step 1:
At first, start Eclipse and create a workspace. We take it as
Users/Vipul/workspace for the present.
Extract the source of Mahout below the workspace. It is
Users/Vipul/workspace/mahout-distribution-0.4 for the present.
Convert the Maven project of Mahout into an Eclipse project with the command below.
cd Users/Vipul/workspace/mahout-distribution-0.4
mvn eclipse:eclipse
Now set the Eclipse classpath variable M2_REPO to the Maven 2 local repository.
mvn -Declipse.workspace=<path-to-your-workspace> eclipse:add-maven-repo
But "Maven – Guide to using Eclipse with Maven 2.x" says "Issue: The
command does not work". So set it in Eclipse directly.
Open Window > Preferences > Java > Build Path > Classpath Variables
from Eclipse's menu.
Press "New" and add the Name as "M2_REPO" and the Path as the Maven 2
repository path (its default is .m2/repository in your user directory).
Finally import the converted Eclipse project of Mahout.
Open File > Import > General > Existing Projects into Workspace from
Eclipse menu.
Select the project directory Users/Vipul/workspace/mahout-distribution-0.4 and all projects.
NOTE: Now you need to have your first code ready to implement.
If so, proceed to Step 2.
Step 2:
At first, generate a Maven project for the sample codes in the Eclipse
workspace directory.
$ cd Users/Vipul/workspace
$ mvn archetype:create -DgroupId=mia.recommender -DartifactId=recommender
Do the following.
Delete the generated skeleton code src/main/App.java and copy your
code into src/main/java/mia/recommender of the 'recommender'
project.
Convert the Maven project into an Eclipse project.
$ cd Users/Vipul/workspace/recommender
$ mvn eclipse:eclipse
Import the project into Eclipse.
Open File > Import > General > Existing Projects into Workspace
from the Eclipse menu and select the 'recommender' project.
Then the 'recommender' project is available in the Eclipse workspace,
but all classes have errors because there is no Mahout library reference.
Right click the 'recommender' project, select Properties > Java Build Path >
Projects from the pop-up menu, click 'Add', and select the Mahout projects below.
mahout-core
mahout-examples
mahout-taste-webapp
mahout-math
mahout-utils
Then only 4 errors remain.
Since these are conflicts with updated APIs, correcting the errors requires
modifying the code. For example, open mia.recommender.ch03.IREvaluatorBooleanPrefIntro2
and press Ctrl+1 at an error line in it.
This error says that the code does not catch or declare the
TasteException which NearestNUserNeighborhood's constructor throws. You
can choose whichever solution you like in the pop-up menu; fix the other errors likewise.
The classes which have a main() function can be executed in Eclipse.
For example, select mia.recommender.ch02.RecommenderIntro and click Run >
Run in Eclipse's menu (or press Ctrl+F11 instead). It then throws an
exception: 'Exception in thread "main" java.io.FileNotFoundException:
intro.csv'.
To make it read the sample data file 'intro.csv' in src/mia/recommender/ch02, click
Run > Run Configurations in Eclipse's menu and select the configuration of
RecommenderIntro which was created by the above execution. Then set
mia/recommender/ch02 as the Working directory in the Arguments tab (see the
figure below): click the "Workspace..." button and select the directory.
Then it outputs a result like "RecommendedItem[item:104, value:4.257081]".
If you want to make a new project, repeat from the Maven project creation step.
RECOMMENDATION ENGINE:
Recommendation is all about predicting patterns of taste, and using them to discover
new and desirable things you didn't already know about. There are many types of
recommenders, such as:
GenericUserBasedRecommender
GenericItemBasedRecommender
SlopeOneRecommender
SVDRecommender
KnnItemBasedRecommender
The code for the first three was implemented; with more time in hand, the
other two and some more can be implemented as well.
NOTE: Every recommender needs data fed to it from a file, normally of
type .csv; don't forget to place it in the same folder as the pom file of
the project being built.
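For reference, a Mahout FileDataModel .csv is expected to hold one userID,itemID,preference triple per line. A tiny illustrative file (the IDs and ratings here are made up) might look like:

```
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
```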
THE USER-BASED RECOMMENDATION ENGINE
All the required details of the user-based recommender engine are given in the
book mentioned before. The output of my recommender is shown below:
The output of the above code can be observed in Eclipse.
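For reference, a minimal user-based recommender along the lines of the book's RecommenderIntro example could look like the following. The class names are from the Mahout 0.x Taste API; intro.csv is the data file discussed above, and the neighborhood size of 2 is just an illustrative choice:

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderIntro {
    public static void main(String[] args) throws Exception {
        // Load the userID,itemID,preference triples.
        DataModel model = new FileDataModel(new File("intro.csv"));
        // How similar are two users' tastes?
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Consider each user's 2 nearest neighbours.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Recommend 1 item for user 1.
        List<RecommendedItem> recommendations = recommender.recommend(1, 1);
        for (RecommendedItem item : recommendations) {
            System.out.println(item);
        }
    }
}
```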
THE ITEM-BASED RECOMMENDATION ENGINE:
It is similar to the user-based recommendation engine; the only difference is that
it finds the similarity between items instead of between users.
Note: For the above reason, it is better suited to the case where there is a
fast-growing list of users and a slower-growing product or item list.
The output of the item-based recommender code is:
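A corresponding item-based sketch differs mainly in that it needs only an item-item similarity and no user neighborhood. Again, the class names are from the Mahout 0.x Taste API, and the file name and counts are illustrative:

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ItemBasedIntro {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("intro.csv"));
        // Similarity between items rather than between users;
        // no user neighborhood is needed here.
        ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
        GenericItemBasedRecommender recommender =
                new GenericItemBasedRecommender(model, similarity);
        // Recommend 2 items for user 1.
        List<RecommendedItem> recommendations = recommender.recommend(1, 2);
        for (RecommendedItem item : recommendations) {
            System.out.println(item);
        }
    }
}
```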
THE SLOPE-ONE RECOMMENDATION ENGINE:
It is similar to the item-based recommendation engine, but has a pre-processing
stage, and the output is based on the relations between the different items.
The output of my code is:
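To make the pre-processing stage and the item-item relations concrete, here is a small self-contained sketch of the slope-one idea in plain Java. It is an illustration of the algorithm only, not Mahout's SlopeOneRecommender; the class and variable names are made up:

```java
import java.util.HashMap;
import java.util.Map;

public class SlopeOneSketch {

    // ratings: userID -> (itemID -> rating)
    private final Map<Integer, Map<Integer, Double>> ratings;
    // diff: itemI -> (itemJ -> average of (rating_I - rating_J) over co-raters)
    private final Map<Integer, Map<Integer, Double>> diff = new HashMap<>();
    // freq: itemI -> (itemJ -> number of users who rated both items)
    private final Map<Integer, Map<Integer, Integer>> freq = new HashMap<>();

    public SlopeOneSketch(Map<Integer, Map<Integer, Double>> ratings) {
        this.ratings = ratings;
        // Pre-processing stage: accumulate pairwise rating differences.
        for (Map<Integer, Double> user : ratings.values()) {
            for (Map.Entry<Integer, Double> a : user.entrySet()) {
                for (Map.Entry<Integer, Double> b : user.entrySet()) {
                    diff.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                        .merge(b.getKey(), a.getValue() - b.getValue(), Double::sum);
                    freq.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                        .merge(b.getKey(), 1, Integer::sum);
                }
            }
        }
        // Turn the accumulated sums into averages.
        for (Map.Entry<Integer, Map<Integer, Double>> e : diff.entrySet())
            for (Map.Entry<Integer, Double> d : e.getValue().entrySet())
                d.setValue(d.getValue() / freq.get(e.getKey()).get(d.getKey()));
    }

    // Predict a user's rating for an item from the items they already rated,
    // weighting each average difference by how many users support it.
    public double predict(int user, int item) {
        double weightedSum = 0;
        int totalWeight = 0;
        for (Map.Entry<Integer, Double> rated : ratings.get(user).entrySet()) {
            Map<Integer, Double> d = diff.get(item);
            if (rated.getKey() == item || d == null || !d.containsKey(rated.getKey()))
                continue;
            int n = freq.get(item).get(rated.getKey());
            weightedSum += (d.get(rated.getKey()) + rated.getValue()) * n;
            totalWeight += n;
        }
        return totalWeight == 0 ? Double.NaN : weightedSum / totalWeight;
    }
}
```

With a few users' ratings, the pre-computed average differences let the sketch predict an unseen rating as a weighted average of (difference + known rating) terms.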
THE EVALUATOR FOR THE RECOMMENDATION ENGINE:
There are many possible ways to evaluate the performance of a recommender
engine; I have explored the following:
RecommenderIRStatsEvaluator
AverageAbsoluteDifferenceRecommenderEvaluator
RMSRecommenderEvaluator
The first two of them were implemented.
AVERAGEABSOLUTEDIFFERENCERECOMMENDEREVALUATOR
It takes a part of the data as test data and the rest as training data,
recommends items for the test data, and the recommendations are then matched
against the real values of the test data. The output for my code is:
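What this evaluator reports can be sketched in a few lines of plain Java: the mean of the absolute differences between the estimated and actual preferences over the held-out ratings. This is a hypothetical helper class for illustration, not Mahout code:

```java
// Illustrative sketch of the metric AverageAbsoluteDifferenceRecommenderEvaluator
// reports: mean |estimated preference - actual preference| over held-out ratings.
public class AverageAbsoluteDifference {
    public static double score(double[] estimated, double[] actual) {
        if (estimated.length != actual.length || estimated.length == 0)
            throw new IllegalArgumentException("need equal-length, non-empty arrays");
        double sum = 0.0;
        for (int i = 0; i < estimated.length; i++) {
            sum += Math.abs(estimated[i] - actual[i]); // per-rating absolute error
        }
        return sum / estimated.length; // lower is better; 0.0 means perfect
    }
}
```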
Note: To test the above codes on a larger scale, we can download the
input files for them from: http://www.grouplens.org/node/12
Mahout is still in the development stage, and many more fields can be explored,
like clustering, network pattern learning, and classification.
Hadoop could be used with Mahout to implement a cluster, with MapReduce
used to process the data.
Rush Analyser (with KNIME):
This tool is also in Java, and Eclipse is needed. It was downloaded from the link:
http://bigdata.pervasive.com/Products/Download-Center.aspx
It is the graphical version of DataRush and is very handy for data analytics and
visualisation.
Here is a snapshot of my work, where I have loaded the 10K movie-rating data
downloaded from the test-data download link given above.
In the image, different nodes can be seen, used to perform different operations on the
data set.
This is the parallel plot of the data set.
This is the scatter plot generated for the same 10K data values scattered on the 2-D plane.
By using clustering blocks in the Rush Analyser, the data was analysed.
A few of the blocks explored by me are:
Regression
Classifiers
Recommender
Clustering
Filters
Data from different databases can be directly imported by the use of the Database
Reader block.
These are a few of the topics explored in Rush Analyser (an interactive DataRush tool).
And it is only the tip of the iceberg, as Rush Analyser has a lot more in store to be
explored. The given link can be referred to for exploring DataRush:
http://bigdata.pervasive.com/Products/Analytic-Engine-Pervasive-DataRush.aspx
The potential of DataRush is still to be explored for the project.
Thank You IIL:
Vipul Divyanshu
IIL/2012/14