Vipul Divyanshu: Mahout Documentation

Data Analytics Project Documentation
Vipul Divyanshu (IIL/2012/14), Summer Internship
Mentor: Saish Kamat, India Innovation Labs

Tasks at hand:
  - Data analytics on a medium-size database
  - Building a recommender engine for products

Tools and topics explored:
  - Mahout
  - ROOT
  - Hadoop
  - Data Rush
  - Rush Analyser (with KNIME)
  - Google Analytics engine

Analysis of the tools and what was explored:

MAHOUT:
Mahout is an open-source machine learning library from Apache. The algorithms it implements fall under the broad umbrella of machine learning, or collective intelligence. Mahout currently has:
  - Collaborative filtering
  - User- and item-based recommenders
  - K-Means and Fuzzy K-Means clustering
  - Mean Shift clustering
  - Dirichlet process clustering
  - Latent Dirichlet Allocation
  - Singular value decomposition
  - Parallel frequent pattern mining
  - Complementary Naive Bayes classifier
  - Random forest decision-tree-based classifier
  - High-performance Java collections (previously Colt collections)

The fact that Mahout has this many features, sub-tools and libraries to work with makes it the best-suited tool for self-designed data analytics programs. Mahout's core libraries are also highly optimized to give good performance, even for non-distributed algorithms.

NOTE: For a good understanding of Mahout, the book "Mahout in Action" is suggested.

ROOT:
ROOT is an object-oriented framework aimed at solving the data analysis challenges of high-energy physics. Below is a quick overview of the ROOT framework:

  - Save data. You can save your data (and any C++ object) in a compressed binary form in a ROOT file. The object format is also saved in the same file. ROOT provides a data structure that is extremely powerful for fast access of huge amounts of data - orders of magnitude faster than any database.
  - Access data. Data saved into one or several ROOT files can be accessed from your PC, from the web, and from large-scale file delivery systems such as those used in the GRID. ROOT trees spread over several files can be chained and accessed as a single object, allowing loops over huge amounts of data.
  - Process data. Powerful mathematical and statistical tools are provided to operate on your data. The full power of a C++ application and of parallel processing is available for any kind of data manipulation. Data can also be generated following any statistical distribution, making it possible to simulate complex systems.
  - Show results. Results are best shown with histograms, scatter plots, fitting functions, etc. ROOT graphics may be adjusted in real time with a few mouse clicks. High-quality plots can be saved in PDF or other formats.
  - Interactive or built application. You can use the CINT C++ interpreter or Python for your interactive sessions and to write macros, or compile your program to run at full speed. In both cases, you can also create a GUI.

Link to know more about ROOT:
Link for ROOT user's guide:

Analysis of ROOT:
What was found is that ROOT concentrates more on displaying and graphically presenting the collected data, and on representing computed (processed) results in the form of canvases, histograms and TGraphs. This can be used at a later point to present the processed data in a well-defined and interactive manner.
Screenshot:

HADOOP:
Hadoop is an open-source framework for writing and running distributed applications that process large amounts of data across networks of machines. Key distinctions of Hadoop are:

  - Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon's Elastic Compute Cloud (EC2).
  - Robust: because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
  - Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
  - Simple: Hadoop allows users to quickly write efficient parallel code.

Link to explore more on Hadoop:
NOTE: For a good understanding of Hadoop, the book "Hadoop in Action" is suggested.
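The map-reduce programming model behind Hadoop can be illustrated with a small single-process sketch (plain Java, no Hadoop dependency; the class and method names here are illustrative only, not Hadoop's API):

```java
import java.util.*;

// Single-process illustration of the map-reduce idea behind Hadoop:
// a map step emits (word, 1) pairs, a shuffle groups them by key,
// and a reduce step sums the counts per word.
public class WordCountSketch {

    // Map: split each input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle + reduce: group the pairs by word and sum the counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        return reduce(pairs);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            wordCount(Arrays.asList("hadoop runs jobs", "hadoop scales"));
        System.out.println(counts); // {hadoop=2, jobs=1, runs=1, scales=1}
    }
}
```

In a real Hadoop job the map and reduce steps run on many machines in parallel, with the framework handling the shuffle and any node failures; the data flow, however, is the same as in this sketch.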
Setting up the Mahout development environment in Eclipse:

NOTE: The following explanation is for Ubuntu (Linux); it can also be implemented on any other OS, such as Windows.

PREREQUISITES:
1. Java SDK 6u23 x64
2. Maven 3.0.2
3. Any updated Mahout library
4. An IDE (I used Eclipse)
5. Cygwin (in the case of a Windows OS)

Running your first sample code:
Once all the above requirements are met, we are ready to execute our first sample code.

Step 1:
First, start Eclipse and create a workspace; we take it as "Users/Vipul/workspace" for the present. Extract the Mahout source below the workspace; it is "Users/Vipul/workspace/mahout-distribution-0.4" for the present.

Convert the Maven project of Mahout into an Eclipse project with the commands below:

  $ cd Users/Vipul/workspace/mahout-distribution-0.4
  $ mvn eclipse:eclipse

Now set the classpath variable M2_REPO of Eclipse to the local Maven 2 repository:

  $ mvn -Declipse.workspace= eclipse:add-maven-repo

But "Maven - Guide to Using Eclipse with Maven 2.x" says "Issue: the command does not work", so set it directly in Eclipse instead: open Window > Preferences > Java > Build Path > Classpath Variables from Eclipse's menu, press "New", and add the name "M2_REPO" with the path of your Maven 2 repository (its default is .m2/repository in your user directory).

Finally, import the converted Eclipse project of Mahout: open File > Import > General > Existing Projects into Workspace from the Eclipse menu, then select the project directory Users/Vipul/workspace/mahout-distribution-0.6 and all projects.

NOTE: You now need to have your first code ready to be implemented. If so, proceed to Step 2.

Step 2:
First, generate a Maven project for the sample codes in the Eclipse workspace directory:

  $ cd Users/Vipul/workspace
  $ mvn archetype:create -DgroupId=mia.recommender -DartifactId=recommender

Then do the following: delete the generated skeleton code in src/main/ and copy the sample code into src/main/java/mia/recommender of the 'recommender' project. Convert the Maven project into an Eclipse project:

  $ cd Users/Vipul/workspace/recommender
  $ mvn eclipse:eclipse

Import the project into Eclipse: open File > Import > General > Existing Projects into Workspace from the Eclipse menu and select the 'recommender' project. The 'recommender' project is then available in the Eclipse workspace, but all classes have errors because there is no Mahout library reference yet.
Right-click the 'recommender' project, select Properties > Java Build Path > Projects from the pop-up menu, click 'Add', and select the Mahout projects below:

  - mahout-core
  - mahout-examples
  - mahout-taste-webapp
  - mahout-math
  - mahout-utils

Then only 4 errors remain.
Since these are conflicts with updated APIs, correcting the errors requires modifying the code. For example, open mia.recommender.ch03.IREvaluatorBooleanPrefIntro2 and press Ctrl+1 at the error line in it. This error says that the code does not catch or declare the TasteException which NearestNUserNeighborhood's constructor throws, so you can choose whichever solution you like from the pop-up menu. Fix the other errors in the same way.

The classes which have a main() function can be executed in Eclipse. For example, select mia.recommender.ch02.RecommenderIntro and click Run > Run in Eclipse's menu (or press Ctrl+F11 instead). It then throws an exception: 'Exception in thread "main" intro.csv'. To make it read the sample data file 'intro.csv' in src/mia/recommender/ch02, click Run > Run Configurations in Eclipse's menu and select the configuration of RecommenderIntro which was created by the above execution. Then set mia/recommender/ch02 as the Working directory in the Arguments tab (see the figure below): click the "Workspace..." button and select the directory.

It then outputs a result like "RecommendedItem[item:104, value:4.257081]". If you want to make a new project, repeat from the Maven project creation.
RECOMMENDATION ENGINE:
Recommendation is all about predicting patterns of taste, and using them to discover new and desirable things you didn't already know about. We have many types of recommenders, such as:

  - GenericUserBasedRecommender
  - GenericItemBasedRecommender
  - SlopeOneRecommender
  - SVDRecommender
  - KnnItemBasedRecommender

I have implemented the code for the first three; with more time in hand, the other two and some more can be implemented.

NOTE: To feed data to any recommender, we need a file, normally of type .csv, and we must not forget to place it in the same folder as the pom file of the current project being built.

THE USER-BASED RECOMMENDATION ENGINE:
All the required details of the user-based recommender engine are given in the book mentioned before. The output of my recommender is shown below; the output of the above code can be observed in Eclipse.
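The idea behind a user-based recommender can be shown with a minimal concept sketch (plain Java, not Mahout's GenericUserBasedRecommender; class, method names and the ratings are illustrative only). Similarity between users is Pearson correlation over co-rated items, and a prediction is a similarity-weighted average of the neighbours' ratings:

```java
import java.util.*;

// Concept sketch of user-based collaborative filtering (not Mahout code).
public class UserBasedSketch {

    // Pearson correlation over the items that both users have rated.
    static double pearson(Map<Integer, Double> a, Map<Integer, Double> b) {
        List<Integer> common = new ArrayList<>();
        for (Integer item : a.keySet()) if (b.containsKey(item)) common.add(item);
        int n = common.size();
        if (n == 0) return 0.0;
        double sumA = 0, sumB = 0;
        for (int item : common) { sumA += a.get(item); sumB += b.get(item); }
        double meanA = sumA / n, meanB = sumB / n;
        double num = 0, denA = 0, denB = 0;
        for (int item : common) {
            double da = a.get(item) - meanA, db = b.get(item) - meanB;
            num += da * db; denA += da * da; denB += db * db;
        }
        if (denA == 0 || denB == 0) return 0.0;
        return num / Math.sqrt(denA * denB);
    }

    // Predict a user's rating of an item from the other users who rated it.
    static double predict(Map<Integer, Map<Integer, Double>> ratings,
                          int user, int item) {
        double weighted = 0, totalSim = 0;
        for (Map.Entry<Integer, Map<Integer, Double>> e : ratings.entrySet()) {
            if (e.getKey() == user || !e.getValue().containsKey(item)) continue;
            double sim = pearson(ratings.get(user), e.getValue());
            if (sim <= 0) continue;          // ignore dissimilar users
            weighted += sim * e.getValue().get(item);
            totalSim += sim;
        }
        return totalSim == 0 ? Double.NaN : weighted / totalSim;
    }

    public static void main(String[] args) {
        Map<Integer, Map<Integer, Double>> ratings = new HashMap<>();
        ratings.put(1, Map.of(10, 5.0, 11, 3.0));               // user 1
        ratings.put(2, Map.of(10, 5.0, 11, 3.0, 12, 4.0));      // user 2
        System.out.println(predict(ratings, 1, 12)); // 4.0
    }
}
```

Mahout's version additionally restricts the averaging to a fixed-size neighbourhood (e.g. NearestNUserNeighborhood) and is optimized for much larger data, but the scoring idea is the same.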
THE ITEM-BASED RECOMMENDATION ENGINE:
It is similar to the user-based recommendation engine; the only difference is that it finds the similarity between items instead of between users.

Note: For the above reason, it is better suited to the case where there is a fast-growing list of users and a slower-growing list of products or items.

The output of the item-based recommender code is:

THE SLOPE-ONE RECOMMENDATION ENGINE:
It is similar to the item-based recommendation engine but has a pre-processing stage, and the output is based on the relations between the different items.

The output of my code is:
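The Slope-One pre-processing stage mentioned above can be sketched in a few lines (plain Java, not Mahout's SlopeOneRecommender; the names and the tiny data set are illustrative only). The pre-processing computes the average rating difference between every pair of items; a prediction for item j is then the average, over the items i the user has rated, of rating(i) + diff(j, i):

```java
import java.util.*;

// Concept sketch of the Slope-One scheme (not Mahout code).
public class SlopeOneSketch {

    // Average difference (item j minus item i) over users who rated both.
    static double diff(List<Map<Integer, Double>> ratings, int j, int i) {
        double sum = 0; int n = 0;
        for (Map<Integer, Double> user : ratings) {
            if (user.containsKey(j) && user.containsKey(i)) {
                sum += user.get(j) - user.get(i);
                n++;
            }
        }
        return n == 0 ? 0.0 : sum / n;
    }

    // Predict the given user's rating of item j.
    static double predict(List<Map<Integer, Double>> ratings,
                          Map<Integer, Double> user, int j) {
        double sum = 0; int n = 0;
        for (int i : user.keySet()) {
            if (i == j) continue;
            sum += user.get(i) + diff(ratings, j, i);
            n++;
        }
        return n == 0 ? Double.NaN : sum / n;
    }

    public static void main(String[] args) {
        // User A rated items 1 and 2; user B rated only item 1.
        List<Map<Integer, Double>> ratings = Arrays.asList(
            Map.of(1, 1.0, 2, 1.5),
            Map.of(1, 2.0));
        // Item 2 averages 0.5 higher than item 1, so B's prediction is 2.5.
        System.out.println(predict(ratings, ratings.get(1), 2)); // 2.5
    }
}
```

In practice the diffs are precomputed once and stored, which is exactly the pre-processing stage that distinguishes Slope-One from the plain item-based engine.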
THE EVALUATOR FOR THE RECOMMENDATION ENGINE:
There are many possible ways to evaluate the performance of a recommender engine; I have explored the following:

  - RecommenderIRStatsEvaluator
  - AverageAbsoluteDifferenceRecommenderEvaluator
  - RMSRecommenderEvaluator

I have implemented the first two of them.

AVERAGEABSOLUTEDIFFERENCERECOMMENDEREVALUATOR:
It takes a part of the data as test data and the rest as training data, recommends items for the test data, and the results are later matched against the real values of the test data. The output of my code is:
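The metric this evaluator computes is the mean absolute error between held-out ratings and the recommender's predictions. A minimal sketch (plain Java, not Mahout's evaluator; the keys and example values are illustrative only):

```java
import java.util.*;

// Concept sketch of average-absolute-difference (MAE) evaluation
// (not Mahout code): held-out test ratings are compared against the
// recommender's predictions and the absolute errors are averaged.
public class MaeEvaluatorSketch {

    // Predictions and actual ratings, keyed here by a "user:item" string.
    static double meanAbsoluteError(Map<String, Double> predicted,
                                    Map<String, Double> actual) {
        double total = 0; int n = 0;
        for (Map.Entry<String, Double> e : actual.entrySet()) {
            Double p = predicted.get(e.getKey());
            if (p == null) continue;   // recommender could not predict
            total += Math.abs(p - e.getValue());
            n++;
        }
        return n == 0 ? Double.NaN : total / n;
    }

    public static void main(String[] args) {
        Map<String, Double> predicted = Map.of("u1:i1", 4.0, "u1:i2", 2.5);
        Map<String, Double> actual    = Map.of("u1:i1", 5.0, "u1:i2", 2.0);
        // |4.0 - 5.0| = 1.0 and |2.5 - 2.0| = 0.5, so the mean is 0.75.
        System.out.println(meanAbsoluteError(predicted, actual)); // 0.75
    }
}
```

A lower score is better, with 0.0 meaning the recommender reproduced the held-out ratings exactly.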
RECOMMENDERIRSTATSEVALUATOR:
This evaluator computes the recall and precision of the recommender and gives their values as the output. The output of the evaluator code is:
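Precision and recall reduce to two set ratios, which can be sketched as follows (plain Java, not Mahout's evaluator; the item IDs are illustrative only). Precision is the fraction of recommended items that are relevant; recall is the fraction of relevant items that were recommended:

```java
import java.util.*;

// Concept sketch of precision/recall evaluation (not Mahout code).
public class IrStatsSketch {

    // Fraction of recommended items that are actually relevant.
    static double precision(Set<Integer> recommended, Set<Integer> relevant) {
        if (recommended.isEmpty()) return 0.0;
        long hits = recommended.stream().filter(relevant::contains).count();
        return (double) hits / recommended.size();
    }

    // Fraction of relevant items that were actually recommended.
    static double recall(Set<Integer> recommended, Set<Integer> relevant) {
        if (relevant.isEmpty()) return 0.0;
        long hits = recommended.stream().filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }

    public static void main(String[] args) {
        Set<Integer> recommended = Set.of(1, 2, 3, 4);
        Set<Integer> relevant = Set.of(2, 4, 5);
        System.out.println(precision(recommended, relevant)); // 0.5
        System.out.println(recall(recommended, relevant));    // ≈ 0.667
    }
}
```

Mahout's evaluator builds the "relevant" set per user from that user's own top-rated held-out items, but the two ratios it reports are the ones above.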
Note: To test the above codes on a larger scale, we can download larger input files for them. Mahout is still at a development stage, and many more fields can be explored, like clustering, network pattern learning and classification. Hadoop could be used with Mahout to implement a cluster and map-reduce to process the data.

Rush Analyser (with KNIME):
This tool is also in Java, and Eclipse is needed. It is the graphical version of DataRush and is very handy in terms of data analytics and visualisation. Here is a snapshot of my work, where I have loaded the 10K movie-rating data downloaded from the test-data download link given.
In the image, the different nodes used to perform different operations on the data set can be seen. This is the parallel plot of the data set.
This is the scatter plot generated for the same 10K data values, scattered on the 2-D plane. By use of the clustering blocks in Rush Analyser, the data was analysed. A few of the blocks explored by me are:

  - Regression
  - Classifiers
  - Recommender
  - Clustering
  - Filters

Data from different databases can be directly imported by use of the Database Reader block. These are a few of the topics explored in Rush Analyser (an interactive DataRush tool).
And it is only the tip of the iceberg, as Rush Analyser has a lot more in store to be explored. The given links could be referred to for exploring DataRush further. The full potential of DataRush is still to be explored for the project.

Thank you, IIL.
Vipul Divyanshu
IIL/2012/14