Random Forest using Apache Mahout

  1. CS 267: Data Mining Presentation. Guided by: Dr. Tran. By: Gaurav Kasliwal
  2. Outline
      RandomForest Model
      Mahout Overview
      RandomForest using Mahout
      Problem Description
      Working Environment
      Data Preparation
      ML Model Generation
      Demo
      Using Gini Index
  3. RandomForest Model
      Random forests are an ensemble learning method for classification that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees.
      Developed by Leo Breiman and Adele Cutler.
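      As a minimal illustration of the voting step (a Python sketch of the general idea, not Mahout code; the stub trees are hypothetical):

        from collections import Counter

        def forest_predict(trees, x):
            # Each tree casts a vote; the forest returns the most common class.
            votes = [tree(x) for tree in trees]
            return Counter(votes).most_common(1)[0][0]

        # Hypothetical stub "trees" standing in for real decision trees.
        trees = [lambda x: 1, lambda x: 0, lambda x: 1]
        print(forest_predict(trees, x=None))  # -> 1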
  4. Mahout
      Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop using the MapReduce paradigm.
      Scales to large data sets.
  5. RandomForest using Mahout
      Generate a file descriptor for the dataset.
      Build a decision forest model from the training data.
      Use the model to classify the test data and collect the results.
      Tune the model parameters to get better results.
  6. Problem Definition
      Benchmark a machine learning model for ranking pages (Yahoo! Learning to Rank challenge).
      Train data: 34,815 records
      Test data: 130,166 records
      Data description: {R} | {q_id} | {List: feature_id -> feature_value}
       where R ∈ {0, 1, 2, 3, 4} is the relevance label, q_id is the query id (a number), feature_id is a number, and feature_value ranges from 0 to 1.
  7. Working Environment
      Ubuntu
      Hadoop 1.2.1
      Mahout 0.9
  8. Prepare Dataset
      Take the data from the input text file and convert it to a .csv file (a sketch of this step follows).
      Make a directory in HDFS and upload train.csv and test.csv to it:
       #hadoop fs -put train.csv final_data
       #hadoop fs -put test.csv final_data
       #hadoop fs -ls final_data (verify with ls)
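      A minimal sketch of the text-to-CSV conversion in Python, assuming an SVMlight-style layout ("R qid:Q fid:val ...") for the input; the file names, the 702-column width (matching the -d 702 N L descriptor on the next slide), and zero-filling absent features are all assumptions:

        import csv

        NUM_ATTRS = 702  # assumed from the "-d 702 N L" descriptor used later

        def to_csv(src_path, dst_path):
            # Convert "R qid:Q fid:val ..." lines into dense CSV rows:
            # 702 numeric attributes first, relevance label R last.
            with open(src_path) as src, open(dst_path, "w", newline="") as dst:
                writer = csv.writer(dst)
                for line in src:
                    parts = line.split()
                    label = parts[0]
                    row = [0.0] * NUM_ATTRS  # features absent from a record stay 0
                    for tok in parts[1:]:
                        key, val = tok.split(":")
                        if key != "qid":  # drop the query id in this sketch
                            row[int(key) - 1] = float(val)
                    writer.writerow(row + [label])

        to_csv("train.txt", "train.csv")
        to_csv("test.txt", "test.csv")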
  9. Using Mahout
      Generate the metadata (file descriptor):
       #hadoop jar mahout-core-0.9-job.jar org.apache.mahout.classifier.df.tools.Describe -p final_data/train.csv -f final_data/train.info1 -d 702 N L
      Here -d 702 N L declares 702 numerical attributes followed by the label.
      This creates the metadata file train.info1 in the final_data folder.
  10. Create Model
      Build the forest:
       #hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d final_data/train.csv -ds final_data/train.info1 -sl 5 -p -t 100 -o final-forest
      -sl 5 selects 5 random variables at each tree node, -t 100 grows 100 trees, -p uses the partial (MapReduce) implementation, and -o names the output folder.
  11. Test Model
      Classify the test data with the trained forest (the command pattern follows slide 14):
       #hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i final_data/test.csv -ds final_data/train.info1 -m final-forest -a -mr -o final-pred
  12. Results Summary
      Results: confusion matrix and statistics (shown as a screenshot in the slides).
  13. Tuning
      Change the parameters -t and -sl and check the results (a sketch of an automated sweep follows).
      --nbtrees (-t) nbtrees: number of trees to grow.
      --selection (-sl) m: number of variables randomly selected at each tree node.
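      The sweep can be automated; a sketch, reusing the jar, paths, and flags from the earlier slides (the grid values are illustrative, and a working Hadoop/Mahout setup is assumed):

        import subprocess

        # Illustrative grid; each run writes its forest to a separate folder.
        for t in (100, 300, 600):          # number of trees (-t)
            for sl in (5, 25, 100):        # variables per node (-sl)
                subprocess.run([
                    "hadoop", "jar", "mahout-examples-0.9-job.jar",
                    "org.apache.mahout.classifier.df.mapreduce.BuildForest",
                    "-Dmapred.max.split.size=1874231",
                    "-d", "final_data/train.csv",
                    "-ds", "final_data/train.info1",
                    "-sl", str(sl), "-p", "-t", str(t),
                    "-o", "forest-t%d-sl%d" % (t, sl),
                ], check=True)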
  14. Results
      #hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d final_data/train.csv -ds final_data/train.info1 -sl 700 -p -t 600 -o final-forest2
      #hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i final_data/test.csv -ds final_data/train.info1 -m final-forest2 -a -mr -o final-pred2
  15. RF Split Selection
      Typically we select about sqrt(K) variables at each node, where K is the total number of predictors available.
      If we have 500 predictor columns, we select only about 23 of them.
      We split the node on the best variable among those 23, not the best variable among all 500.
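      The per-node subspace selection is easy to sketch in Python (illustrative only, not Mahout internals):

        import math
        import random

        def candidate_features(num_predictors):
            # Draw ~sqrt(K) distinct feature indices for one node split.
            m = round(math.sqrt(num_predictors))
            return random.sample(range(num_predictors), m)

        print(len(candidate_features(500)))  # 22, roughly the 23 quoted above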
  16. Using Gini Index
      For a node T containing examples from n classes, the gini index is gini(T) = 1 - sum_{j=1..n} p_j^2, where p_j is the relative frequency of class j in T.
      If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively (N = N1 + N2), the gini index of the split is gini_split(T) = (N1/N) * gini(T1) + (N2/N) * gini(T2).
      The attribute value that provides the smallest gini_split(T) is chosen to split the node.
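      These formulas translate directly into Python (a sketch; labels can be any hashable values):

        from collections import Counter

        def gini(labels):
            # gini(T) = 1 - sum_j p_j^2 over the classes present in T
            n = len(labels)
            return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

        def gini_split(left, right):
            # gini_split(T) = (N1/N) * gini(T1) + (N2/N) * gini(T2)
            n1, n2 = len(left), len(right)
            return (n1 * gini(left) + n2 * gini(right)) / (n1 + n2)

        print(gini_split([0, 0, 1, 1], [1, 1, 1]))  # ~0.2857

      Evaluating gini_split for every candidate threshold of an attribute and keeping the minimum yields tables like the HOME_TYPE one on slide 18.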
  17. Example
      The example below shows the construction of a single tree using the dataset.
      Only two of the original four attributes are chosen for this tree construction.
  18. Gini Split Values for HOME_TYPE
      The table below tabulates the gini split value for the HOME_TYPE attribute at all possible split points:

        Split                 Gini_split(T)
        HOME_TYPE <= 6        0.4000
        HOME_TYPE <= 10       0.2671
        HOME_TYPE <= 15       0.4671
        HOME_TYPE <= 30       0.3000
        HOME_TYPE <= 31       0.4800

      The split HOME_TYPE <= 10 has the lowest value and is therefore chosen.
