Group 7
December 8, 2014
Prof. Shanoan Tian
• Uses decision trees as a base.
• Normally one tree is used.
• The tree is explored in depth: many branches, many leaves.
• It has to be monitored and tailored by a trained statistician.
• Many decision trees
• Shallow exploration
• Slightly different dataset for each tree
• Specified questions
• Apache Mahout is a library implemented on top of Apache Hadoop
• Scalable machine-learning algorithms
• Uses the MapReduce paradigm
• Data mining tools for data stored on a Hadoop system
• Clustering
• Classification
• Batch-based collaborative filtering
The dataset we are using is the NSL-KDD dataset [2]. It is an improvement on the KDD 99 dataset [1]. The KDD 1999 data set is a collection of simulated computer network intrusion data.
• An Apache Software Foundation project to create scalable machine learning libraries
• http://mahout.apache.org
• Why Mahout?
  • Scalable machine learning algorithms
  • MapReduce implementations on Apache Hadoop
• Apache Mahout has implementations of several classification algorithms
  • Naïve Bayes
  • Complementary Naïve Bayes
  • Random Forest
  • Hidden Markov Models
  • Logistic Regression
• Most algorithms have a driver program
  • A shell script in $MAHOUT_HOME/bin helps with most tasks
• Prepare the data
  • Different algorithms require different setup
• Run the algorithm
  • Single node
  • Hadoop
• Print out the results
  • Several helper classes: ClusterDumper, etc.
• Make a directory in HDFS
$ hadoop fs -mkdir testdata
• Load the data into HDFS
$ hadoop fs -put ./downloads/data/* testdata
• Verify that the data was loaded
$ hadoop fs -ls testdata
$ hadoop jar mahout-examples-0.9-job.jar \
    org.apache.mahout.classifier.df.tools.Describe \
    -p testdata/KDDTrain+_20Percent.arff \
    -f testdata/KDDTrain+_20Percent.info \
    -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
• The "N 3 C 2 N C 4 N C 8 N 2 C 19 N L" string describes all the attributes of the data: 1 numerical (N) attribute, followed by 3 categorical (C) attributes, ... L indicates the label.
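The descriptor string can be read as "an optional repeat count, then a type token". The sketch below is only an illustration of that reading, not Mahout code: `expand_descriptor` is a hypothetical helper name, and it assumes a bare token counts once. It expands the string into one type per column so the column count can be checked against the dataset.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Mahout): expand a descriptor such as
# "N 3 C 2 N" into one type token per column. A number is treated as a
# repeat count for the type token that follows it; a bare token counts once.
expand_descriptor() {
  local count=1 token i
  local out=()
  for token in "$@"; do
    if [[ $token =~ ^[0-9]+$ ]]; then
      count=$token
    else
      for ((i = 0; i < count; i++)); do
        out+=("$token")
      done
      count=1
    fi
  done
  echo "${out[*]}"
}

# Expand the descriptor used on the slide and count the resulting columns.
expanded=$(expand_descriptor N 3 C 2 N C 4 N C 8 N 2 C 19 N L)
read -ra columns <<< "$expanded"
echo "$expanded"
echo "columns: ${#columns[@]}"   # prints "columns: 42"
```

Under this reading the descriptor covers 42 columns, which is consistent with 41 attributes plus the label (L) in the last position.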
14/12/08 07:37:53 INFO mapreduce.BuildForest: Build Time: 0h 0m 25s 949
14/12/08 07:37:53 INFO mapreduce.BuildForest: Forest num Nodes: 62706
14/12/08 07:37:53 INFO mapreduce.BuildForest: Forest mean num Nodes: 627
14/12/08 07:37:53 INFO mapreduce.BuildForest: Forest mean max Depth: 14
$ hadoop jar mahout-examples-0.9-job.jar \
    org.apache.mahout.classifier.df.mapreduce.TestForest \
    -i testdata/KDDTest+.arff \
    -ds testdata/KDDTrain+_20Percent.info \
    -m nsl-forest -a -mr -o predictions
• Predicts on the "KDDTest+.arff" dataset (-i) using the data descriptor generated for the training set (-ds) and the decision forest built previously (-m), and computes the confusion matrix (-a).
• Passing -mr runs the prediction on the Hadoop MapReduce framework.
Correctly Classified Instances      17639    78.2425%
Incorrectly Classified Instances     4905    21.7575%
Total Classified Instances          22544   100%
Confusion Matrix
A       B       Total   <-- classified as
9454    257     9711    A = normal
4648    8185    12833   B = anomaly
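The summary percentages can be re-derived from the four confusion-matrix cells. The sketch below is an illustration only; it assumes rows are the actual class and columns the predicted class (consistent with the row totals above), and the variable names and the `ratio` helper are hypothetical, not Mahout output.

```shell
#!/usr/bin/env bash
# Confusion-matrix cells from the slide.
# Assumption: rows = actual class, columns = predicted class.
normal_as_normal=9454
normal_as_anomaly=257
anomaly_as_normal=4648
anomaly_as_anomaly=8185

# ratio NUM DEN: print NUM/DEN as a percentage with four decimals.
ratio() {
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.4f", 100 * n / d }'
}

total=$((normal_as_normal + normal_as_anomaly + anomaly_as_normal + anomaly_as_anomaly))
correct=$((normal_as_normal + anomaly_as_anomaly))
actual_normal=$((normal_as_normal + normal_as_anomaly))
actual_anomaly=$((anomaly_as_normal + anomaly_as_anomaly))

echo "total instances:      $total"
echo "accuracy:             $(ratio "$correct" "$total")%"
# Share of actual anomalies that were flagged as anomalies.
echo "anomaly detection:    $(ratio "$anomaly_as_anomaly" "$actual_anomaly")%"
# Share of actual normal records that were wrongly flagged as anomalies.
echo "false alarm rate:     $(ratio "$normal_as_anomaly" "$actual_normal")%"
```

The diagonal cells (9454 + 8185) give the 17639 correctly classified instances out of 22544. Most of the errors come from anomalies classified as normal (4648), not from false alarms (257).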
• http://nsl.cs.unb.ca/NSL-KDD/

Network Intrusion Detection Analysis using Random Forest Algorithm on Apache Mahout
