
Big Data Analytics using Mahout

This lab describes the use of Apache Mahout for Machine Learning on a Hadoop platform.


  1. 1. Big Data Analytics Using Mahout Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute April 2015
  2. 2. 2 Mahout
  3. 3. 3 What is Mahout? Mahout is a Java library implementing Machine Learning techniques for clustering, classification and recommendation
  4. 4. 4 Mahout in Apache Software
  5. 5. 5 Why Mahout? Apache License Good Community Good Documentation Scalable Extensible Command Line Interface Java Library
  6. 6. 6 List of Algorithms
  7. 7. 7 List of Algorithms
  8. 8. 8 List of Algorithms
  9. 9. 9 Mahout Architecture
  10. 10. 10 Use Cases
  11. 11. 11 Installing Mahout
  12. 12. 12
  13. 13. 13 Select the EC2 service and click on Launch Instance
  14. 14. 14 Choose My AMIs and select “Hadoop Lab Image”
  15. 15. 15 Choose an m3.medium type virtual server
  16. 16. 16 Leave configuration details as default
  17. 17. 17 Add Storage: 20 GB
  18. 18. 18 Name the instance
  19. 19. 19 Select an existing security group > Select Security Group Name: default
  20. 20. 20 Click Launch and choose imchadoop as a key pair
  21. 21. 21 Review the instance / click Connect for instructions on how to connect to the instance
  22. 22. 22 Connect to an instance from Mac/Linux
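On Mac/Linux the connection is a plain ssh session with the downloaded key pair; a minimal sketch, assuming the key file is named imchadoop.pem and the placeholder is replaced with the instance's public DNS name: $ chmod 400 imchadoop.pem $ ssh -i imchadoop.pem ubuntu@<instance-public-dns>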
  23. 23. 23 Connect to an instance from Windows using Putty
  24. 24. 24 Connect to the instance
  25. 25. 25 Install Maven $ sudo apt-get install maven $ mvn -v
  26. 26. 26 Install Subversion $ sudo apt-get install subversion $ svn --version
  27. 27. 27 Install Mahout $ cd /usr/local/ $ sudo mkdir mahout $ cd mahout $ sudo svn co http://svn.apache.org/repos/asf/mahout/trunk $ cd trunk $ sudo mvn install -DskipTests
  28. 28. 28 Install Mahout (cont.)
  29. 29. 29 Edit the .bashrc file $ sudo vi $HOME/.bashrc $ exec bash
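The exact entries depend on where Mahout was checked out; a minimal sketch of the lines to append to $HOME/.bashrc, assuming the /usr/local/mahout/trunk build location used above: export MAHOUT_HOME=/usr/local/mahout/trunk export PATH=$PATH:$MAHOUT_HOME/bin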
  30. 30. 30 Running Recommendation Algorithms
  31. 31. 31 MovieLens http://grouplens.org/datasets/movielens/
  32. 32. 32 Architecture for Recommender Engine
  33. 33. 33 Item-Based Recommendation Step 1: Gather some test data Step 2: Pick a similarity measure Step 3: Configure the Mahout command Step 4: Making use of the output and doing more with Mahout
  34. 34. 34 Preparing MovieLens data $ wget http://files.grouplens.org/datasets/movielens/ml-100k.zip $ unzip ml-100k.zip $ hadoop fs -mkdir /input $ hadoop fs -put ml-100k/u.data /input/u.data $ hadoop fs -mkdir /results $ unset MAHOUT_LOCAL
  35. 35. 35 Running Recommend Command $ mahout recommenditembased -i /input/u.data -o /results/itemRecom.txt -s SIMILARITY_LOGLIKELIHOOD --tempDir /temp/recommend1 $ hadoop fs -ls /results/itemRecom.txt
  36. 36. 36 View the result $ hadoop fs -cat /results/itemRecom.txt/part-r-00000
  37. 37. 37 Similarity Classname SIMILARITY_COOCCURRENCE SIMILARITY_LOGLIKELIHOOD SIMILARITY_TANIMOTO_COEFFICIENT SIMILARITY_CITY_BLOCK SIMILARITY_COSINE SIMILARITY_PEARSON_CORRELATION SIMILARITY_EUCLIDEAN_DISTANCE
  38. 38. 38 Running Recommendation in a single machine $ export MAHOUT_LOCAL=true $ mahout recommenditembased -i ml-100k/u.data -o /results/itemRecom.txt -s SIMILARITY_LOGLIKELIHOOD --numRecommendations 5 $ cat results/itemRecom.txt/part-r-00000
  39. 39. 39 Running Example Program Using the CBayes classifier
  40. 40. 40 Running Example Program
  41. 41. 41 Preparing data $ export WORK_DIR=/tmp/mahout-work-${USER} $ mkdir -p ${WORK_DIR} $ mkdir -p ${WORK_DIR}/20news-bydate $ cd ${WORK_DIR}/20news-bydate $ wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz $ tar -xzf 20news-bydate.tar.gz $ mkdir ${WORK_DIR}/20news-all $ cd $ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all
  42. 42. 42 Note: Running on MapReduce If you want to run in MapReduce mode, you need to run the following commands before running the feature extraction commands $ unset MAHOUT_LOCAL $ hadoop fs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all
  43. 43. 43 Preparing the Sequence File Mahout provides a utility to convert the given input files into the sequence file format. The input directory is where the original data resides; the output directory is where the converted data is to be stored.
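The utility referred to here is the seqdirectory job; a hedged sketch of its general form, with the paths left as placeholders: $ mahout seqdirectory -i <input file path> -o <output file path> -c UTF-8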
  44. 44. 44 Sequence Files Sequence files are binary encoding of key/value pairs. There is a header on the top of the file organized with some metadata information which includes: – Version – Key name – Value name – Compression To view the sequential file mahout seqdumper -i <input file> | more
  45. 45. 45 Generate Vectors from Sequence Files Mahout provides a command to create vector files from sequence files. mahout seq2sparse -i <input file path> -o <output file path> Important Options: -lnorm Whether output vectors should be logNormalize. -nv Whether output vectors should be NamedVectors -wt The kind of weight to use. Currently TF or TFIDF. Default: TFIDF
  46. 46. 46 Extract Features Convert the full 20 newsgroups dataset into a < Text, Text > SequenceFile. Convert and preprocesses the dataset into a < Text, VectorWritable > SequenceFile containing term frequencies for each document.
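A sketch of these two conversion steps, following the standard Mahout 20newsgroups example; the 20news-seq and 20news-vectors directory names under ${WORK_DIR} are assumptions: $ mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq -ow $ mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf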
  47. 47. 47 Prepare Testing Dataset Split the preprocessed dataset into training and testing sets.
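A hedged sketch using the mahout split utility; the 20% test selection and the output directory names are assumptions: $ mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/20news-train-vectors --testOutput ${WORK_DIR}/20news-test-vectors --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential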
  48. 48. 48 Training process Train the classifier.
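A sketch of the training step based on the standard Mahout 20newsgroups example; -c selects complementary (CBayes) training, -el extracts the label index from the input, and the model/labelindex paths are assumptions: $ mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow -c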
  49. 49. 49 Testing the result Test the classifier.
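A matching sketch of the testing step, assuming the model and label index locations used above; it prints a confusion matrix and accuracy summary: $ mahout testnb -i ${WORK_DIR}/20news-test-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing -c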
  50. 50. 50 Dumping a vector file We can dump vector files to normal text ones, as follows mahout vectordump -i <input file> -o <output file> Options --useKey If the key is a vector then dump that instead --csv Output the Vector as CSV --dictionary The dictionary file.
  51. 51. 51 Sample Output
  52. 52. 52 Command line options
  53. 53. 53 Command line options
  54. 54. 54 Command line options
  55. 55. 55 K-means clustering
  56. 56. 56 Reuters Newswire
  57. 57. 57 Preparing data $ export WORK_DIR=/tmp/kmeans $ mkdir $WORK_DIR $ mkdir $WORK_DIR/reuters-out $ cd $WORK_DIR $ wget http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz $ mkdir $WORK_DIR/reuters-sgm $ tar -xzf reuters21578.tar.gz -C $WORK_DIR/reuters-sgm
  58. 58. 58 Convert input to a sequential file $ mahout org.apache.lucene.benchmark.utils.ExtractReuters $WORK_DIR/reuters-sgm $WORK_DIR/reuters-out
  59. 59. 59 Convert input to a sequential file (cont) $ mahout seqdirectory -i $WORK_DIR/reuters-out -o $WORK_DIR/reuters-out-seqdir -c UTF-8 -chunk 5
  60. 60. 60 Create the sparse vector files $ mahout seq2sparse -i $WORK_DIR/reuters-out-seqdir/ -o $WORK_DIR/reuters-out-seqdir-sparse-kmeans --maxDFPercent 85 --namedVector
  61. 61. 61 Running K-Means $ mahout kmeans -i $WORK_DIR/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ -c $WORK_DIR/reuters-kmeans-clusters -o $WORK_DIR/reuters-kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow
  62. 62. 62 K-Means command line options
  63. 63. 63 Viewing Result $ mkdir $WORK_DIR/reuters-kmeans/clusteredPoints $ mahout clusterdump -i $WORK_DIR/reuters-kmeans/clusters-*-final -o $WORK_DIR/reuters-kmeans/clusterdump -d $WORK_DIR/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure -sp 0 --pointsDir $WORK_DIR/reuters-kmeans/clusteredPoints
  64. 64. 64 Viewing Result
  65. 65. 65 Dumping a cluster file We can dump cluster files to normal text ones, as follows mahout clusterdump -i <input file> -o <output file> Options -of The optional output format for the results. Options: TEXT, CSV, JSON or GRAPH_ML -dt The dictionary file type --evaluate Run ClusterEvaluator
  66. 66. 66 Canopy Clustering
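A hedged sketch of running canopy clustering on the same Reuters TF-IDF vectors; the T1/T2 distance thresholds and the output directory are assumptions to be tuned for the data: $ mahout canopy -i $WORK_DIR/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ -o $WORK_DIR/reuters-canopy -dm org.apache.mahout.common.distance.CosineDistanceMeasure -t1 0.5 -t2 0.3 -ow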
  67. 67. 67 Fuzzy k-means Clustering
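A hedged sketch using the fkmeans driver on the same vectors; the fuzziness factor -m 1.1, the cluster count and the output paths are assumptions: $ mahout fkmeans -i $WORK_DIR/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ -c $WORK_DIR/reuters-fkmeans-clusters -o $WORK_DIR/reuters-fkmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -m 1.1 -ow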
  68. 68. 68 Command line options
  69. 69. 69 Exercise: Traffic Accidents Dataset http://fimi.ua.ac.be/data/accidents.dat.gz
  70. 70. 70 Import-Export RDBMS data
  71. 71. 71 Sqoop Hands-On Labs 1. Loading Data into MySQL DB 2. Installing Sqoop 3. Configuring Sqoop 4. Installing DB driver for Sqoop 5. Importing data from MySQL to Hive Table 6. Reviewing data from Hive Table 7. Reviewing HDFS Database Table files
  72. 72. 72 1. MySQL RDS Server on AWS An RDS Server is running on AWS with the following configuration > database: imc_db > username: admin > password: imcinstitute > addr: imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com [This address may change]
  73. 73. 73 1. country_tbl data Testing data query from MySQL DB Table name > country_tbl
  74. 74. 74 2. Installing Sqoop # wget http://apache.osuosl.org/sqoop/1.4.5/sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz # tar -xvzf sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz # sudo mv sqoop-1.4.5.bin__hadoop-1.0.0 /usr/local/ # rm sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
  75. 75. 75 Installing Sqoop Edit $HOME/.bashrc # sudo vi $HOME/.bashrc
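A minimal sketch of the environment entries typically appended to .bashrc for this layout; the path assumes the /usr/local install directory used above: export SQOOP_HOME=/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0 export PATH=$PATH:$SQOOP_HOME/bin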
  76. 76. 76 3. Configuring Sqoop ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/conf/ ubuntu@ip-172-31-12-11:~$ vi sqoop-env.sh
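In sqoop-env.sh the Hadoop locations are pointed at the local installation; a hedged sketch assuming Hadoop is installed under /usr/local/hadoop: export HADOOP_COMMON_HOME=/usr/local/hadoop export HADOOP_MAPRED_HOME=/usr/local/hadoop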
  77. 77. 77 4. Installing DB driver for Sqoop ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib/ ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ wget https://www.dropbox.com/s/6zrp5nerrwfixcj/mysql-connector-java-5.1.23-bin.jar ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ exit
  78. 78. 78 5. Importing data from MySQL to Hive Table [hdadmin@localhost ~]$ sqoop import --connect jdbc:mysql://imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com/imc_db --username admin -P --table country_tbl --hive-import --hive-table country -m 1 Warning: /usr/lib/hbase does not exist! HBase imports will fail. Please set $HBASE_HOME to the root of your HBase installation. Warning: $HADOOP_HOME is deprecated. Enter password: <enter here>
  79. 79. 79 6. Reviewing data from Hive Table
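A quick check can be done from the Hive shell; a minimal sketch assuming the imported table is named country as above: $ hive hive> show tables; hive> select * from country limit 10;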
  80. 80. 80 7. Reviewing HDFS Database Table files Start a web browser at http://54.68.149.232:50070 then navigate to /user/hive/warehouse
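The same files can also be listed from the command line; a minimal sketch assuming the default Hive warehouse directory (the part file name may differ): $ hadoop fs -ls /user/hive/warehouse/country $ hadoop fs -cat /user/hive/warehouse/country/part-m-00000 | head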
  81. 81. 81 Sqoop commands
  82. 82. 82 Recommended Books
  83. 83. 83 www.facebook.com/imcinstitute
  84. 84. 84 Thank you thanachart@imcinstitute.com www.facebook.com/imcinstitute www.slideshare.net/imcinstitute www.thanachart.org
