Introduction to Text Classification using Scala

Published in: Technology

  1. Introduction to Text Classification using Scala. Sudharshan
  2. What is Text Classification? Text classification uses machine learning and NLP to assign documents to separate categories based on the linguistic features present in the documents.
  3. Preprocessing. Input: a folder containing one subfolder of text files per class (e.g. Pos, Neg). TextDirectoryLoader: produces instances with class and other attributes. Tokenizer and Vectorizer: produce instances with feature vectors. Output: the processed train set in memory.
  4. Classifier Training. Training: the training set (feature vectors of all instances) is passed to a classifier training algorithm (some ML classification algorithm), which builds a model file. Prediction (testing a single instance): reads a string, preprocesses it, and classifies it as a particular class.
  5. Testing the Model: Cross Validation. Flowchart steps as listed on the slide: load test set data, randomize data, cross validation, preprocessing, tokenize, a Yes/No decision, confusion matrix results.
  6. Demo using Scala Breeze/Nak. For our demo we are using this SBT dependency:

         libraryDependencies += "org.scalanlp" % "nak" % "1.1.3"

     Project page: https://github.com/scalanlp/nak
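  7. Simple Example for Training and Evaluation

         // Imports assumed from the package layout of Nak 1.1.3
         import java.io.File
         import nak.NakContext._
         import nak.core._
         import nak.data._
         import nak.liblinear.LiblinearConfig
         import nak.util.ConfusionMatrix

         object TwentyNewsExample extends App {
           val directoryLocation = "/media/home/Work/Backup/365mediaBackup/corpus/20news-bydate"
           val newsgroupsDir = new File(directoryLocation)
           // Codec used when reading the 20 Newsgroups files
           implicit val isoCodec = scala.io.Codec("ISO-8859-1")
           val stopwords = Set("the","a","an","of","in","for","by","on")
           val trainDir = new File(newsgroupsDir, "small_train")
           val trainingExamples = fromLabeledDirs(trainDir).toList
           val featurizer = new BowFeaturizer(stopwords)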
  8. Training and Evaluation code

           // Training process
           val config = LiblinearConfig(cost = 5.0)
           val classifier = trainClassifier(config, featurizer, trainingExamples)
           println("training done")

           // Evaluation process
           val evalDir = new File(newsgroupsDir, "small_test")
           val maxLabelNews = maxLabel(classifier.labels) _
           val comparisons = for (ex <- fromLabeledDirs(evalDir).toList) yield
             (ex.label, maxLabelNews(classifier.evalRaw(ex.features)), ex.features)
           val (goldLabels, predictions, inputs) = comparisons.unzip3
           println(ConfusionMatrix(goldLabels, predictions, inputs))
         }
  9. Code and dataset will be available at https://github.com/rsudharshan/DataScienceWithScala and also on the Nak GitHub page.
  10. Questions? Thank you.
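
The Tokenizer and Vectorizer steps on the Preprocessing slide can be sketched in plain Scala. This is a minimal illustration only: `BagOfWordsSketch`, `tokenize`, and `featurize` are hypothetical names, not part of Nak, and this is not how Nak's `BowFeaturizer` is implemented. It assumes a simple lowercase, split-on-non-alphanumerics tokenizer and reuses the stopword set from the demo.

```scala
// Hypothetical helpers for illustration; these are not part of Nak.
object BagOfWordsSketch {
  // Same small stopword set as in the demo code.
  val stopwords: Set[String] = Set("the", "a", "an", "of", "in", "for", "by", "on")

  // Lowercase the text, split on runs of non-alphanumeric characters,
  // then drop empty tokens and stopwords.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z0-9]+").toSeq.filter(t => t.nonEmpty && !stopwords(t))

  // Bag-of-words feature vector: each remaining token mapped to its count.
  def featurize(text: String): Map[String, Int] =
    tokenize(text).groupBy(identity).map { case (token, occurrences) => (token, occurrences.size) }
}
```

For example, `BagOfWordsSketch.featurize("good, good movie")` returns `Map("good" -> 2, "movie" -> 1)`. A sparse map is used here because most documents contain only a small fraction of the vocabulary.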

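
The randomize, cross-validation, and confusion-matrix steps from the Testing the Model slide can be sketched in the same way. This is a rough illustration under stated assumptions, not Nak's `ConfusionMatrix` or any Nak cross-validation routine: `EvaluationSketch`, `kFolds`, `confusionCounts`, and `accuracy` are hypothetical names, labels are assumed to be plain strings, and the seed value is arbitrary.

```scala
import scala.util.Random

// Hypothetical helpers for illustration; these are not part of Nak.
object EvaluationSketch {
  // Shuffle the data with a fixed seed, then deal it into k folds of near-equal size.
  // Each fold can serve once as the test set while the other folds form the training set.
  def kFolds[A](data: Seq[A], k: Int, seed: Long = 42L): Seq[Seq[A]] = {
    val shuffled = new Random(seed).shuffle(data)
    shuffled.zipWithIndex
      .groupBy { case (_, index) => index % k }
      .toSeq
      .sortBy(_._1)
      .map(_._2.map(_._1))
  }

  // Count each (gold, predicted) label pair; pairs with equal labels are correct predictions.
  def confusionCounts(gold: Seq[String], predicted: Seq[String]): Map[(String, String), Int] =
    gold.zip(predicted).groupBy(identity).map { case (pair, occurrences) => (pair, occurrences.size) }

  // Fraction of instances whose predicted label matches the gold label.
  def accuracy(gold: Seq[String], predicted: Seq[String]): Double =
    if (gold.isEmpty) 0.0
    else gold.zip(predicted).count { case (g, p) => g == p }.toDouble / gold.size
}
```

With gold labels `Seq("pos", "pos", "neg", "neg")` and predictions `Seq("pos", "neg", "neg", "neg")`, `accuracy` returns `0.75` and `confusionCounts` reports one `("pos", "neg")` error.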