
2013-1 Machine Learning Lecture 03 - Sergio Jimenez - Text Classification …




  1. Text Classification and Clustering with WEKA. A guided example by Sergio Jiménez
  2. The Task: build a model for movie reviews in English that classifies each review as positive or negative.
  3. Sentiment Polarity Dataset Version 2.0: 1000 positive and 1000 negative movie review texts from: Thumbs up? Sentiment Classification using Machine Learning Techniques. Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Proceedings of EMNLP, pp. 79–86, 2002. "Our data source was the Internet Movie Database (IMDb) archive of the newsgroup. We selected only reviews where the author rating was expressed either with stars or some numerical value (other conventions varied too widely to allow for automatic processing). Ratings were automatically extracted and converted into one of three categories: positive, negative, or neutral. For the work described in this paper, we concentrated only on discriminating between positive and negative sentiment."
  4. The Data (1/2)
  5. The Data (2/2): histograms of review length (# characters vs. # documents) for the 1000 negative and the 1000 positive reviews. [two histogram plots]
  6. What is WEKA? • "Weka is a collection of machine learning algorithms for data mining tasks." • "Weka contains tools for: data pre-processing, classification, regression, clustering, association rules, and visualization."
  7. Where to start?
  8. Getting WEKA
  9. Before running WEKA: increase the memory available to Java in RunWeka.ini by changing maxheap=256m to maxheap=1024m.
  10. Running WEKA using "RunWeka.bat"
  11. Creating a .arff dataset
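The slide builds the .arff file through the WEKA GUI. As a minimal sketch of what such a file looks like, the Python snippet below writes an ARFF with one string attribute and a nominal class, the shape used for this task. The file name matches the deck; the two example reviews are illustrative, not from the real corpus.

```python
# Sketch: write a minimal WEKA .arff file with a string "text" attribute
# and a nominal {pos,neg} class. Example rows are made up for illustration.

def write_arff(path, examples):
    """examples: list of (text, label) pairs; label in {"pos", "neg"}."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("@relation movie_reviews\n\n")
        f.write("@attribute text string\n")
        f.write("@attribute class {pos,neg}\n\n")
        f.write("@data\n")
        for text, label in examples:
            quoted = text.replace("'", "\\'")  # escape quotes inside the string value
            f.write(f"'{quoted}',{label}\n")

write_arff("movie_reviews.arff", [
    ("great movie", "pos"),
    ("worst film ever", "neg"),
])
```

WEKA's TextDirectoryLoader produces an equivalent layout from a directory of text files, one string attribute plus the class taken from the folder name.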
  12. Saving the .arff dataset
  13. From text to vectors. Each review becomes a vector V = [v1, v2, v3, ..., vn, class] over the vocabulary {ever, excellent, film, great, movie, sucks, worst}:

      review1 = "great movie"      →  V1 = [0, 0, 0, 1, 1, 0, 0, +]
      review2 = "excellent film"   →  V2 = [0, 1, 1, 0, 0, 0, 0, +]
      review3 = "worst film ever"  →  V3 = [1, 0, 1, 0, 0, 0, 1, −]
      review4 = "sucks"            →  V4 = [0, 0, 0, 0, 0, 1, 0, −]
  14. Converting to Vector Space Model. Edit "movie_reviews.arff" and change "class" to "class1". Apply the filter again after the change.
  15. Visualize the vector data
  16. StringToWordVector filter options: lowercase conversion, TF-IDF weighting, stopword removal using a list of words in a file, stemming, and using frequencies instead of single presence.
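To make the TF-IDF option concrete, here is a minimal sketch of the weighting idea (tf × log(N/df), with lowercasing). It illustrates the concept only; WEKA's StringToWordVector has its own exact formula and options.

```python
import math

# Sketch of TF-IDF weighting with lowercasing: weight = tf * log(N / df),
# where N is the number of documents and df the number containing the word.
# Illustrative only; not WEKA's exact implementation.

def tfidf(docs):
    docs = [d.lower().split() for d in docs]
    n = len(docs)
    df = {}
    for doc in docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    weights = []
    for doc in docs:
        weights.append({t: doc.count(t) * math.log(n / df[t]) for t in set(doc)})
    return weights

w = tfidf(["Great movie", "great film", "worst film ever"])
# "movie" occurs in only one of three documents, so in doc 0 it outweighs
# "great", which occurs in two.
```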
  17. Generating datasets for experiments

      dataset file name      Stopwords  Stemming  Presence or freq.
      movie_reviews_1.arff   kept       no        presence
      movie_reviews_2.arff   kept       no        frequency
      movie_reviews_3.arff   kept       yes       presence
      movie_reviews_4.arff   kept       yes       frequency
      movie_reviews_5.arff   removed    no        presence
      movie_reviews_6.arff   removed    no        frequency
      movie_reviews_7.arff   removed    yes       presence
      movie_reviews_8.arff   removed    yes       frequency
  18. Classifying Reviews: select a classifier, select the class attribute, select the number of folds, and click Start!
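The classifier the deck evaluates below is NaiveBayesMultinomial. As a rough sketch of that model family (multinomial Naive Bayes with Laplace smoothing), the snippet below trains on the four toy reviews from slide 13 and predicts a label; it is a toy illustration, not WEKA's code or the real experiment.

```python
import math
from collections import Counter, defaultdict

# Toy multinomial Naive Bayes with Laplace smoothing -- the same model
# family as WEKA's NaiveBayesMultinomial, sketched on made-up data.

def train(docs):
    """docs: list of (word list, label). Returns the model's counts."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict(model, words):
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for label, count in class_counts.items():
        lp = math.log(count / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            if w in vocab:  # ignore words never seen in training
                lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train([
    ("great movie".split(), "+"),
    ("excellent film".split(), "+"),
    ("worst film ever".split(), "-"),
    ("sucks".split(), "-"),
])
print(predict(model, "a great film".split()))  # prints "+"
```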
  19. Results
  20. Results: Correctly Classified Reviews

      dataset name           Stopwords  Stemming  Presence/freq.  NaiveBayes 3-fold  NaiveBayesMultinomial 3-fold
      movie_reviews_1.arff   kept       no        presence        80.65%             83.80%
      movie_reviews_2.arff   kept       no        frequency       69.30%             78.65%
      movie_reviews_3.arff   kept       yes       presence        79.40%             82.15%
      movie_reviews_4.arff   kept       yes       frequency       68.10%             79.70%
      movie_reviews_5.arff   removed    no        presence        81.80%             84.35%
      movie_reviews_6.arff   removed    no        frequency       69.40%             81.75%
      movie_reviews_7.arff   removed    yes       presence        78.90%             82.40%
      movie_reviews_8.arff   removed    yes       frequency       68.30%             80.50%
  21. Attribute (word) Selection: choose an attribute selection algorithm and select the class attribute.
  22. Selected Attributes (words): also, awful, bad, boring, both, dull, fails, pointless, poor, ridiculous, script, seagal, sometimes, stupid, deserves, effective, flaws, greatest, hilarious, memorable, overall, great, joke, lame, life, many, maybe, mess, nothing, others, perfect, performances, stupid, tale, terrible, true, visual, waste, wasted, world, worst, animation, definitely, overall, perfectly, realistic, share, solid, subtle, terrific, unlike, view, wonderfully
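One common scoring criterion behind attribute selection in WEKA is information gain (e.g. InfoGainAttributeEval): how much knowing whether a word is present reduces the entropy of the class. A minimal sketch on the toy reviews from slide 13 (the word scores here are illustrative, not the real experiment's):

```python
import math
from collections import Counter

# Sketch of information-gain scoring for a single word attribute:
# gain(word) = H(class) - sum over {present, absent} of p * H(class | split).

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(docs, word):
    """docs: list of (word set, label). Gain of splitting on `word`."""
    labels = [lab for _, lab in docs]
    with_w = [lab for words, lab in docs if word in words]
    without = [lab for words, lab in docs if word not in words]
    n = len(docs)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (with_w, without) if part)
    return entropy(labels) - remainder

docs = [({"great", "movie"}, "+"), ({"excellent", "film"}, "+"),
        ({"worst", "film", "ever"}, "-"), ({"sucks"}, "-")]
# "film" occurs once in each class, so it carries zero gain;
# "worst" occurs only in a negative review, so its gain is positive.
```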
  23. Pruned movie_reviews_1.arff dataset
  24. Naïve Bayes with the pruned dataset
  25. Clustering. Correctly clustered instances: 65.25%
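The 65.25% figure comes from WEKA's "classes to clusters" evaluation. A simple variant of that idea, sketched below under the assumption that each cluster is mapped to its majority class (WEKA computes its own minimum-error mapping), counts the fraction of instances whose class matches their cluster's assigned class.

```python
from collections import Counter

# Sketch of a classes-to-clusters style evaluation: map each cluster to its
# majority class and report the fraction of instances that match. This is a
# simplified stand-in for WEKA's own evaluation, on made-up assignments.

def classes_to_clusters_accuracy(cluster_ids, labels):
    by_cluster = {}
    for cid, lab in zip(cluster_ids, labels):
        by_cluster.setdefault(cid, []).append(lab)
    correct = sum(Counter(labs).most_common(1)[0][1]
                  for labs in by_cluster.values())
    return correct / len(labels)

acc = classes_to_clusters_accuracy(
    [0, 0, 1, 1, 0, 1], ["+", "+", "-", "-", "-", "-"])
print(acc)  # 5 of 6 instances match their cluster's majority class
```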
  26. Other results: results of Pang et al. (2002) with version 1.0 of the dataset, using 700 positive and 700 negative reviews.
  27. Thanks