
Programming Collective Intelligence (집단지성프로그래밍) Ch. 6: Document Filtering

Seminar material



  1. Document Filtering. Programming Collective Intelligence, Ch. 6. 허윤.
  2. Document Filtering: filtering is a classification problem. Data mining problems: classification, estimation, prediction, clustering, description, affinity grouping. What is a document? A set of features: a text document, an image, etc.
  3. Spam Filtering: a binary classification problem ('spam' or 'ham'). Techniques: naïve Bayesian classifier, support vector machine, decision tree. Rules vs. models.
  4. Spam Filtering in Practice. Reference: Sahil Puri et al., "Comparison and Analysis of Spam Detection Algorithms", IJAIEM, 2013.
  5. Reference: Rene, "New insights into Gmail's spam filtering", emailmarketingtipps.de, 2012.
  6. Naïve Bayesian Classifier. A Bayesian classifier; why "naïve"? It applies Bayes' theorem together with a strong (naïve) assumption that the features are independent of one another.
  7. Example: 1. the probability that box A is chosen, P(A) = 7/10; 2. the probability of drawing a white ball from box A, P(white | A) = 2/10; 3. the probability that box A is chosen and a white ball is drawn from it; 4. the overall probability of drawing a white ball.
  8. Example: a white ball has been drawn from somewhere. What is the probability that it came from box A, P(A | white), and the probability that it came from box B, P(B | white)? P(A | white) = ?
  9. Example (the derivation is worked out on the slide).
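
Writing the example out (only P(A) = 7/10 and P(white | A) = 2/10 appear in the transcript; P(B) = 3/10 assumes A and B are the only two boxes, and P(white | B) is left symbolic because its value is not given here):

    P(A \cap \text{white}) = P(\text{white} \mid A)\,P(A) = \tfrac{2}{10}\cdot\tfrac{7}{10} = \tfrac{7}{50}
    P(\text{white}) = P(\text{white} \mid A)\,P(A) + P(\text{white} \mid B)\,P(B) = \tfrac{7}{50} + P(\text{white} \mid B)\cdot\tfrac{3}{10}
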
  10. Bayes' Rule: ❶ the conditional probability of A given B, ❷ the conditional probability of B given A, ❸ Bayes' rule.
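
For reference, the three pieces labelled ❶ to ❸ on the slide correspond to the standard identities; the last line applies Bayes' rule to the example above:

    ❶  P(A \mid B) = \frac{P(A \cap B)}{P(B)}
    ❷  P(B \mid A) = \frac{P(A \cap B)}{P(A)}
    ❸  P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
        \Rightarrow\; P(A \mid \text{white}) = \frac{P(\text{white} \mid A)\,P(A)}{P(\text{white})}
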
  11. Implementation: extracting words from a document.
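
A minimal sketch of such a word-extraction function, in the spirit of the chapter's getwords helper (the regular expression and the word-length limits are assumptions, not taken from the slide):

    import re

    def getwords(doc):
        # Split the document on any non-alphanumeric character
        splitter = re.compile(r'\W+')
        words = [w.lower() for w in splitter.split(doc) if 2 < len(w) < 20]
        # Each distinct word counts only once per document
        return dict((w, 1) for w in words)
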
  12. Implementation (preparation): representation of the classifier.
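
A sketch of that representation, assuming the same structure as the book's Chapter 6 classifier: two dictionaries of counts plus a feature-extraction function.

    class classifier:
        def __init__(self, getfeatures):
            # fc: counts of feature/category pairs,
            #     e.g. {'money': {'spam': 12, 'ham': 1}}
            self.fc = {}
            # cc: how many documents of each category have been seen,
            #     e.g. {'spam': 30, 'ham': 50}
            self.cc = {}
            # getfeatures: turns a document into its features
            # (e.g. the getwords sketch above)
            self.getfeatures = getfeatures
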
  13. Implementation (preparation): how to access the count dictionaries.
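
Possible accessor and update methods, continuing the class body above (the names incf, incc, fcount, catcount, totalcount and categories follow the book; the slide's exact code may differ):

    class classifier:
        # __init__ as in the previous sketch
        ...

        def incf(self, f, cat):
            # Increase the count of a feature/category pair
            self.fc.setdefault(f, {}).setdefault(cat, 0)
            self.fc[f][cat] += 1

        def incc(self, cat):
            # Increase the count of a category
            self.cc[cat] = self.cc.get(cat, 0) + 1

        def fcount(self, f, cat):
            # How often a feature has appeared in a category
            return float(self.fc.get(f, {}).get(cat, 0))

        def catcount(self, cat):
            # Number of documents seen in a category
            return float(self.cc.get(cat, 0))

        def totalcount(self):
            # Total number of documents seen
            return sum(self.cc.values())

        def categories(self):
            # All known categories
            return list(self.cc.keys())
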
  14. Implementation (preparation): training.
  15. Implementation (preparation): training (continued).
  16. Implementation (preparation): training (continued).
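
A sketch of the training step, again as a method of the classifier class above: every feature of the document is counted under the given category, and the category's document count is incremented. The example documents below are made up for illustration.

    class classifier:
        # __init__ and the helper methods as in the previous sketches
        ...

        def train(self, item, cat):
            # Count every feature of this document under the given category
            for f in self.getfeatures(item):
                self.incf(f, cat)
            # Count the document itself under the category
            self.incc(cat)

    # Usage (assuming the pieces above are assembled into one class):
    cl = classifier(getwords)
    cl.train('Nobody owns the water.', 'ham')
    cl.train('make quick money at the online casino', 'spam')
    print(cl.fcount('money', 'spam'))   # -> 1.0
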
  17. Recall Bayes' theorem: p(category | document) = p(document | category) * p(category) / p(document).
  18. Implementation (classifier): P(feature | category).
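
A sketch of this conditional probability as a further method on the classifier class, computed as the fraction of the category's documents that contain the feature (the method name fprob follows the book):

    class classifier:
        # earlier methods as sketched above
        ...

        def fprob(self, f, cat):
            # P(feature | category): fraction of the category's documents
            # that contain this feature
            if self.catcount(cat) == 0:
                return 0
            return self.fcount(f, cat) / self.catcount(cat)
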
  19. Implementation (classifier): an assumed probability to deal with data sparseness.
  20. Implementation (classifier): an assumed probability to deal with data sparseness (continued).
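
One way to implement the assumed probability, as in the book's weightedprob: start from an assumed value (0.5 by default) and move towards the observed probability as more evidence for the feature accumulates, so rarely seen features are not pinned to 0 or 1.

    class classifier:
        # earlier methods as sketched above
        ...

        def weightedprob(self, f, cat, prf, weight=1.0, ap=0.5):
            # Probability estimated from the training data so far
            basicprob = prf(f, cat)
            # How often this feature has appeared across all categories
            totals = sum(self.fcount(f, c) for c in self.categories())
            # Weighted average of the assumed probability (ap) and the
            # observed one
            return ((weight * ap) + (totals * basicprob)) / (weight + totals)
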
  21. Implementation (classifier): P(document | category) and the document representation (sketched below, together with slide 22).
  22. Implementation (classifier): P(document | category) * P(category).
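
A sketch of both steps as a naive Bayes subclass: P(document | category) is the product of the per-feature probabilities (the naive independence assumption), and the score for a category multiplies that by the prior P(category). The denominator P(document) is the same for every category, so it can be ignored when ranking.

    class naivebayes(classifier):
        def docprob(self, item, cat):
            # Naive assumption: P(document | category) is the product of the
            # individual P(feature | category) values
            p = 1.0
            for f in self.getfeatures(item):
                p *= self.weightedprob(f, cat, self.fprob)
            return p

        def prob(self, item, cat):
            # Unnormalised P(category | document)
            #   = P(document | category) * P(category)
            catprob = self.catcount(cat) / self.totalcount()
            return self.docprob(item, cat) * catprob
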
  23. Implementation (classifier): the classifier itself.
  24. Implementation (classifier): the classifier itself (continued).
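
A sketch of the final classification step on the naivebayes class above. The per-category thresholds are the book's device for making the classifier conservative: the best category must beat every other category's score by its threshold, otherwise a default value is returned.

    class naivebayes(classifier):
        def __init__(self, getfeatures):
            classifier.__init__(self, getfeatures)
            # Minimum ratio by which the best category must win
            self.thresholds = {}

        # docprob and prob as in the previous sketch
        ...

        def setthreshold(self, cat, t):
            self.thresholds[cat] = t

        def getthreshold(self, cat):
            return self.thresholds.get(cat, 1.0)

        def classify(self, item, default=None):
            # Score every category and remember the best one
            probs, best, maxp = {}, default, 0.0
            for cat in self.categories():
                probs[cat] = self.prob(item, cat)
                if probs[cat] > maxp:
                    maxp, best = probs[cat], cat
            # Accept it only if it beats every other category by its threshold
            for cat in probs:
                if cat == best:
                    continue
                if probs[cat] * self.getthreshold(best) > probs[best]:
                    return default
            return best

    # Usage: cl = naivebayes(getwords); cl.setthreshold('spam', 3.0) makes
    # 'spam' win only when it scores at least 3x higher than every other category.
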
  25. Fisher's Method. Recall the naïve Bayesian classifier: first, p(document | category) = p(feature_1 | category) * p(feature_2 | category) * ... * p(feature_N | category), and from that p(category | document). Fisher's method instead starts from p(category | feature) = (number of documents in the category containing the feature) / (number of documents containing the feature).
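
A sketch of the Fisher classifier in the same style, following the book's cprob/fisherprob/invchi2 code rather than anything shown verbatim on the slide. Note that the book's cprob normalises fprob across categories (to compensate for unequal category sizes), which differs slightly from the slide's raw document-count ratio when the categories are unbalanced; the per-feature values are then combined with an inverse chi-square function instead of a plain product.

    import math

    class fisherclassifier(classifier):
        def cprob(self, f, cat):
            # P(category | feature), normalised over all categories
            clf = self.fprob(f, cat)
            if clf == 0:
                return 0
            return clf / sum(self.fprob(f, c) for c in self.categories())

        def fisherprob(self, item, cat):
            # Multiply the per-feature probabilities together ...
            p = 1.0
            features = list(self.getfeatures(item))
            for f in features:
                p *= self.weightedprob(f, cat, self.cprob)
            # ... then feed -2 ln(p) into an inverse chi-square test with
            # 2 * (number of features) degrees of freedom
            fscore = -2 * math.log(p)
            return self.invchi2(fscore, len(features) * 2)

        def invchi2(self, chi, df):
            # Probability of a chi-square value at least this large
            m = chi / 2.0
            term = math.exp(-m)
            s = term
            for i in range(1, df // 2):
                term *= m / i
                s += term
            return min(s, 1.0)
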
  26. Q&A. Thank you.
