Programming Collective Intelligence - Chapter 6: Document Filtering

Seminar material

  1. Document Filtering. Programming Collective Intelligence, Ch. 6. 허윤
  2. Document Filtering. Filtering == classification problem. Data mining problems: estimation, classification, prediction, clustering, description, affinity grouping. Document? A set of features (text document, image, etc.). p( document ) = ?
  3. Spam Filtering. Binary classification problem: 'Spam' or 'Ham'. Techniques: Naïve Bayesian Classifier, Support Vector Machine, Decision Tree. Rule vs. model: pros and cons.
  4. Spam Filtering in Practice. Source: Sahil Puri et al., "Comparison and Analysis of Spam Detection Algorithms", 2013, IJAIEM.
  5. Source: Rene, "New insights into Gmail's spam filtering", 2012, emailmarketingtipps.de.
  6. Naïve Bayesian Classifier. Bayes theorem. Naïve? Bayes theorem with a strong independence assumption. The classifier ignores the evidence term and only compares posteriors: posterior1 > posterior2 or posterior1 < posterior2.
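  7. Example. 1. Probability that box A is chosen: P( A ) = 7 / 10. 2. Probability of drawing a white ball from box A: P( white | A ) = 2 / 10. 3. Probability of picking A from the bag and then drawing a white ball from box A. 4. Probability of a white ball.
  8. Example. A white ball has turned up from somewhere. Probability that it came from A? P( A | white ). Probability that it came from B? P( B | white ). P( A | white ) = ?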
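The question on slides 7 and 8 can be worked through numerically. The sketch below uses the two values stated on slide 7, P( A ) = 7 / 10 and P( white | A ) = 2 / 10. P( B ) = 3 / 10 assumes that A and B are the only boxes, and P( white | B ) = 5 / 10 is a made-up value used purely for illustration, since the slide figures are not available.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Return P(box | white) for every box via Bayes' rule."""
    # Joint probability P(box, white) = P(box) * P(white | box)
    joint = {box: prior[box] * likelihood[box] for box in prior}
    # Evidence P(white) is the sum of the joints over all boxes
    evidence = sum(joint.values())
    return {box: joint[box] / evidence for box in joint}

# P(A) is from slide 7; P(B) assumes only two boxes exist
prior = {"A": Fraction(7, 10), "B": Fraction(3, 10)}
# P(white | A) is from slide 7; P(white | B) is a hypothetical value
likelihood = {"A": Fraction(2, 10), "B": Fraction(5, 10)}

result = posterior(prior, likelihood)
print(result["A"], result["B"])  # prints: 14/29 15/29
```

Under these assumed numbers the white ball is slightly more likely to have come from B, even though A is chosen more often.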
  9. Bayes Rule. ❶ Conditional prob. of A given B. ❷ Conditional prob. of B given A. ❸ Bayes rule.
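The three steps labelled on slide 9 presumably follow the standard derivation of Bayes' rule. It is written out here as a reconstruction, because the slide's formula images did not survive extraction; the event names A and B follow the slide's labels.

```latex
\begin{align}
P(A \mid B) &= \frac{P(A \cap B)}{P(B)} \\
P(B \mid A) &= \frac{P(A \cap B)}{P(A)} \\
P(A \mid B) &= \frac{P(B \mid A)\, P(A)}{P(B)}
\end{align}
```

Solving the second line for P(A ∩ B) and substituting it into the first gives the third.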
  10. Implementation: Preparation. Document representation: extracting words from the document.
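A minimal sketch of word extraction, assuming a document is a plain string. The name `getwords` is taken from the comment on slide 11; the splitting rule and the length bounds are assumptions, not something the slides specify.

```python
import re

def getwords(doc):
    """Split text into lowercase words and return the unique ones as features."""
    # Split on runs of non-word characters
    tokens = re.split(r"\W+", doc)
    # Keep words of moderate length; the 3..19 character bounds are an assumption
    words = [t.lower() for t in tokens if 2 < len(t) < 20]
    # A dict marks presence only, so each word counts once per document
    return {w: 1 for w in words}

features = getwords("Python is good, and the Python docs are good")
print(sorted(features))  # prints: ['and', 'are', 'docs', 'good', 'python', 'the']
```

Recording presence rather than frequency keeps a word that is repeated many times in one document from dominating the counts.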
  11. Implementation: Preparation. Representation of the classifier: {'python': {'bad': 0, 'good': 6}, 'the': {'bad': 3, 'good': 3}} # getwords
  12. Implementation: Preparation. How to access the dict.
  13. Implementation: Preparation. Training.
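The nested dict on slide 11 maps feature -> category -> count. The sketch below shows one way training could fill such a structure. The class name `Classifier`, the attribute names `fc` and `cc`, and the sample documents are all assumptions made for illustration.

```python
import re
from collections import defaultdict

def getwords(doc):
    # Unique lowercase words of 3..19 characters (bounds are an assumption)
    return {w.lower() for w in re.split(r"\W+", doc) if 2 < len(w) < 20}

class Classifier:
    def __init__(self, getfeatures):
        # feature -> category -> number of documents containing the feature
        self.fc = defaultdict(lambda: defaultdict(int))
        # category -> number of documents seen in that category
        self.cc = defaultdict(int)
        self.getfeatures = getfeatures

    def train(self, item, cat):
        """Count every feature of one labelled document."""
        for f in self.getfeatures(item):
            self.fc[f][cat] += 1
        self.cc[cat] += 1

    def fcount(self, f, cat):
        """How often feature f appeared in category cat (0 if never)."""
        return self.fc.get(f, {}).get(cat, 0)

cl = Classifier(getwords)
# Hypothetical training documents
cl.train("Nobody owns the water.", "good")
cl.train("the quick rabbit jumps fences", "good")
cl.train("buy pharmaceuticals now", "bad")
cl.train("make quick money at the online casino", "bad")
cl.train("the quick brown fox jumps", "good")
print(cl.fcount("quick", "good"), cl.fcount("quick", "bad"))  # prints: 2 1
```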
  14. Implementation: Preparation. Result.
  15. Recall: Bayes theorem. p( category | doc ) = p( doc | category ) * p( category ) / p( doc )
  16. Implementation: Classifier. P( feature | category ) as prior.
  17. Implementation: Classifier. Assumed probability to resolve data sparseness.
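One common way to apply an assumed probability is a weighted average between that assumed value and the observed ratio. The sketch below follows that idea; the function names, the weight of 1.0, and the assumed probability of 0.5 are assumptions, since the slide text does not give the formula.

```python
def fprob(fc, cc, feature, cat):
    """Basic estimate of P(feature | category) from raw counts."""
    if cc.get(cat, 0) == 0:
        return 0.0
    return fc.get(feature, {}).get(cat, 0) / cc[cat]

def weightedprob(fc, cc, feature, cat, weight=1.0, ap=0.5):
    """Pull the basic estimate toward the assumed probability `ap`.

    The fewer times a feature has been seen overall, the more the
    result leans on `ap` instead of the sparse observed ratio.
    """
    basic = fprob(fc, cc, feature, cat)
    # Total appearances of this feature across all categories
    totals = sum(fc.get(feature, {}).values())
    return (weight * ap + totals * basic) / (weight + totals)

# Hypothetical counts: feature -> category -> count, category -> documents
fc = {"money": {"good": 0, "bad": 1}, "quick": {"good": 2, "bad": 1}}
cc = {"good": 3, "bad": 2}

print(fprob(fc, cc, "money", "good"))         # basic estimate is 0.0
print(weightedprob(fc, cc, "money", "good"))  # smoothed estimate is 0.25
```

Without smoothing, a word seen once in 'bad' would make P( word | good ) exactly zero and wipe out the whole product for any document containing it.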
  18. Implementation: Classifier. Results.
  19. Implementation: Classifier. P( document | category ) as likelihood.
  20. Implementation: Classifier. P( document | category ) * p( category ).
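A sketch of the two quantities named on slides 19 and 20, assuming the naïve independence assumption from slide 6: the likelihood is the product of per-feature probabilities, and the score multiplies it by the category prior. The function names, the sample counts, and the smoothing constants (weight 1.0, assumed probability 0.5) are assumptions.

```python
def featprob(fc, cc, feature, cat, weight=1.0, ap=0.5):
    """Smoothed P(feature | category); weight and ap are assumed constants."""
    counts = fc.get(feature, {})
    total = sum(counts.values())
    basic = counts.get(cat, 0) / cc[cat] if cc.get(cat) else 0.0
    return (weight * ap + total * basic) / (weight + total)

def docprob(fc, cc, features, cat):
    """P(document | category) as a product over independent features."""
    p = 1.0
    for f in features:
        p *= featprob(fc, cc, f, cat)
    return p

def score(fc, cc, features, cat):
    """P(document | category) * P(category); the evidence p(doc) is omitted."""
    prior = cc[cat] / sum(cc.values())
    return docprob(fc, cc, features, cat) * prior

# Hypothetical counts: feature -> category -> count, category -> documents
fc = {"quick": {"good": 2, "bad": 1}, "rabbit": {"good": 1, "bad": 0}}
cc = {"good": 3, "bad": 2}

doc = ["quick", "rabbit"]
print(score(fc, cc, doc, "good"), score(fc, cc, doc, "bad"))
```

The result is not a true posterior because p( doc ) is left out, but that term is the same for every category, so the scores can still be compared.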
  21. Implementation: Classifier. Classifying.
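Classification then reduces to picking the category with the largest score, in line with the posterior comparison on slide 6. The sketch below takes the scoring function as a parameter so that it stays independent of how the scores are computed; the function name `classify`, the `default` argument, and the score table are hypothetical.

```python
def classify(features, categories, score, default=None):
    """Return the category whose score for `features` is highest.

    `score(features, cat)` is any callable that returns a value
    proportional to p(category | document). `default` is returned
    when no category scores above zero.
    """
    best, best_score = default, 0.0
    for cat in categories:
        s = score(features, cat)
        if s > best_score:
            best, best_score = cat, s
    return best

# Hypothetical precomputed scores standing in for a trained model
table = {"good": 0.15625, "bad": 0.05}
label = classify(["quick", "rabbit"], table, lambda feats, cat: table[cat])
print(label)  # prints: good
```

A practical spam filter would often also require the winning score to beat the runner-up by some margin before labelling a message as spam, which is not shown here.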
  22. Implementation: Classifier. Result.
  23. Fisher's Method. Recall the Naïve Bayesian Classifier: first, p( document | category ) = p( feature_1 | category ) * p( feature_2 | category ) * … * p( feature_N | category ). Fisher's method: p( category | document ) ?? p( category | feature ) = (# of documents having the feature in the category) / (# of documents having the feature)
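The ratio on slide 23 can be computed directly from the same feature counts. This sketch covers only that per-feature probability, not the step that combines the features into a document-level result; the function name `cprob` and the sample counts are assumptions.

```python
def cprob(fc, feature, cat):
    """P(category | feature): share of documents with the feature that are in cat."""
    counts = fc.get(feature, {})
    # Number of documents, across all categories, that contain the feature
    total = sum(counts.values())
    if total == 0:
        return 0.0  # feature never seen, so there is no evidence either way
    return counts.get(cat, 0) / total

# Hypothetical counts: feature -> category -> number of documents
fc = {"quick": {"good": 2, "bad": 1}, "money": {"good": 0, "bad": 1}}

print(cprob(fc, "quick", "good"))  # 2 of the 3 documents containing 'quick' are good
print(cprob(fc, "money", "bad"))   # prints: 1.0
```

Unlike P( feature | category ), this quantity is conditioned on the feature, so it reads directly as how strongly one word points to one category.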
  24. Q&A. Thank you.
