Document Classification
In PHP


        @ianbarber - ian@ibuildings.com
                        http://phpir.com
Document Classification


Defining The Task
Document Pre-processing
Term Selection
Algorithms
What is
Document Classification?
Uses



 Ian Barber / @ianbarber / ian@ibuildings.com
 Filter          Organise           Metadata
Filtering - Binary Classification
Organising - Single Label Classification
Metadata - Multiple Label Classification
Manual - Rules Written by Domain Experts
Machine Learning - Automatically Extract Rules
Classes

Training Documents     Test Documents
Evaluation

                  actual spam      actual ham
predicted spam    true positive    false positive
predicted ham     false negative   true negative
Measures

$accuracy    = ($tp + $tn) / ($tp + $tn + $fp + $fn);

$precision   = $tp / ($tp + $fp);

$recall      = $tp / ($tp + $fn);
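The three measures can be wrapped in one helper. This is a minimal sketch, not code from the talk: the function name `evaluate()` and the sample counts are made up for illustration, and the zero-denominator guards are an added assumption.

```php
<?php
// Hypothetical helper: derive the three measures from confusion-matrix counts.
function evaluate(int $tp, int $tn, int $fp, int $fn): array {
    $total = $tp + $tn + $fp + $fn;
    return [
        'accuracy'  => $total > 0 ? ($tp + $tn) / $total : 0.0,
        // Guard the denominators: with no positive predictions (or no
        // actual positives) the ratio is undefined, so report 0.0 instead.
        'precision' => ($tp + $fp) > 0 ? $tp / ($tp + $fp) : 0.0,
        'recall'    => ($tp + $fn) > 0 ? $tp / ($tp + $fn) : 0.0,
    ];
}

// Illustrative counts only: 40 spam caught, 50 ham passed,
// 5 ham wrongly flagged as spam, 5 spam missed.
$scores = evaluate(40, 50, 5, 5);
```

With these counts accuracy is 90/100, while precision and recall are both 40/45, which shows how accuracy alone can hide how the errors are distributed.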
Vector Space Model - Bag Of Words

Extract Tokens

$doc   = strtolower(strip_tags($doc));

// \w+ matches each run of letters, digits and underscores
$regex = '/\w+/';
preg_match_all($regex, $doc, $matches);

$words = $matches[0];
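The token extraction step leads directly to bag-of-words vectors. The sketch below is one possible way to do it, assuming PHP 7.0 or later; `tokenise()` and `bagOfWords()` are hypothetical names, and the vectors hold term counts rather than the 0/1 presence flags used in the slide's table.

```php
<?php
// Hypothetical helper wrapping the token extraction shown on the slide.
function tokenise(string $doc): array {
    $doc = strtolower(strip_tags($doc));
    preg_match_all('/\w+/', $doc, $matches);
    return $matches[0];
}

// Build one term => count vector per document over a shared vocabulary.
function bagOfWords(array $docs): array {
    $counts = [];
    $vocab  = [];
    foreach ($docs as $name => $text) {
        $counts[$name] = array_count_values(tokenise($text));
        $vocab += $counts[$name];   // array union: keeps first-seen term order
    }
    $vectors = [];
    foreach ($counts as $name => $termCounts) {
        foreach (array_keys($vocab) as $term) {
            $vectors[$name][$term] = $termCounts[$term] ?? 0;
        }
    }
    return $vectors;
}

$vectors = bagOfWords([
    'A' => 'I really like eggs',
    'B' => '<p>I like cabbage, and like stew</p>',
]);
```

Every vector shares the same keys in the same order, so documents can be compared position by position. Note that `\w+` splits on apostrophes, so a word such as "don't" would become two tokens unless the pattern is widened.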
A: I really like eggs
B: I donʼt like cabbage, and donʼt like stew

     i   really   like   eggs   cabbage   and   donʼt   stew
A    1     1       1      1        0       0      0      0
B    1     0       1      0        1       1      1      1
[Figure: documents plotted in term space, with "really" on the x-axis (0 to 2.00) and "i" on the y-axis (-1.00 to 2.00).]
Term Weighting

$tf    = $termCount;

$idf   = log($totalDocs / $docsWithTerm, 2);

$tfidf = $tf * $idf;
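The weighting formula can be applied to a small collection as follows. This is a sketch under stated assumptions: `tfidf()` is a hypothetical helper, the input is already tokenised into term counts, and "don't" is treated as a single token as in the slides.

```php
<?php
// Hypothetical helper: weight raw term counts by inverse document
// frequency, using log base 2 as on the slide.
function tfidf(array $docs): array {
    $totalDocs    = count($docs);
    $docsWithTerm = [];
    foreach ($docs as $terms) {
        foreach (array_keys($terms) as $term) {
            $docsWithTerm[$term] = ($docsWithTerm[$term] ?? 0) + 1;
        }
    }
    $weighted = [];
    foreach ($docs as $name => $terms) {
        foreach ($terms as $term => $termCount) {
            $idf = log($totalDocs / $docsWithTerm[$term], 2);
            $weighted[$name][$term] = $termCount * $idf;
        }
    }
    return $weighted;
}

// Term counts for the three example sentences used in the slides.
$weights = tfidf([
    'A' => ['i' => 1, 'really' => 1, 'like' => 1, 'eggs' => 1],
    'B' => ['i' => 1, "don't" => 2, 'like' => 2, 'cabbage' => 1, 'and' => 1, 'stew' => 1],
    'C' => ['i' => 1, 'really' => 2, 'like' => 1, 'stew' => 1],
]);
```

Terms that occur in every document ("i", "like") get a weight of 0, while a term unique to one document ("eggs") gets log2(3), about 1.58.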
A: I really like eggs
B: I donʼt like cabbage, and donʼt like stew
C: I really, really like stew

     i   really   like   eggs   cabbage   and    donʼt   stew
A    0   0.58     0      1.58   0         0      0       0
B    0   0        0      0      1.58      1.58   3.16    0.58
C    0   1.17     0      0      0         0      0       0.58
A: I really like eggs
B: I donʼt like cabbage, and donʼt like stew
C: I really, really like stew

     i   really   like   eggs   cabbage   and    donʼt   stew
A    0   0.35     0      0.94   0         0      0       0
B    0   0        0      0      0.31      0.31   0.63    0.11
C    0   0.89     0      0      0         0      0       0.44
Dimensionality Reduction
Stop Words
Stemming

happening - happen
happens - happen
happened - happen
Chi-Square

            spam   ham
term         $a    $b
not term     $c    $d
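Chi-Square 1DF

$a = $termSpam; $b = $termHam;
$c = $restSpam; $d = $restHam;

$total = $a + $b + $c + $d;
$diff = ($a * $d) - ($c * $b);

$chisquare = ($total * pow($diff, 2))
    / (($a + $c) * ($b + $d) * ($a + $b) * ($c + $d));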
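A term-selection step could use the statistic like this. The sketch assumes a hypothetical `chiSquare()` wrapper and made-up document counts; the 3.84 cut-off is the standard critical value for p = 0.05 at one degree of freedom.

```php
<?php
// Hypothetical wrapper around the 2x2 chi-square calculation.
function chiSquare(int $termSpam, int $termHam, int $restSpam, int $restHam): float {
    $a = $termSpam; $b = $termHam;
    $c = $restSpam; $d = $restHam;
    $total = $a + $b + $c + $d;
    $diff  = ($a * $d) - ($c * $b);
    $denominator = ($a + $c) * ($b + $d) * ($a + $b) * ($c + $d);
    // An empty row or column carries no evidence either way.
    return $denominator > 0 ? ($total * pow($diff, 2)) / $denominator : 0.0;
}

// Made-up counts: the term appears in 30 of 50 spam documents
// and in 10 of 50 ham documents.
$score = chiSquare(30, 10, 20, 40);

// Keep the term only if the association is significant at p = 0.05.
$keepTerm = $score > 3.84;
```

Here the score is 100/6, roughly 16.7, well above the threshold, so the term would be kept as a feature. A term spread evenly across both classes scores 0 and would be dropped.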
p - Value

p        chi2
0.1      2.71
0.05     3.84
0.01     6.63
0.005    7.88
0.001   10.83
Decision Tree - ID3

[Figure: a binary decision tree whose internal nodes are questions (?) and whose leaves are marked ✔ or ✖.]
Entropy

$entropy =
   -( ($spam/$total)
       * log($spam/$total, 2))
   -( ($ham/$total)
       * log($ham/$total, 2));
[Figure: entropy (y-axis, 0 to 1.00) plotted against spam/total (x-axis, 0 to 1.0).]
Information Gain

$gain =
    $baseEntropy
    - (($withCount / $total) * $withEntropy)
    - (($woutCount / $total) * $woutEntropy);
          Split    Entropy   Proportion   E*P

Base      50/50    1         1            1
With      20/5     0.722     0.25         0.1805
Without   30/45    0.97      0.75         0.7275

1 - With - Without = 0.092
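The worked split can be checked in code. This sketch assumes hypothetical `entropy()` and `informationGain()` helpers that take spam/ham counts, and it treats 0 * log(0) as 0 so pure nodes do not break the calculation.

```php
<?php
// Entropy of a spam/ham split, in bits.
function entropy(int $spam, int $ham): float {
    $total = $spam + $ham;
    $sum   = 0.0;
    foreach ([$spam, $ham] as $count) {
        if ($count > 0) {                 // treat 0 * log(0) as 0
            $p = $count / $total;
            $sum -= $p * log($p, 2);
        }
    }
    return $sum;
}

// Gain from splitting on a term; each argument is [spamCount, hamCount].
function informationGain(array $with, array $without): float {
    [$withSpam, $withHam] = $with;
    [$woutSpam, $woutHam] = $without;
    $total = $withSpam + $withHam + $woutSpam + $woutHam;
    $base  = entropy($withSpam + $woutSpam, $withHam + $woutHam);
    return $base
        - (($withSpam + $withHam) / $total) * entropy($withSpam, $withHam)
        - (($woutSpam + $woutHam) / $total) * entropy($woutSpam, $woutHam);
}

// 20 spam / 5 ham contain the term; 30 spam / 45 ham do not.
$gain = informationGain([20, 5], [30, 45]);
```

Computed without intermediate rounding the gain is about 0.091; the table's 0.092 comes from rounding the "Without" entropy to 0.97 first.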
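function build($tree) {
  if(!$tree->count('spam')) {
    $tree->setLeaf('ham');
  } else if(!$tree->count('ham')) {
    $tree->setLeaf('spam');
  } else {
    $term = $tree->findMaxGain();
    $tree->addSubtree($term,
      build($tree->getWith()),
      build($tree->getWout()));
  }
  return $tree;
}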
[Figure: the resulting tree, splitting on one term at each internal node, with leaves marked ✔ or ✖.]
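Classification

function classify($doc, $tree) {
  if($tree->isLeaf()) {
    return $tree->class;
  }
  $term = $tree->getSplitTerm();
  if(in_array($term, $doc)) {
    return classify($doc, $tree->getWith());
  } else {
    return classify($doc, $tree->getWout());
  }
}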
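The slides rely on a tree class whose methods are not shown, so the following is a self-contained sketch of the same ID3 idea using plain arrays instead. The data layout, the function names and the tiny training set are all assumptions for illustration; it needs PHP 7.4 or later for arrow functions.

```php
<?php
// Each document is ['class' => 'spam'|'ham', 'terms' => [tokens]].

function entropy(array $docs): float {
    $total = count($docs);
    $sum   = 0.0;
    foreach (array_count_values(array_column($docs, 'class')) as $count) {
        $p = $count / $total;
        $sum -= $p * log($p, 2);
    }
    return $sum;
}

function build(array $docs): array {
    $classes = array_count_values(array_column($docs, 'class'));
    arsort($classes);
    $majority = array_key_first($classes);
    if (count($classes) === 1) {
        return ['leaf' => $majority];              // pure node
    }
    $best = null;
    $bestGain = 0.0;
    $terms = array_unique(array_merge(...array_column($docs, 'terms')));
    foreach ($terms as $term) {
        $with = array_values(array_filter($docs, fn($d) => in_array($term, $d['terms'])));
        $wout = array_values(array_filter($docs, fn($d) => !in_array($term, $d['terms'])));
        if (!$with || !$wout) {
            continue;                              // term does not split the set
        }
        $gain = entropy($docs)
            - count($with) / count($docs) * entropy($with)
            - count($wout) / count($docs) * entropy($wout);
        if ($gain > $bestGain) {
            $bestGain = $gain;
            $best = [$term, $with, $wout];
        }
    }
    if ($best === null) {
        return ['leaf' => $majority];              // stop: no split improves purity
    }
    return ['term' => $best[0], 'with' => build($best[1]), 'wout' => build($best[2])];
}

function classify(array $terms, array $tree): string {
    while (!isset($tree['leaf'])) {
        $tree = in_array($tree['term'], $terms) ? $tree['with'] : $tree['wout'];
    }
    return $tree['leaf'];
}

$training = [
    ['class' => 'spam', 'terms' => ['register', 'now', 'free']],
    ['class' => 'spam', 'terms' => ['free', 'offer']],
    ['class' => 'ham',  'terms' => ['meeting', 'sent', 'notes']],
    ['class' => 'ham',  'terms' => ['sent', 'register', 'form']],
];
$tree  = build($training);
$label = classify(['free', 'tickets'], $tree);
```

Falling back to the majority class when no term yields a positive gain acts as a simple stop condition; a fuller version might also cap the depth or prune afterwards.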
Overfitting: Pruning or Stop Conditions
K Nearest Neighbour

[Figure: three scatter plots of documents on Term X and Term Y axes; the first labels a Spam cluster and a Ham cluster.]
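Cosine Similarity

// Dot product of two sparse term => weight vectors. This equals the
// cosine similarity when both vectors are normalised to unit length.
$similarity = 0;
foreach($doca as $term => $tfidf) {
  if(isset($docb[$term])) {
    $similarity +=
      floatval($tfidf) *
      floatval($docb[$term]);
  }
}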
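Putting the similarity measure to work, a nearest-neighbour vote might look like the sketch below. It is not the talk's implementation: `norm()`, `cosine()` and `knn()` are hypothetical names, the vectors and weights are invented, and the code divides by the vector lengths so the inputs need not be normalised beforehand. It assumes PHP 7.4 or later.

```php
<?php
// Euclidean length of a sparse term => weight vector.
function norm(array $vec): float {
    return sqrt(array_sum(array_map(fn($w) => $w * $w, $vec)));
}

// Cosine similarity between two sparse vectors; 0.0 if either is empty.
function cosine(array $doca, array $docb): float {
    $dot = 0.0;
    foreach ($doca as $term => $weight) {
        $dot += $weight * ($docb[$term] ?? 0.0);
    }
    $lengths = norm($doca) * norm($docb);
    return $lengths > 0 ? $dot / $lengths : 0.0;
}

// Majority vote among the $k training documents most similar to $doc.
function knn(array $doc, array $training, int $k): string {
    usort(
        $training,
        fn($x, $y) => cosine($doc, $y['vector']) <=> cosine($doc, $x['vector'])
    );
    $nearest = array_slice($training, 0, $k);
    $votes   = array_count_values(array_column($nearest, 'class'));
    arsort($votes);
    return array_key_first($votes);
}

$training = [
    ['class' => 'spam', 'vector' => ['register' => 1.2, 'free' => 0.9]],
    ['class' => 'spam', 'vector' => ['free' => 1.1, 'offer' => 0.7]],
    ['class' => 'ham',  'vector' => ['sent' => 1.0, 'notes' => 0.8]],
];
$class = knn(['free' => 1.0, 'register' => 0.5], $training, 2);
```

Comparing against every training document is slow on large collections, which is why the following slides hand the nearest-neighbour search to a search index.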
Zend_Search_Lucene
$index = Zend_Search_Lucene::create($db);
$doc = new Zend_Search_Lucene_Document();

$doc->addField(
  ...
Zend_Search_Lucene::setResultSetLimit(25);

$analyser =
  Zend_Search_Lucene_Analysis_Analyzer::getDefault();
$tokens = $a...
$q = new Zend_Search_Lucene_Search_Query_MultiTerm();

$tc = 0;
foreach($filtered as $t => $tf) {
  $q->addTerm(
    new Z...
Flax/Xapian Search Service
http://www.flax.co.uk
$flax = new FlaxSearchService('ip:8080');

$db = $flax->createDatabase('test');
$db->addField('class', array(
  'store'   ...
$db->addDocument(
        array('contents' => $doc), 'foo');
$db->commit();

$results = $db->searchSimilar('foo',0,25);
$d...
[Figure: scatter plot of Spam and Ham document clusters on Term X and Term Y axes.]
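Prototypes For Rocchio

$mul = 1 / count($classDocs);

$prototype = array();
foreach($classDocs as $doc) {
  foreach($doc as $tid => $tfidf) {
    if(!isset($prototype[$tid])) {
      $prototype[$tid] = 0;
    }
    $prototype[$tid] += $mul * $tfidf;
  }
}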
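Once each class has a prototype, a new document can be assigned to the closest one. The sketch below assumes cosine similarity as the comparison measure and uses hypothetical function names and invented tf-idf weights; the slides only show the averaging step.

```php
<?php
// Average the tf-idf vectors of one class into a prototype (centroid).
function prototype(array $classDocs): array {
    $mul = 1 / count($classDocs);
    $prototype = [];
    foreach ($classDocs as $doc) {
        foreach ($doc as $tid => $tfidf) {
            $prototype[$tid] = ($prototype[$tid] ?? 0.0) + $mul * $tfidf;
        }
    }
    return $prototype;
}

// Cosine similarity between sparse vectors (the assumed comparison measure).
function similarity(array $a, array $b): float {
    $dot = 0.0;
    foreach ($a as $term => $weight) {
        $dot += $weight * ($b[$term] ?? 0.0);
    }
    $lenA = sqrt(array_sum(array_map(function ($w) { return $w * $w; }, $a)));
    $lenB = sqrt(array_sum(array_map(function ($w) { return $w * $w; }, $b)));
    return ($lenA * $lenB) > 0 ? $dot / ($lenA * $lenB) : 0.0;
}

// Pick the class whose prototype is most similar to the document.
function classify(array $doc, array $prototypes): string {
    $best = '';
    $bestScore = -1.0;
    foreach ($prototypes as $class => $proto) {
        $score = similarity($doc, $proto);
        if ($score > $bestScore) {
            $best = $class;
            $bestScore = $score;
        }
    }
    return $best;
}

$prototypes = [
    'spam' => prototype([['free' => 1.0, 'offer' => 2.0], ['free' => 3.0]]),
    'ham'  => prototype([['sent' => 2.0], ['sent' => 1.0, 'notes' => 1.0]]),
];
$class = classify(['free' => 1.0], $prototypes);
```

Classification then costs one comparison per class rather than one per training document, at the price of assuming each class forms a single cluster.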
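Naive Bayes - Probability Based Classifier

Bayes Theorem

  Pr(Class | Doc) = Pr(Doc | Class) * Pr(Class) / Pr(Doc)

Pr(Doc) is the same for every class, so it can be dropped when ranking classes:

  Pr(Class | Doc) ∝ Pr(Doc | Class) * Pr(Class)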
Likelihood Of Term Occurring Given Class

word       spam freq   pr(word|spam)   ham freq   pr(word|ham)
register   1757        0.11            246        0.02
sent       487         0.03            4600       0.36
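Estimating Likelihood

$this->db->query("
   INSERT INTO class_terms
       (class, term, likelihood)
   SELECT d.class, d.term, count(*) / " . $classCount . "
   FROM documents AS d
   JOIN document_terms AS dt USING (did)
   WHERE d.class = '" . $class . "'
   GROUP BY d.class, d.term"
);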
Classifying A Document

foreach($classes as $class) {
  $prob[$class] = 0.5; // assume prior

  foreach($document as $term) {
    $prob[$class] *= $likely[$term][$class];
  }
}
arsort($prob);
$class = key($prob);
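Multiplying many small likelihoods can underflow on long documents, and a term missing from the likelihood table would break the product. The sketch below is a variant, not the talk's code: it sums logarithms instead and substitutes a small floor value for unseen terms. The function signature and the `$floor` default are assumptions; the likelihoods come from the table above. It needs PHP 7.3 or later for `array_key_first()`.

```php
<?php
// Hypothetical classifier: sum log-probabilities instead of
// multiplying raw probabilities.
function classify(array $document, array $likely, array $priors, float $floor = 1e-6): string {
    $score = [];
    foreach ($priors as $class => $prior) {
        $score[$class] = log($prior);
        foreach ($document as $term) {
            // Unseen terms get a small floor (an assumption) so that one
            // missing word cannot rule a class out entirely.
            $score[$class] += log($likely[$term][$class] ?? $floor);
        }
    }
    arsort($score);
    return array_key_first($score);
}

// Likelihoods from the table; equal priors, as in the slide's code.
$likely = [
    'register' => ['spam' => 0.11, 'ham' => 0.02],
    'sent'     => ['spam' => 0.03, 'ham' => 0.36],
];
$priors = ['spam' => 0.5, 'ham' => 0.5];

$first  = classify(['register'], $likely, $priors);
$second = classify(['sent', 'sent', 'register'], $likely, $priors);
```

Because the logarithm is monotonic, the ranking of classes is the same as with the raw product, so the highest score still identifies the most probable class.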
Document Classification


Defining The Problem
Document Processing
Term Selection
Algorithm
Image Credits
Title          http://www.flickr.com/photos/themacinator/3499579760/
What is...     http://www.flickr.com/phot...
Questions?



       @ianbarber - ian@ibuildings.com
                       http://phpir.com

Document Classification In PHP - Slight Return


A shortened but updated version of my document classification talk given in Feb 2010 at Sogeti Engineering World

Published in: Technology, News & Politics


  1. Document Classification In PHP - @ianbarber - ian@ibuildings.com - http://phpir.com
  2. Document Classification: Defining The Task, Document Pre-processing, Term Selection, Algorithms
  3. What is Document Classification?
  4. Uses: Filter, Organise, Metadata - Ian Barber / @ianbarber / ian@ibuildings.com
  5. Filtering - Binary Classification
  6. Organising - Single Label Classification
  7. Metadata - Multiple Label Classification
  8. Manual - Rules Written by Domain Experts
  9. Machine Learning - Automatically Extract Rules
  10. Classes: Training Documents, Test Documents
  11. Evaluation: predicted spam and actually spam = true positive; predicted spam but actually ham = false positive; predicted ham but actually spam = false negative; predicted ham and actually ham = true negative
  12. Measures: $accuracy = ($tp + $tn) / ($tp + $tn + $fp + $fn); $precision = $tp / ($tp + $fp); $recall = $tp / ($tp + $fn);
  13. Vector Space Model - Bag Of Words
  14. Extract Tokens: $doc = strtolower(strip_tags($doc)); $regex = '/\w+/'; preg_match_all($regex, $doc, $matches); $words = $matches[0];
  15. A: I really like eggs. B: I donʼt like cabbage, and donʼt like stew. Terms: i, really, like, eggs, cabbage, and, donʼt, stew. A = 1 1 1 1 0 0 0 0; B = 1 0 1 0 1 1 1 1
  16. Figure: documents plotted in term space, with "really" on the x-axis and "i" on the y-axis
  17. Term Weighting: $tf = $termCount; $idf = log($totalDocs / $docsWithTerm, 2); $tfidf = $tf * $idf;
  18. A: I really like eggs. B: I donʼt like cabbage, and donʼt like stew. C: I really, really like stew. Terms: i, really, like, eggs, cabbage, and, donʼt, stew. A = 0 0.58 0 1.58 0 0 0 0; B = 0 0 0 0 1.58 1.58 3.16 0.58; C = 0 1.17 0 0 0 0 0 0.58
  19. A: I really like eggs. B: I donʼt like cabbage, and donʼt like stew. C: I really, really like stew. Terms: i, really, like, eggs, cabbage, and, donʼt, stew. A = 0 0.35 0 0.94 0 0 0 0; B = 0 0 0 0 0.31 0.31 0.63 0.11; C = 0 0.89 0 0 0 0 0 0.44
  20. Dimensionality Reduction
  21. Stop Words
  22. Stemming: happening - happen; happens - happen; happened - happen. http://tartarus.org/~martin/PorterStemmer and http://snowball.tartarus.org/algorithms/dutchtml..
  23. Chi-Square: term in spam = $a; term in ham = $b; not term in spam = $c; not term in ham = $d
  24. Chi-Square 1DF: $a = $termSpam; $b = $termHam; $c = $restSpam; $d = $restHam; $total = $a + $b + $c + $d; $diff = ($a * $d) - ($c * $b); $chisquare = ($total * pow($diff, 2)) / (($a+$c) * ($b+$d) * ($a+$b) * ($c+$d));
  25. p - Value (p: chi2): 0.1: 2.71; 0.05: 3.84; 0.01: 6.63; 0.005: 7.88; 0.001: 10.83
  26. Decision Tree - ID3 (figure: a binary tree with question nodes and leaves marked ✔ or ✖)
  27. Entropy: $entropy = -( ($spam/$total) * log($spam/$total, 2)) -( ($ham/$total) * log($ham/$total, 2));
  28. Figure: entropy (0 to 1.00) plotted against spam/total (0 to 1.0)
  29. Information Gain: $gain = $baseEntropy - (($withCount/$total) * $withEntropy) - (($woutCount/$total) * $woutEntropy);
  30. Split, Entropy, Proportion, E*P: Base 50/50, 1, 1, 1; With 20/5, 0.722, 0.25, 0.1805; Without 30/45, 0.97, 0.75, 0.7275; 1 - With - Without = 0.092
  31. function build($tree) { if(!$tree->count('spam')) { $tree->setLeaf('ham'); } else if(!$tree->count('ham')) { $tree->setLeaf('spam'); } else { $term = $tree->findMaxGain(); $tree->addSubtree($term, build($tree->getWith()), build($tree->getWout())); } return $tree; }
  32. Figure: the resulting tree, splitting on one term at each internal node, with leaves marked ✔ or ✖
  33. Classification: function classify($doc, $tree) { if($tree->isLeaf()) { return $tree->class; } $term = $tree->getSplitTerm(); if(in_array($term, $doc)) { return classify($doc, $tree->getWith()); } else { return classify($doc, $tree->getWout()); } }
  34. Overfitting: Pruning or Stop Conditions
  35. K Nearest Neighbour
  36. Figure: Spam and Ham clusters plotted on Term X and Term Y axes
  37. Figure: documents plotted on Term X and Term Y axes
  38. Figure: documents plotted on Term X and Term Y axes
  39. Cosine Similarity: $similarity = 0; foreach($doca as $term => $tfidf) { if(isset($docb[$term])) { $similarity += floatval($tfidf) * floatval($docb[$term]); } }
  40. Zend_Search_Lucene: $index = Zend_Search_Lucene::create($db); $doc = new Zend_Search_Lucene_Document(); $doc->addField( Zend_Search_Lucene_Field::Text( 'class', $class)); $doc->addField( Zend_Search_Lucene_Field::UnStored( 'contents', $content)); $index->addDocument($doc);
  41. Classifying with ZSL: Zend_Search_Lucene::setResultSetLimit(25); $analyser = Zend_Search_Lucene_Analysis_Analyzer::getDefault(); $tokens = $analyser->tokenize($content); foreach($tokens as $key => $token) { $tok = $token->getTermText(); if(strlen($tok) > 4) $filtered[$tok]++; } arsort($filtered);
  42. $q = new Zend_Search_Lucene_Search_Query_MultiTerm(); $tc = 0; foreach($filtered as $t => $tf) { $q->addTerm( new Zend_Search_Lucene_Index_Term($t)); if(++$tc > 49) { break;} } $results = $index->find($q); foreach($results as $result) { $classes[$result->class] += 1; } arsort($classes); $class = key($classes);
  43. Flax/Xapian Search Service - http://www.flax.co.uk
  44. $flax = new FlaxSearchService('ip:8080'); $db = $flax->createDatabase('test'); $db->addField('class', array( 'store' => true, 'exacttext' => true)); $db->addField('contents', array( 'store' => false, 'freetext' => array('language'=>'en'))); $db->commit(); $db->addDocument(array( 'class' => $class, 'contents' => $document)); $db->commit();
  45. $db->addDocument( array('contents' => $doc), 'foo'); $db->commit(); $results = $db->searchSimilar('foo',0,25); $db->deleteDocument('foo'); $db->commit(); foreach($results['results'] as $r) { if($r['docid'] != 'foo') { $classes[$r['data']['class'][0]] += 1; } } arsort($classes); $class = key($classes);
  46. Figure: Spam and Ham clusters plotted on Term X and Term Y axes
  47. Prototypes For Rocchio: $mul = 1 / count($classDocs); $prototype = array(); foreach($classDocs as $doc) { foreach($doc as $tid => $tfidf) { if(!isset($prototype[$tid])) { $prototype[$tid] = 0; } $prototype[$tid] += $mul * $tfidf; } }
  48. Naive Bayes - Probability Based Classifier
  49. Bayes Theorem: Pr(Class | Doc) = Pr(Doc | Class) * Pr(Class) / Pr(Doc); dropping the constant Pr(Doc), Pr(Class | Doc) ∝ Pr(Doc | Class) * Pr(Class)
  50. Likelihood Of Term Occurring Given Class (word: spam freq, pr(word|spam), ham freq, pr(word|ham)): register: 1757, 0.11, 246, 0.02; sent: 487, 0.03, 4600, 0.36
  51. Estimating Likelihood: $this->db->query(" INSERT INTO class_terms (class, term, likelihood) SELECT d.class, d.term, count(*) / " . $classCount . " FROM documents AS d JOIN document_terms AS dt USING (did) WHERE d.class = '" . $class . "' GROUP BY d.class, d.term" );
  52. Classifying A Document: foreach($classes as $class) { $prob[$class] = 0.5; // assume prior foreach($document as $term) { $prob[$class] *= $likely[$term][$class]; } } arsort($prob); $class = key($prob);
  53. Document Classification: Defining The Problem, Document Processing, Term Selection, Algorithm
  54. Image Credits: Title http://www.flickr.com/photos/themacinator/3499579760/; What is... http://www.flickr.com/photos/austinevan/1225274637/; Filter http://www.flickr.com/photos/benimoto/2913950616/; Organise http://www.flickr.com/photos/ellasdad/425813314/; Metadata http://www.flickr.com/photos/banky177/2282734063/; Manual http://www.flickr.com/photos/foundphotoslj/1134150364/; Automatic http://www.flickr.com/photos/29278394@N00/59538978/; Vector Space http://www.flickr.com/photos/ethanhein/2260878305/sizes/o/; Reduction http://www.flickr.com/photos/wili/157220657/sizes/l/; Stemming http://www.flickr.com/photos/clearlyambiguous/20847530/sizes/l/; Stop words http://www.flickr.com/photos/afroswede/22237769/; Chi-Squared http://www.flickr.com/photos/kdkd/2837565850/sizes/o/; ID3 http://www.flickr.com/photos/tonythemisfit/2414239471; Overfitting http://www.flickr.com/photos/akirkley/3222128726/sizes/l/; Bayes http://www.flickr.com/photos/darwinbell/440080655/sizes/l/; Conclusion http://www.flickr.com/photos/mukluk/241256203; Credits http://www.flickr.com/photos/librarianavengers/413762956/
  55. Questions? @ianbarber - ian@ibuildings.com - http://phpir.com
