Document Classification In PHP

Note: a slightly updated version of these slides is available at: http://www.slideshare.net/IanBarber/document-classification-in-php-slight-return

This talk discusses how PHP and open source tools can be used to group and classify data for a whole host of applications, including information retrieval, data mining and more.

Published in: Technology, News & Politics

  1. Document Classification In PHP - @ianbarber - ian@ibuildings.com - http://joind.in/talk/view/587
  2. Document Classification: Defining The Task, Document Pre-processing, Term Selection, Algorithms
  3. What is Document Classification?
  4. Uses: Filter, Organise, Metadata (Ian Barber / @ianbarber / ian@ibuildings.com)
  5. Filtering - Binary Classification
  6. Organising - Single Label Classification
  7. Metadata - Multiple Label Classification
  8. Manual - Rules Written By Domain Experts
  9. Machine Learning - Automatically Extract Rules
  10. Classes, Training Documents, Test Documents
  11. Evaluation
            spam            ham
      spam  true positive   false positive
      ham   false negative  true negative
  12. Measures: $accuracy = ($tp + $tn) / ($tp + $tn + $fp + $fn); $precision = $tp / ($tp + $fp); $recall = $tp / ($tp + $fn);
  13. Fβ Measure: $beta = 0.5; $f = ((1 + pow($beta, 2)) * $precision * $recall) / ((pow($beta, 2) * $precision) + $recall);
  14. Vector Space Model - Bag Of Words
  15. Extract Tokens: $doc = strtolower(strip_tags($doc)); $regex = "/[^a-z0-9'\s]/"; $doc = preg_replace($regex, '', $doc); $words = preg_split('/\s+/', $doc, -1, PREG_SPLIT_NO_EMPTY);
  16. A: I really like eggs
      B: I donʼt like cabbage, and donʼt like stew
         i  really  like  eggs  cabbage  and  donʼt  stew
      A  1  1       1     1     0        0    0      0
      B  1  0       1     0     1        1    1      1
  17. [Plot: axes "i" and "really"]
  18. Term Weighting: $tf = $termCount / $wordCount; $idf = log($totalDocs / $docsWithTerm, 2); $tfidf = $tf * $idf;
  19. A: I really like eggs
      B: I donʼt like cabbage, and donʼt like stew
         i  really  like  eggs  cabbage  and    donʼt  stew
      A  0  0.25    0     0.25  0        0      0      0
      B  0  0       0     0     0.125    0.125  0.25   0.125
  20. A: I really like eggs
      B: I donʼt like cabbage, and donʼt like stew
         i  really  like  eggs  cabbage  and  donʼt  stew
      A  0  0.5     0     0.5   0        0    0      0
      B  0  0       0     0     0.2      0.2  0.4    0.2
  21. Dimensionality Reduction
  22. Stop Words
  23. Stemming: happening - happen, happens - happen, happened - happen. http://tartarus.org/~martin/PorterStemmer
  24. Chi-Square
                spam  ham
      term      $a    $b
      not term  $c    $d
  25. Chi-Square 1DF: $a = $termSpam; $b = $termHam; $c = $restSpam; $d = $restHam; $total = $a + $b + $c + $d; $diff = ($a * $d) - ($c * $b); $chisquare = ($total * pow($diff, 2)) / (($a+$c) * ($b+$d) * ($a+$b) * ($c+$d));
  26. p - Value
      p      chi2
      0.1    2.71
      0.05   3.84
      0.01   6.63
      0.005  7.88
      0.001  10.83
  27. Decision Tree - ID3 [Diagram: ?, ✔ and ✖ nodes]
  28. Entropy: $entropy = -( ($spam/$total) * log($spam/$total, 2)) -( ($ham/$total) * log($ham/$total, 2));
  29. [Plot: entropy (0 to 1.00) against spam/total (0 to 1.0)]
  30. Information Gain: $gain = $baseEntropy - (($withCount/$total) * $withEntropy) - (($woutCount/$total) * $woutEntropy);
  31.          Split  Entropy  Proportion  E*P
      Base     50/50  1        1           1
      With     20/5   0.722    0.25        0.1805
      Without  30/45  0.97     0.75        0.7275
      1 - With - Without = 0.092
  32. function build($tree, $score) { if(!$score[2]) { return 'spam'; } else if(!$score[1]) { return 'ham'; } list($trees, $scores, $term) = getMaxGain($tree); return array($term => array( 0 => build($trees[0],$scores[0]), 1 => build($trees[1],$scores[1]) )); }
  33. array('hello' => array( 0 => array('terry' => array ( 0 => 'spam', 1 => array('everybody' => array( 0 => 'ham', 1 => 'spam' ) ) ) ), 1 => 'spam' ) );
  34. Classification: function classify($doc, $tree) { if(is_string($tree)) { return $tree; } $key = key($tree); if(in_array($key, $doc)) { return classify($doc, $tree[$key][0]); } else { return classify($doc, $tree[$key][1]); } }
  35. Overfitting: Pruning or Stop Conditions
  36. K Nearest Neighbour
  37. [Plot: Spam and Ham documents against axes Term X and Term Y]
  38. [Plot: axes Term X and Term Y]
  39. [Plot: axes Term X and Term Y]
  40. Euclidean Distance: $distance = 0; foreach($doca as $term => $tfidf) { $distance += pow($tfidf - $docb[$term], 2); } $distance = sqrt($distance);
  41. Cosine Similarity: foreach($doca as $term => $tfidf) { $similarity += floatval($tfidf) * floatval($docb[$term]); }
  42. Classifying: /* either count one vote per neighbour */ foreach($scores as $s) { $classes[$s['class']]++; } /* or weight each vote by its similarity */ foreach($scores as $s){ $classes[$s['class']] += $s['sim']; } arsort($classes); $class = key($classes);
  43. Zend_Search_Lucene $index = Zend_Search_Lucene::create($db); $doc = new Zend_Search_Lucene_Document(); $doc->addField( Zend_Search_Lucene_Field::Text( 'class', $class)); $doc->addField( Zend_Search_Lucene_Field::UnStored( 'contents', $content)); $index->addDocument($doc);
  44. Classifying with ZSL: Zend_Search_Lucene::setResultSetLimit(25); $results = $index->find($content); foreach($results as $result) { $classes[$result->class] += 1; } arsort($classes); $class = key($classes);
  45. Flax/Xapian Search Service http://www.flax.co.uk
  46. $flax = new FlaxSearchService('ip:8080'); $db = $flax->createDatabase('test'); $db->addField('class', array( 'store' => true, 'exacttext' => true)); $db->addField('contents', array( 'store' => false, 'freetext' => array('language'=>'en'))); $db->commit(); $db->addDocument(array( 'class' => $class, 'contents' => $document)); $db->commit();
  47. $db->addDocument( array('contents' => $doc), 'foo'); $db->commit(); $results = $db->searchSimilar('foo',0,25); $db->deleteDocument('foo'); $db->commit(); foreach($results['results'] as $r) { if($r['docid'] != 'foo') { $classes[$r['data']['class'][0]] += 1; } } arsort($classes); $class = key($classes);
  48. [Plot: Spam and Ham documents against axes Term X and Term Y]
  49. Prototypes For Rocchio $mul = 1 / $docsInClassCount; foreach($classDocs as $tid => $tfidf) { $prototype[$tid] += $mul * $tfidf; }
  50. Naive Bayes - Probability Based Classifier
  51. Bayes Theorem: Pr(Class|Doc) = Pr(Doc|Class) * Pr(Class) / Pr(Doc). Since Pr(Doc) is the same for every class, it can be dropped when comparing classes: Pr(Class|Doc) ∝ Pr(Doc|Class) * Pr(Class)
  52. Likelihood Of Term Occurring Given Class
      word      spam freq  pr(word|spam)  ham freq  pr(word|ham)
      register  1757       0.11           246       0.02
      sent      487        0.03           4600      0.36
  53. Estimating Likelihood: $this->db->query(" INSERT INTO class_terms (class, term, likelihood) SELECT d.class, d.term, count(*) / " . $classCount . " FROM documents AS d JOIN document_terms AS dt USING (did) WHERE d.class = '" . $class . "'" );
  54. Classifying A Document: foreach($classes as $class) { $prob[$class] = 0.5; /* assume prior */ foreach($document as $term) { $prob[$class] *= $likely[$term][$class]; } } arsort($prob); $class = key($prob);
  55. Document Classification: Defining The Problem, Document Processing, Term Selection, Algorithm
  56. Image Credits
      Title: http://www.flickr.com/photos/themacinator/3499579760/
      What is...: http://www.flickr.com/photos/austinevan/1225274637/
      Filter: http://www.flickr.com/photos/benimoto/2913950616/
      Organise: http://www.flickr.com/photos/ellasdad/425813314/
      Metadata: http://www.flickr.com/photos/banky177/2282734063/
      Manual: http://www.flickr.com/photos/foundphotoslj/1134150364/
      Automatic: http://www.flickr.com/photos/29278394@N00/59538978/
      Vector Space: http://www.flickr.com/photos/ethanhein/2260878305/sizes/o/
      Reduction: http://www.flickr.com/photos/wili/157220657/sizes/l/
      Stemming: http://www.flickr.com/photos/clearlyambiguous/20847530/sizes/l/
      Stop words: http://www.flickr.com/photos/afroswede/22237769/
      Chi-Squared: http://www.flickr.com/photos/kdkd/2837565850/sizes/o/
      ID3: http://www.flickr.com/photos/tonythemisfit/2414239471
      Overfitting: http://www.flickr.com/photos/akirkley/3222128726/sizes/l/
      Bayes: http://www.flickr.com/photos/darwinbell/440080655/sizes/l/
      Conclusion: http://www.flickr.com/photos/mukluk/241256203
      Credits: http://www.flickr.com/photos/librarianavengers/413762956/
  57. Questions? @ianbarber - ian@ibuildings.com - http://joind.in/talk/view/587
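
Slides 15, 18 and 19 describe token extraction and TF-IDF weighting in fragments. The sketch below is one way they might fit together, run on the two example sentences from the slides. The function names `tokenize` and `tfidf` are hypothetical helpers introduced here for illustration, not part of the talk, and the plain ASCII apostrophe in "don't" is an assumption about the original text.

```php
<?php
// Hypothetical helper: lower-case, drop punctuation, split on whitespace.
function tokenize(string $doc): array
{
    $doc = strtolower(strip_tags($doc));
    $doc = preg_replace("/[^a-z0-9'\\s]/", '', $doc);
    return preg_split('/\\s+/', $doc, -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: TF-IDF weight of every term in every document.
function tfidf(array $docs): array
{
    $counts = [];        // per-document term counts
    $docsWithTerm = [];  // number of documents containing each term
    foreach ($docs as $id => $text) {
        $counts[$id] = array_count_values(tokenize($text));
        foreach (array_keys($counts[$id]) as $term) {
            $docsWithTerm[$term] = ($docsWithTerm[$term] ?? 0) + 1;
        }
    }

    $totalDocs = count($docs);
    $weights = [];
    foreach ($counts as $id => $termCounts) {
        $wordCount = array_sum($termCounts);
        foreach ($termCounts as $term => $termCount) {
            // Same formulas as slide 18.
            $tf = $termCount / $wordCount;
            $idf = log($totalDocs / $docsWithTerm[$term], 2);
            $weights[$id][$term] = $tf * $idf;
        }
    }
    return $weights;
}

$weights = tfidf([
    'A' => "I really like eggs",
    'B' => "I don't like cabbage, and don't like stew",
]);
```

With these two documents, terms that appear in both ("i", "like") get a weight of 0, while `$weights['A']['really']` is 0.25 and `$weights['B']['cabbage']` is 0.125, in line with the table on slide 19.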

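
The entropy and information gain formulas on slides 28 and 30 can be checked against the worked table on slide 31. The sketch below is a minimal version under some assumptions: the helper names `entropy` and `informationGain` are hypothetical, and the `[spam, ham]` array layout for the counts is chosen here for illustration only.

```php
<?php
// Hypothetical helper: entropy (in bits) of a spam/ham split.
function entropy(int $spam, int $ham): float
{
    $total = $spam + $ham;
    $entropy = 0.0;
    foreach ([$spam, $ham] as $count) {
        if ($count > 0) { // an empty class contributes nothing
            $p = $count / $total;
            $entropy -= $p * log($p, 2);
        }
    }
    return $entropy;
}

// Hypothetical helper: information gain from splitting on one term.
// $with and $without are [spam, ham] counts for the documents that
// do and do not contain the term.
function informationGain(array $with, array $without): float
{
    $withCount = array_sum($with);
    $woutCount = array_sum($without);
    $total = $withCount + $woutCount;
    $baseEntropy = entropy($with[0] + $without[0], $with[1] + $without[1]);

    return $baseEntropy
        - ($withCount / $total) * entropy($with[0], $with[1])
        - ($woutCount / $total) * entropy($without[0], $without[1]);
}

// Slide 31: 20/5 with the term, 30/45 without, 50/50 overall.
$gain = informationGain([20, 5], [30, 45]);
printf("%.3f\n", $gain);
```

Computed without intermediate rounding, the gain comes out at roughly 0.091. The slide's 0.092 follows from rounding the "Without" entropy to 0.97 before multiplying.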