Document Classification In PHP

Note: a slightly updated version of these slides is available at: http://www.slideshare.net/IanBarber/document-classification-in-php-slight-return

This talk discusses how PHP and open source tools can be used to group and classify data for a whole host of applications, including information retrieval, data mining and more.

Published in: Technology, News & Politics

  1. Document Classification In PHP @ianbarber - ian@ibuildings.com http://joind.in/talk/view/587
  2. Document Classification: Defining The Task, Document Pre-processing, Term Selection, Algorithms
  3. What is Document Classification?
  4. Uses: Filter, Organise, Metadata
  5. Filtering - Binary Classification
  6. Organising - Single Label Classification
  7. Metadata - Multiple Label Classification
  8. Manual - Rules Written By Domain Experts
  9. Machine Learning - Automatically Extract Rules
  10. Classes, Training Documents, Test Documents
  11. Evaluation (rows: predicted class; columns: actual spam, actual ham): predicted spam: true positive, false positive; predicted ham: false negative, true negative
  12. Measures: $accuracy = ($tp + $tn) / ($tp + $tn + $fp + $fn); $precision = $tp / ($tp + $fp); $recall = $tp / ($tp + $fn);
  13. Fβ Measure: $beta = 0.5; $betaSq = $beta * $beta; $f = ((1 + $betaSq) * $precision * $recall) / (($betaSq * $precision) + $recall);
  14. Vector Space Model - Bag Of Words
  15. Extract Tokens: $doc = strtolower(strip_tags($doc)); $regex = '/[^a-z0-9\'\s]/'; $doc = preg_replace($regex, '', $doc); $words = preg_split('/\s+/', $doc, -1, PREG_SPLIT_NO_EMPTY);
  16. A: I really like eggs. B: I donʼt like cabbage, and donʼt like stew. Term presence (i, really, like, eggs, cabbage, and, donʼt, stew): A = 1 1 1 1 0 0 0 0; B = 1 0 1 0 1 1 1 1
  17. [Plot: documents in the vector space; x-axis "really", y-axis "i"]
  18. Term Weighting: $tf = $termCount / $wordCount; $idf = log($totalDocs / $docsWithTerm, 2); $tfidf = $tf * $idf;
  19. A: I really like eggs. B: I donʼt like cabbage, and donʼt like stew. TF-IDF weights (i, really, like, eggs, cabbage, and, donʼt, stew): A = 0 0.25 0 0.25 0 0 0 0; B = 0 0 0 0 0.125 0.125 0.25 0.125
  20. A: I really like eggs. B: I donʼt like cabbage, and donʼt like stew. Weights rescaled so that each row sums to 1 (i, really, like, eggs, cabbage, and, donʼt, stew): A = 0 0.5 0 0.5 0 0 0 0; B = 0 0 0 0 0.2 0.2 0.4 0.2
  21. Dimensionality Reduction
  22. Stop Words
  23. Stemming: happening - happen, happens - happen, happened - happen http://tartarus.org/~martin/PorterStemmer
  24. Chi-Square contingency table (columns: spam, ham): term: $a, $b; not term: $c, $d
  25. Chi-Square 1DF: $a = $termSpam; $b = $termHam; $c = $restSpam; $d = $restHam; $total = $a + $b + $c + $d; $diff = ($a * $d) - ($c * $b); $chisquare = ($total * pow($diff, 2)) / (($a + $c) * ($b + $d) * ($a + $b) * ($c + $d));
  26. p-Value (p: chi2): 0.1: 2.71; 0.05: 3.84; 0.01: 6.63; 0.005: 7.88; 0.001: 10.83
  27. Decision Tree - ID3 [diagram: decision nodes (?) with ✔ / ✖ leaves]
  28. Entropy: $entropy = -(($spam / $total) * log($spam / $total, 2)) - (($ham / $total) * log($ham / $total, 2));
  29. [Plot: entropy (y-axis, 0 to 1.00) against spam/total (x-axis, 0 to 1.0)]
  30. Information Gain: $gain = $baseEntropy - (($withCount / $total) * $withEntropy) - (($woutCount / $total) * $woutEntropy);
  31. Columns: Split, Entropy, Proportion, E*P. Base: 50/50, 1, 1, 1; With: 20/5, 0.722, 0.25, 0.1805; Without: 30/45, 0.97, 0.75, 0.7275. Gain: 1 - With - Without = 1 - 0.1805 - 0.7275 = 0.092
  32. function build($tree, $score) { if(!$score[2]) { return 'spam'; } else if(!$score[1]) { return 'ham'; } list($trees, $scores, $term) = getMaxGain($tree); return array($term => array( 0 => build($trees[0], $scores[0]), 1 => build($trees[1], $scores[1]) )); }
  33. array('hello' => array( 0 => array('terry' => array ( 0 => 'spam', 1 => array('everybody' => array( 0 => 'ham', 1 => 'spam' ) ) ) ), 1 => 'spam' ) );
  34. Classification: function classify($doc, $tree) { if(is_string($tree)) { return $tree; } $key = key($tree); if(in_array($key, $doc)) { return classify($doc, $tree[$key][0]); } else { return classify($doc, $tree[$key][1]); } }
  35. Overfitting: Pruning or Stop Conditions
  36. K Nearest Neighbour
  37. [Plot: Spam and Ham documents; axes: Term X, Term Y]
  38. [Plot: axes: Term X, Term Y]
  39. [Plot: axes: Term X, Term Y]
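  40. Euclidean Distance: $sum = 0; foreach($doca as $term => $tfidf) { $other = isset($docb[$term]) ? $docb[$term] : 0; $sum += pow($tfidf - $other, 2); } $distance = sqrt($sum); /* terms that appear only in $docb are not counted */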
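  41. Cosine Similarity: $similarity = 0; foreach($doca as $term => $tfidf) { if(isset($docb[$term])) { $similarity += floatval($tfidf) * floatval($docb[$term]); } } /* dot product: equals the cosine only if both vectors are normalised to unit length */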
  42. Classifying: foreach($scores as $s) { $classes[$s['class']]++; } foreach($scores as $s) { $classes[$s['class']] += $s['sim']; } arsort($classes); $class = key($classes);
  43. Zend_Search_Lucene $index = Zend_Search_Lucene::create($db); $doc = new Zend_Search_Lucene_Document(); $doc->addField( Zend_Search_Lucene_Field::Text( 'class', $class)); $doc->addField( Zend_Search_Lucene_Field::UnStored( 'contents', $content)); $index->addDocument($doc);
  44. Classifying with ZSL: Zend_Search_Lucene::setResultSetLimit(25); $results = $index->find($content); foreach($results as $result) { $classes[$result->class] += 1; } arsort($classes); $class = key($classes);
  45. Flax/Xapian Search Service http://www.flax.co.uk
  46. $flax = new FlaxSearchService('ip:8080'); $db = $flax->createDatabase('test'); $db->addField('class', array( 'store' => true, 'exacttext' => true)); $db->addField('contents', array( 'store' => false, 'freetext' => array('language'=>'en'))); $db->commit(); $db->addDocument(array( 'class' => $class, 'contents' => $document)); $db->commit();
  47. $db->addDocument( array('contents' => $doc), 'foo'); $db->commit(); $results = $db->searchSimilar('foo',0,25); $db->deleteDocument('foo'); $db->commit(); foreach($results['results'] as $r) { if($r['docid'] != 'foo') { $classes[$r['data']['class'][0]] += 1; } } arsort($classes); $class = key($classes);
  48. [Plot: Spam and Ham documents; axes: Term X, Term Y]
  49. Prototypes For Rocchio $mul = 1 / $docsInClassCount; foreach($classDocs as $tid => $tfidf) { $prototype[$tid] += $mul * $tfidf; }
  50. Naive Bayes - Probability Based Classifier
  51. Bayes Theorem: Pr(Class|Doc) = Pr(Doc|Class) * Pr(Class) / Pr(Doc), and since Pr(Doc) is the same for every class: Pr(Class|Doc) ∝ Pr(Doc|Class) * Pr(Class)
  52. Likelihood Of Term Occurring Given Class (columns: word, spam freq, pr(word|spam), ham freq, pr(word|ham)): register: 1757, 0.11, 246, 0.02; sent: 487, 0.03, 4600, 0.36
  53. Estimating Likelihood: $this->db->query(" INSERT INTO class_terms (class, term, likelihood) SELECT d.class, d.term, count(*) / " . $classCount . " FROM documents AS d JOIN document_terms AS dt USING (did) WHERE d.class = '" . $class . "'" );
  54. Classifying A Document: foreach($classes as $class) { $prob[$class] = 0.5; /* assume prior */ foreach($document as $term) { $prob[$class] *= $likely[$term][$class]; } } arsort($prob); $class = key($prob);
  55. Document Classification: Defining The Problem, Document Processing, Term Selection, Algorithm
  56. Image Credits: Title: http://www.flickr.com/photos/themacinator/3499579760/; What is...: http://www.flickr.com/photos/austinevan/1225274637/; Filter: http://www.flickr.com/photos/benimoto/2913950616/; Organise: http://www.flickr.com/photos/ellasdad/425813314/; Metadata: http://www.flickr.com/photos/banky177/2282734063/; Manual: http://www.flickr.com/photos/foundphotoslj/1134150364/; Automatic: http://www.flickr.com/photos/29278394@N00/59538978/; Vector Space: http://www.flickr.com/photos/ethanhein/2260878305/sizes/o/; Reduction: http://www.flickr.com/photos/wili/157220657/sizes/l/; Stemming: http://www.flickr.com/photos/clearlyambiguous/20847530/sizes/l/; Stop words: http://www.flickr.com/photos/afroswede/22237769/; Chi-Squared: http://www.flickr.com/photos/kdkd/2837565850/sizes/o/; ID3: http://www.flickr.com/photos/tonythemisfit/2414239471; Overfitting: http://www.flickr.com/photos/akirkley/3222128726/sizes/l/; Bayes: http://www.flickr.com/photos/darwinbell/440080655/sizes/l/; Conclusion: http://www.flickr.com/photos/mukluk/241256203; Credits: http://www.flickr.com/photos/librarianavengers/413762956/
  57. Questions? @ianbarber - ian@ibuildings.com http://joind.in/talk/view/587
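
The evaluation measures on slides 12 and 13 can be tried out with a minimal sketch like the one below. The function name `evaluationMeasures`, the zero-division guards and the example counts are assumptions made for illustration and are not from the slides; the Fβ line uses the standard definition, in which β is squared.

```php
<?php
// Hypothetical helper (not from the slides): derives accuracy, precision,
// recall and F-beta from raw confusion-matrix counts.
function evaluationMeasures($tp, $fp, $fn, $tn, $beta = 1.0)
{
    $accuracy = ($tp + $tn) / ($tp + $tn + $fp + $fn);

    // Guard against division by zero when a class was never predicted or never present.
    $precision = ($tp + $fp) > 0 ? $tp / ($tp + $fp) : 0.0;
    $recall    = ($tp + $fn) > 0 ? $tp / ($tp + $fn) : 0.0;

    // Standard F-beta: beta > 1 favours recall, beta < 1 favours precision.
    $betaSq = $beta * $beta;
    $denominator = ($betaSq * $precision) + $recall;
    $f = $denominator > 0
        ? ((1 + $betaSq) * $precision * $recall) / $denominator
        : 0.0;

    return array(
        'accuracy'  => $accuracy,
        'precision' => $precision,
        'recall'    => $recall,
        'f'         => $f,
    );
}

// Made-up counts, purely for illustration.
$m = evaluationMeasures(40, 10, 20, 30);
printf(
    "accuracy=%.2f precision=%.2f recall=%.2f f1=%.2f\n",
    $m['accuracy'], $m['precision'], $m['recall'], $m['f']
);
// prints: accuracy=0.70 precision=0.80 recall=0.67 f1=0.73
```

Passing `0.5` as the last argument gives the precision-weighted variant that slide 13 uses as its example value for `$beta`.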

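The tokenising and term-weighting steps from slides 15 and 18 can be combined into a small sketch that reproduces the weights shown on slide 19. The function names `tokenize` and `tfidf` are hypothetical, and the example uses a plain ASCII apostrophe in "don't" because the tokeniser here only keeps that character.

```php
<?php
// Hypothetical helper: lower-cases, strips tags and punctuation, splits on whitespace.
function tokenize($doc)
{
    $doc = strtolower(strip_tags($doc));
    // Keep letters, digits, apostrophes and whitespace; drop everything else.
    $doc = preg_replace('/[^a-z0-9\'\s]/', '', $doc);
    return preg_split('/\s+/', $doc, -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: returns array(docId => array(term => tf-idf weight)).
function tfidf(array $docs)
{
    $tokens = array_map('tokenize', $docs); // keys are preserved
    $totalDocs = count($docs);

    // Count how many documents contain each term.
    $docsWithTerm = array();
    foreach ($tokens as $words) {
        foreach (array_unique($words) as $term) {
            $docsWithTerm[$term] = isset($docsWithTerm[$term]) ? $docsWithTerm[$term] + 1 : 1;
        }
    }

    $weights = array();
    foreach ($tokens as $id => $words) {
        $wordCount = count($words);
        $weights[$id] = array();
        foreach (array_count_values($words) as $term => $termCount) {
            $tf  = $termCount / $wordCount;
            $idf = log($totalDocs / $docsWithTerm[$term], 2);
            $weights[$id][$term] = $tf * $idf;
        }
    }
    return $weights;
}

$weights = tfidf(array(
    'A' => "I really like eggs",
    'B' => "I don't like cabbage, and don't like stew",
));
print_r($weights);
```

Terms that occur in both documents ("i", "like") get a weight of 0 because their inverse document frequency is log2(2/2) = 0, which is why those columns are empty on slide 19.
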
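
The entropy and information gain formulas from slides 28 and 30 can be checked against the worked numbers on slide 31 with a short sketch. The function names `entropy` and `informationGain` and the array layout for the counts are assumptions for illustration; the sketch also treats 0 * log(0) as 0, which the slide formula does not handle.

```php
<?php
// Hypothetical helper: binary entropy of a spam/ham split, in bits.
function entropy($spam, $ham)
{
    $total = $spam + $ham;
    $entropy = 0.0;
    foreach (array($spam, $ham) as $count) {
        if ($count > 0) { // an empty class contributes nothing (0 * log 0 taken as 0)
            $p = $count / $total;
            $entropy -= $p * log($p, 2);
        }
    }
    return $entropy;
}

// Hypothetical helper: $with and $without are array(spamCount, hamCount)
// for the documents that do and do not contain the candidate term.
// Assumes at least one document overall.
function informationGain(array $with, array $without)
{
    $withCount = array_sum($with);
    $woutCount = array_sum($without);
    $total = $withCount + $woutCount;

    $baseEntropy = entropy($with[0] + $without[0], $with[1] + $without[1]);

    return $baseEntropy
        - (($withCount / $total) * entropy($with[0], $with[1]))
        - (($woutCount / $total) * entropy($without[0], $without[1]));
}

// Slide 31: 20 spam / 5 ham contain the term, 30 spam / 45 ham do not.
$gain = informationGain(array(20, 5), array(30, 45));
printf("%.3f\n", $gain);
```

Run without rounding the intermediate entropies, this gives roughly 0.091, which agrees with the 0.092 on slide 31 up to the rounding of 0.97 used there.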