Document Classification In PHP - Slight Return



A shortened but updated version of my document classification talk given in Feb 2010 at Sogeti Engineering World

  • Hello!
    PSC @ Ibuildings
    Blog - related posts
  • This is what you need to do to implement a classifier
    And also our table of contents
    A note on PHP - questions at the end, but ask syntax questions straight away
    First, let's talk about the what and why
  • What is it - Assign documents to classes from a predefined set
    Classes can be any label - e.g. topic words, categories
    Documents in this case are text, web pages, emails, books
    But it can be really anything as long as you can extract features from it
    Algorithms aren't hard, and are applicable in all languages. Python/Java have good library versions
    So why do it in PHP? To integrate into web apps - WP, Drupal, MediaWiki
  • Classification is really organising of information - every day
    Lots of uses - can group into common tasks of filter, organise, add metadata
    Might do all three when uploading photos to Flickr or Facebook
    Filter, get rid of bad ones.
    Organise, upload to album or set
    Tag photos with people in them etc.
  • Filtering is binary - Class OR Not Class
    - often hide or remove one lot
    Can often break the other types down into a series of these binary choices
    BUT: simple, not easy. In flickr example, what is good?
    - photographer, composition, light etc.
    - regular person, contains their friends etc. - SUBJECTIVE
  • Organising is putting a document in one place - one label chosen from a set of many possible
    Single label only (often EXACTLY 1, 0 not allowed)
    Folders, albums, libraries, handwriting recognition
  • Tagging, can have multiple
    Often 0 to many labels
    Often for tagging topics in content
    E.g. a news story on us-china embargo talk might be filed under: US, China, Trade
  • In 80s people would come up with rules - computers would apply
    IF this term AND this term THEN this category
    Took a lot of time - needed a domain expert, and a knowledge engineer to get the knowledge out of the expert
    Hard to scale, need more experts for new cats - Subjective - experts disagree
    Usually result was 60%-90% accurate
  • Machine Learning - ‘look at examples’ - Supervised Learning
    Work out rules based on manually classified data
    People don’t need to explain their thinking - just organise - easier
    Scales better, is cheaper, and about as accurate!
    In the picture, it’s easy to see by looking at the groupings what the ‘rule’ for classifying m&ms is
  • So what do we need?
    1. the classes to classify to
    2. A set of manually classified documents to train the classifier on
    3. A set of manually classified docs to test on
    In some cases may have a third set of manually classified docs for validation
    How do we use these? We train a learner on training data to create a model
  • Then use the model to classify each test document
    Compare manual to automatic judgements
    Here we’ve got a binary classification, for a spam checker
    Top is the manual judgement, side is classifier judgement
    Boxes will just be counts of judgements
    With that we can calculate some stats
  • Accuracy is just the percentage correct - BUT with a heavily biased set, saying 'no' to everything can still score as accurate. Or we may actually want a bias, e.g. preferring FN over FP with spam
    Precision measures the number of positives that are true positives
    Recall measures what percentage of the available positives we capture
    Can have one without the other: a high threshold for precision, mark everything positive for recall
    Researchers quote breakeven or F-beta
  • To compare classifiers, researchers often quote the breakeven point
    This is just where recall and precision are equal
    F-beta allows weighting precision more than recall, or vice versa
    Beta = 1 is balanced
    If beta = 0.5, recall is half as important as precision, as for a spam checker (see the sketch below)
    Before classifying, we need to extract features. How do we represent text?
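    A minimal sketch of the F-beta measure mentioned above, building on the $precision and $recall values from the measures slide (variable names are just for illustration):
      // beta < 1 weights precision more heavily; beta = 1 gives the balanced F1
      $beta = 0.5;
      $fbeta = (1 + $beta * $beta) * $precision * $recall
             / (($beta * $beta * $precision) + $recall);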
  • All this work is classic Information Retrieval
    Bag of Words is so called because we discard the structure, and just note down appearances of words
    Throw away the ordering, any structure at all from web pages etc.
    See why called vector space in a couple of slides
  • First we have to extract words
    Simplest version: take continuous sequences of word characters
    Ignore all punctuation, including apostrophes etc.
    Each new token we find in each document will be added to a dictionary
    Each document has a vector - there is a dimension for each dictionary word
    Value is 1 if the document contained that token, 0 if it did not (see the sketch below)
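    A rough sketch of building the dictionary and a binary document vector from the extracted $words array (names are illustrative, not from the slides):
      // $dictionary maps each term to a dimension index; in practice it is
      // shared across the whole collection rather than rebuilt per document
      $dictionary = array();
      foreach ($words as $word) {
          if (!isset($dictionary[$word])) {
              $dictionary[$word] = count($dictionary);
          }
      }
      // Binary vector: 1 if the document contains the term, 0 otherwise
      $vector = array_fill(0, count($dictionary), 0);
      foreach ($words as $word) {
          $vector[$dictionary[$word]] = 1;
      }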
  • Here is the collection of these two phrases as a vector.
    1 if the word is in the document, 0 if not
    Note both vectors have the same dimensions
    In a real document collection there are lots of dimensions!
  • We can plot the documents on a graph - using 2 terms ‘i’ and ‘really’
    Here the green circle is document A, the red triangle is document B
    The documents on the last page are in 8 dimensional space - 8 terms
    But we want more information - how important a term is to a document
    Need to capture a position in that dimension other than 0 & 1
    A weight
  • TFIDF is a classic and very common weighting - are a lot of variations though
    TF is just count of instances of that term in the doc
    IDF is the log of the number of docs divided by the number of docs containing the term
    Gives less common terms a higher weight
    So the best score goes to an uncommon term that appears a lot in the document
    Let's look at a similar example to before, with some term weights added (worked numbers below)
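    A worked example using the numbers from the next slide ('really' appears twice in doc C, and in 2 of the 3 documents):
      $tf = 2;                // 'really' occurs twice in doc C
      $idf = log(3 / 2, 2);   // 3 docs, 2 contain 'really' => ~0.58
      $tfidf = $tf * $idf;    // ~1.17, the value in the table for C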
  • The idf means that the ‘i’ and ‘like’ actually disappear here
    In all docs - no distinguishing power - no value to doc
    'Don't' gets weighted higher in B (it appears twice there)
    Then normalise to unit length
  • Normalising is just dividing each value by the total vector length
    (the square root of the sum of the squared values - see the sketch below)
    'I' and 'like' are still 0 though
    Waste of time processing them
    Maybe there are others that are a waste of time?
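    A minimal sketch of normalising a document to unit length, assuming $doc is an array of term => tfidf weights:
      $length = 0.0;
      foreach ($doc as $tfidf) {
          $length += $tfidf * $tfidf;
      }
      $length = sqrt($length);                 // vector length
      foreach ($doc as $term => $tfidf) {
          $doc[$term] = $length > 0 ? $tfidf / $length : 0.0;
      }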
  • DR or term space reduction is removing terms that don’t contribute much
    This can often be by a factor of 10 or 100
    Speeds up execution
  • May have heard of stop words - Common in search engines
    Words like 'of', 'the', 'an' - or 'het', 'de' in Dutch
    Little to no semantic value to us
    Can use a stoplist of words, or infer it from low idf scores
    Collection stop words
    Pokemon in English: not a stop word. Pokemon on a Pokemon forum: stop word (see the sketch below)
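    A sketch of dropping stop words, either from a fixed list or by a low IDF score ($idf and $idfThreshold are assumed to exist, e.g. from the term weighting step):
      $stopWords = array('of', 'the', 'an', 'het', 'de');
      $filtered = array();
      foreach ($words as $word) {
          $inList = in_array($word, $stopWords);
          $lowIdf = isset($idf[$word]) && $idf[$word] < $idfThreshold;
          if (!$inList && !$lowIdf) {
              $filtered[] = $word;
          }
      }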
  • Try to come up with ‘root’ word
    Maps lots of different variations onto one term, reducing dimensions
    Result is usually not a real word, it just has to be repeatable (crude sketch below)
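    A deliberately crude suffix-stripping sketch, just to show the idea of mapping variants onto one repeatable form; a real system would use something like the Porter stemmer:
      function crude_stem($word) {
          // Strip a few common English suffixes: happening/happens/happened -> happen
          return preg_replace('/(ing|ed|s)$/', '', $word);
      }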
  • Chi-square - 'chi' is Greek, not Chinese - helps choose indicative terms for each class
    Statistical technique - Calculates how related a term is to a class
    Take 4 Counts from data. How many spam docs contain term etc.
    We look for difference between expected and actual counts
    For a given cell Expected is the row sum * col sum / total
    Square the difference, divide by the expected value, and add them all up (see the sketch below)
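    A sketch of that long-form calculation, using the $a..$d counts from the chi-square table on the slides (term / not-term rows, spam / ham columns):
      $total    = $a + $b + $c + $d;
      $observed = array($a, $b, $c, $d);
      $expected = array(
          ($a + $b) * ($a + $c) / $total,   // term & spam
          ($a + $b) * ($b + $d) / $total,   // term & ham
          ($c + $d) * ($a + $c) / $total,   // not term & spam
          ($c + $d) * ($b + $d) / $total,   // not term & ham
      );
      $chisquare = 0.0;
      foreach ($observed as $i => $o) {
          // (observed - expected)^2 / expected, summed over the four cells
          $chisquare += pow($o - $expected[$i], 2) / $expected[$i];
      }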
  • Plug the numbers into this formula: a one step way of doing the same thing
    Comes out with a number - not interesting absolutely
    But is interesting relatively
    Chi-square is a distribution, so we can calculate a probability of the events being unrelated using the area from this distribution
    1 degree of freedom because there is one variable (term) and one dependent (class)
  • P is the chance that variables are independent
    For > 10.83 we are 99.9% certain the variables change together
    Can work out the probability number from a chi-square distribution
    But for DR, we can just use a threshold and remove terms below it (see the sketch below)
    OK, so we’ve got a good set of data, now we need a learner
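    Before moving on, a minimal sketch of that thresholding step, assuming $chiScores maps each term to its chi-square value for the class:
      $kept = array();
      foreach ($chiScores as $term => $score) {
          if ($score > 10.83) {    // ~99.9% confidence the term and class are related
              $kept[] = $term;
          }
      }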
  • Tree of 'has term?' questions - ends in a class decision
    Easy to classify with, and the recursive building algorithm is pretty easy
    Algorithm: if the whole collection is one class, make a leaf of that class
    Else, choose the best term - split into 2 collections, WITH and WITHOUT the term
    Recurse on each half
    But how does it determine best?
  • First, calculate entropy
    Take counts for how many total docs, how many spam, how many ham
    The minus term would be repeated for each additional class
    Represents num bits needed to encode the class of a random choice of document from this set
    How much new information we get - Easier to see on graph
  • Percentage of spam on the horizontal axis, entropy on the vertical
    If all spam or no spam no entropy - we know what will come out
    If 50/50 entropy is 1 - we can’t guess ahead of time
    We want to reduce entropy - so that the sets are more consistent
  • We're using the entropy to calculate the maximum information gain
    This is the overall reduction in entropy
    The original entropy minus the new entropy
    New is weighted by the proportion of docs in each group
    withCount is the number of docs that have the feature
    woutCount is the number without, total is the total
  • The split is how many of each class are in the group
    The entropy is calculated with the formula before
    The proportion is just the percentage of the total documents
    Final col is just entropy times proportion
    Note that the 'with' group is very biased, with a low entropy
    BUT it's only a small proportion, so the final information gain is low (worked numbers below)
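    Checking the 'with' row of that table with the entropy formula from the earlier slide:
      $spam = 20; $ham = 5; $total = 25;
      $entropy = -(($spam / $total) * log($spam / $total, 2))
                 -(($ham  / $total) * log($ham  / $total, 2));   // ~0.722
      $weighted = $entropy * (25 / 100);                         // proportion 0.25 -> ~0.1805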
  • Easy to implement recursive builder
    If ‘spam’ or ‘ham’ are empty - we say the tree is a leaf node.
    If not, we find the term with the highest info gain
    And build subtrees from the sets of documents with and without that term
    Just need to traverse to classify
  • A completely made up example of an output tree.
  • Millions of ways to do this, of course!
    Simple function that traverses the tree and returns the class at the leaf node
    Assumes document is an array of words
  • Problem: Tree gets too specific to training data - Need to generalise
    Stop condition - min info gain or other
    Pruning - test by trimming off bottom parts of tree
    Use validation set to test effectiveness of measures
    DTs generate human interpretable rules - very handy
    BUT expensive to train, need a small number of dimensions, and often require rebuilding
  • KNN is much cheaper at training time - as there is no training
    Recall we can regard documents as vectors in a N-dimensional space
    Where N is the size of the dictionary
  • Lets consider only 2 terms
    Docs with weights for terms X and Y
    Documents of class triangle and class circle
    They seem to have a spatial cluster
    This is also true in higher dimensions for real documents
  • Class of new doc = class of its K nearest neighbours
    The K is how many we look for
  • In this case K is three, and the nearest three are all green circles.
    Choosing K is kind of hard; you might try a few different values, but it's usually in the 11-30 doc range - pick something that can't split evenly across the number of classes
    Only real challenge is comparing documents
    Here we are looking at just the X and Y distance - this is the Euclidean distance (sketch below)
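    A minimal sketch of Euclidean distance between two document vectors ($doca and $docb as term => weight arrays, names for illustration only):
      $sum = 0.0;
      foreach ($doca as $term => $weight) {
          $other = isset($docb[$term]) ? $docb[$term] : 0.0;
          $sum  += pow($weight - $other, 2);
      }
      foreach ($docb as $term => $weight) {
          if (!isset($doca[$term])) {
              $sum += $weight * $weight;   // terms only present in the second doc
          }
      }
      $distance = sqrt($sum);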
  • Very easy. Simply looking at the difference between one and the other
    Can actually do the whole thing in the database !
    But it has some problems, so more common is...
  • Similarity measure, goes to 1 for identical, 0 for orthogonal
    Easy to do with normalised vectors - just take dot product
    Multiply each dimension in Doc A with it in Doc B, and sum
    Provides better matching than Euclidean
    We could just loop over documents, find K most similar
    But search engines do a very similar job - why not use one?
  • Two options when classifying: count the most common class, or add up the similarities (sketch below)
    The second helps, e.g. if there are 5 good matches in class A and 10 poor matches in class B
    For multiple class tagging: use thresholds
    BUT: Have to compare all documents
    Search engines do a very similar job, use similar scoring. Why not use one?
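    A sketch of the two voting schemes, assuming $neighbours is an array of the K matches with their class and similarity score (illustrative names only):
      $votes = array();
      foreach ($neighbours as $n) {
          if (!isset($votes[$n['class']])) {
              $votes[$n['class']] = 0;
          }
          // += 1 for a simple count, += similarity to weight by match quality
          $votes[$n['class']] += $n['similarity'];
      }
      arsort($votes);
      $class = key($votes);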
  • We can use Zend Framework's native PHP implementation of Lucene
    We add an unindexed ‘class’ field, and our contents
    We would loop over our training data this way, adding documents
  • Then, we construct a query.
    Use the same analyser to tokenise documents the same way as the training data
    And take a count of how often each word appears
    We don’t have IDF, so we’re just filtering short words
  • Construct a query with the top 50 words by term frequency
    Results: take the most common class
    Works OK, not great.
    With Java Lucene you can get a term vector, which includes the true weights
    We aren’t limited to using pure PHP search engines though
  • Flax is based on the open source Xapian engine - kind of like Xapian's equivalent of Solr
    Has a similarity search that makes KNN ridiculously easy and very effective
    It works along the same lines as before, extracting a set of relevant terms from the document or documents in question
    The weighting scheme is BM25 - more advanced
  • This code creates a database, adds two fields to it, and indexes a document
    Uses a restful web service - available from any language
  • Very similar to the Lucene loop
    Except we add and then remove the document, to use the searchSimilar feature
    Gets good accuracy and is really fast
    However, if we want to use this kind of technique and don't have Flax handy, there is another related technique
  • Instead of taking each value and comparing it
    We take the average of all the documents in each class
    And compare against that
    Very easy
    This works surprisingly well!
  • Here we compute the centroid, or average, of all the documents in the class
    By summing each weight * 1/count
    You might do this in the database - a pretty straightforward operation
    Called a Rocchio (pronounced 'rok-key-oh') classifier because it's based on a relevance feedback technique by Rocchio
    Classify by doing similarity against each prototype and taking the closest (sketch below)
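    A sketch of the classification step, comparing the (normalised) document against each class prototype with the dot product and taking the closest; $prototypes is assumed to be built as on the prototypes slide:
      $bestClass = null;
      $bestScore = -1;
      foreach ($prototypes as $class => $prototype) {
          $score = 0.0;
          foreach ($doc as $tid => $tfidf) {
              if (isset($prototype[$tid])) {
                  $score += $tfidf * $prototype[$tid];   // cosine similarity on unit vectors
              }
          }
          if ($score > $bestScore) {
              $bestScore = $score;
              $bestClass = $class;
          }
      }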
  • Quick and easy probability based classifier
    Very commonly used in spam checking, very trendy a couple of years back
    Naive assumption is that words are independent
    One word does not influence chances of seeing another - not true!
    BUT: Means that we don’t need an example for each combination of attributes
    Bayes is good at very high dimensionality because of this
  • This is the Bayes theorem. Read the pipe as ‘given’, pr as probability of
    Pr(Doc) is constant, can be dropped for ranking
    Pr(Class) is either count or assumed - e.g. 60% spam = 0.6, or just use 0.5/0.5 for binary
    Have to work out Pr(Doc|Class)
    We calculate that by looking at the probability of the features in the docs
  • We can look at the data itself to calculate the term likelihoods
    Conditional probability: Docs with term in class / Docs in class
    We had 1757 docs with the word register in the spam class, and about 16,000 docs in the spam class, so the probability is about 0.11.
    Register is more spam than ham, sent is more ham than spam
  • Can calculate in SQL directly
    ClassCount is the number of docs in that class - from earlier query
    Divide: Number of docs in class containing term / Number of docs in class
    The stored value is the likelihood of seeing that term in a doc of that class
    Would call once for each class
  • Independence assumption lets us treat probability of doc as product of probabilities of word for the given class
    Loop over the terms and multiply likelihood for each class
    Assumed prior of 0.5
    This is the multivariate Bernoulli form; the multinomial variant uses the term count in the class over the overall term count (sketch below)
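    A one-line sketch of the multinomial estimate, with illustrative variable names (occurrence counts rather than document counts):
      // Occurrences of the term in documents of this class, divided by the
      // total number of term occurrences in that class
      $likelihood[$term][$class] = $termOccurrencesInClass / $totalTermOccurrencesInClass;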
  • To sum up, these are the steps for a wide range of problems
    Step 1: Recognising that something is a classification problem
    - context-sensitive spelling, author identification, intrusion detection, finding genes in DNA
    Then extract features from the docs
    Apply a learner to generate a model for classifying
    Something for your mental toolbox!
  • Thanks to the people who put their photos on flickr under Creative Commons
  • Any questions?

    1. Document Classification In PHP @ianbarber
    2. Document Classification: Defining The Task, Document Pre-processing, Term Selection, Algorithms
    3. What is Document Classification?
    4. Uses: Filter, Organise, Metadata
    5. Filtering - Binary Classification
    6. Organising - Single Label Classification
    7. Metadata - Multiple Label Classification
    8. Manual Rules Written By Domain Experts
    9. Machine Learning - Automatically Extract Rules
    10. Classes, Training Documents, Test Documents
    11. Evaluation (manual judgement across the top, classifier judgement down the side):
                  spam              ham
        spam      true positive     false positive
        ham       false negative    true negative
    12. Measures: $accuracy = ($tp + $tn) / ($tp + $tn + $fp + $fn); $precision = $tp / ($tp + $fp); $recall = $tp / ($tp + $fn);
    13. Vector Space Model - Bag Of Words
    14. Extract Tokens: $doc = strtolower(strip_tags($doc)); $regex = '/\w+/'; preg_match_all($regex, $doc, $matches); $words = $matches[0];
    15. A: I really like eggs   B: I don't like cabbage, and don't like stew
              i   really  like  eggs  cabbage  and   don't  stew
        A     1   1       1     1     0        0     0      0
        B     1   0       1     0     1        1     1      1
    16. [Plot of documents A and B as points on the 'really' (x) and 'i' (y) axes]
    17. Term Weighting: $tf = $termCount; $idf = log($totalDocs / $docsWithTerm, 2); $tfidf = $tf * $idf;
    18. A: I really like eggs   B: I don't like cabbage, and don't like stew   C: I really, really like stew
              i   really  like  eggs  cabbage  and   don't  stew
        A     0   0.58    0     1.58  0        0     0      0
        B     0   0       0     0     1.58     1.58  3.16   0.58
        C     0   1.17    0     0     0        0     0      0.58
    19. A: I really like eggs   B: I don't like cabbage, and don't like stew   C: I really, really like stew
              i   really  like  eggs  cabbage  and   don't  stew
        A     0   0.35    0     0.94  0        0     0      0
        B     0   0       0     0     0.31     0.31  0.63   0.11
        C     0   0.89    0     0     0        0     0      0.44
    20. Dimensionality Reduction
    21. Stop Words
    22. Stemming: happening - happen, happens - happen, happened - happen
    23. Chi-Square:
                    spam  ham
        term        $a    $b
        not term    $c    $d
    24. Chi-Square 1DF: $a = $termSpam; $b = $termHam; $c = $restSpam; $d = $restHam; $total = $a + $b + $c + $d; $diff = ($a * $d) - ($c * $b); $chisquare = ($total * pow($diff, 2)) / (($a+$c) * ($b+$d) * ($a+$b) * ($c+$d));
    25. p-Value:
        p      chi2
        0.1    2.71
        0.05   3.84
        0.01   6.63
        0.005  7.88
        0.001  10.83
    26. Decision Tree - ID3 [example tree diagram]
    27. Entropy: $entropy = -( ($spam/$total) * log($spam/$total, 2)) -( ($ham/$total) * log($ham/$total, 2));
    28. [Plot of entropy (vertical, 0 to 1) against spam/total (horizontal, 0 to 1)]
    29. Information Gain: $gain = $baseEntropy - (($withCount/$total) * $withEntropy) - (($woutCount/$total) * $woutEntropy);
    30.            Split   Entropy  Proportion  E*P
        Base       50/50   1        1           1
        With       20/5    0.722    0.25        0.1805
        Without    30/45   0.97     0.75        0.7275
        Gain = 1 - With - Without = 0.092
    31. function build($tree) { if(!$tree->count('spam')) { $tree->setLeaf('ham'); } else if(!$tree->count('ham')) { $tree->setLeaf('spam'); } else { $term = $tree->findMaxGain(); $tree->addSubtree($term, build($tree->getWith()), build($tree->getWout())); } return $tree; }
    32. [Example output tree: splits on terms, ending in class leaves]
    33. Classification: function classify($doc, $tree) { if($tree->isLeaf()) { return $tree->class; } $term = $tree->getSplitTerm(); if(in_array($term, $doc)) { return classify($doc, $tree->getWith()); } else { return classify($doc, $tree->getWout()); } }
    34. Overfitting: Pruning or Stop Conditions
    35. K Nearest Neighbour
    36. [Scatter plot: spam and ham documents plotted by Term X and Term Y]
    37. [Scatter plot: documents plotted by Term X and Term Y]
    38. [Scatter plot: documents plotted by Term X and Term Y]
    39. Cosine Similarity: foreach($doca as $term => $tfidf) { $similarity += floatval($tfidf) * floatval($docb[$term]); }
    40. Zend_Search_Lucene: $index = Zend_Search_Lucene::create($db); $doc = new Zend_Search_Lucene_Document(); $doc->addField( Zend_Search_Lucene_Field::Text( 'class', $class)); $doc->addField( Zend_Search_Lucene_Field::UnStored( 'contents', $content)); $index->addDocument($doc);
    41. Classifying with ZSL: Zend_Search_Lucene::setResultSetLimit(25); $analyser = Zend_Search_Lucene_Analysis_Analyzer::getDefault(); $tokens = $analyser->tokenize($content); foreach($tokens as $key => $token) { $tok = $token->getTermText(); if(strlen($tok) > 4) $filtered[$tok]++; } arsort($filtered);
    42. $q = new Zend_Search_Lucene_Search_Query_MultiTerm(); $tc = 0; foreach($filtered as $t => $tf) { $q->addTerm( new Zend_Search_Lucene_Index_Term($t)); if(++$tc > 49) { break;} } $results = $index->find($q); foreach($results as $result) { $classes[$result->class] += 1; } arsort($classes); $class = key($classes);
    43. Flax/Xapian Search Service
    44. $flax = new FlaxSearchService('ip:8080'); $db = $flax->createDatabase('test'); $db->addField('class', array( 'store' => true, 'exacttext' => true)); $db->addField('contents', array( 'store' => false, 'freetext' => array('language'=>'en'))); $db->commit(); $db->addDocument(array( 'class' => $class, 'contents' => $document)); $db->commit();
    45. $db->addDocument( array('contents' => $doc), 'foo'); $db->commit(); $results = $db->searchSimilar('foo',0,25); $db->deleteDocument('foo'); $db->commit(); foreach($results['results'] as $r) { if($r['docid'] != 'foo') { $classes[$r['data']['class'][0]] += 1; } } arsort($classes); $class = key($classes);
    46. [Scatter plot: spam and ham documents plotted by Term X and Term Y]
    47. Prototypes For Rocchio: $mul = 1 / count($classDocs); foreach($classDocs as $doc) { foreach($doc as $tid => $tfidf) { $prototype[$tid] += $mul * $tfidf; } }
    48. Naive Bayes - Probability Based Classifier
    49. Bayes Theorem: Pr(Class|Doc) = Pr(Doc|Class) * Pr(Class) / Pr(Doc); for ranking, Pr(Class|Doc) ∝ Pr(Doc|Class) * Pr(Class)
    50. Likelihood Of Term Occurring Given Class:
        word      spam freq  pr(word|spam)  ham freq  pr(word|ham)
        register  1757       0.11           246       0.02
        sent      487        0.03           4600      0.36
    51. Estimating Likelihood: $this->db->query(" INSERT INTO class_terms (class, term, likelihood) SELECT d.class, d.term, count(*) / " . $classCount . " FROM documents AS d JOIN document_terms AS dt USING (did) WHERE d.class = '" . $class . "'" );
    52. Classifying A Document: foreach($classes as $class) { $prob[$class] = 0.5; /* assume prior */ foreach($document as $term) { $prob[$class] *= $likely[$term][$class]; } } arsort($prob); $class = key($prob);
    53. Document Classification: Defining The Problem, Document Processing, Term Selection, Algorithm
    54. Image Credits: Title, What is..., Filter, Organise, Metadata, Manual, Automatic, Vector Space, Reduction, Stemming, Stop words, Chi-Squared, ID3, Overfitting, Bayes, Conclusion, Credits
    55. Questions? @ianbarber