Document Classification In PHP
Note: a slightly updated version of these slides is available at: …

This talk discusses how PHP and open source tools can be used to group and classify data for a whole host of applications, including information retrieval, data mining and more.

Published in Technology, News & Politics
  • Hello
  • This is just a quick overview of what we’ll be talking about today

  • Lots of Python/Java classifiers, but not many for PHP
    Is it too hardcore? No, the algorithms are easy, and widely applicable.
    So what is it? Assigning documents labels from a predefined set.
    Labels can be anything - topic words, non-topic words, metadata, whatever
    Documents in this case are text, web pages, emails, books
    But it can be really anything, as long as you can extract features from it

  • Classification is really the organising of information - we do it every day
    Lots of uses; these are the main ones in my view.
    You might do all three when uploading photos to flickr or facebook
    Filter: get rid of the bad ones.
    Organise: upload to an album or set
    Metadata: tag photos with the people in them etc.
  • Filtering is Class OR Not Class - generally you then hide or remove one lot
    Binary classification - you can break most things down into a series of binary decisions
    In the flickr example, what is good?
    - photography: composition, light etc.
    - some people: friends look good
    - some people: friends look bad
  • Organising is putting a document in one place - one label chosen from a set of many possible
    Single label only (often EXACTLY 1; 0 not allowed)
    Folders, albums, libraries, handwriting recognition

  • Tagging can have multiple labels - often 0 to many
    Often used for tagging topics in content
    E.g. a US-China WTO trade talk might be filed under US, China and Trade
  • In the 80s, people would come up with rules
    Then computers would apply the rules
    IF this word AND this word THEN this category
    Took a lot of time
    Needed a knowledge engineer to get the knowledge out of an expert and into rules
    Didn't scale - needed more experts for new categories
    Subjective - experts disagree
    Usually the result was 60%-90% accurate

  • Machine Learning people said 'look at the data' - Supervised Learning
    Work out rules based on manually classified examples
    Scales better, is cheaper, and is about as accurate!
    Only need people to make examples - they don't have to be able to explain their process
    Look at the picture: it's easy to see from the groupings what the 'rule' for classifying m&ms is
  • So what do you need?
    1. The classes to classify into
    2. A set of manually classified documents to train the classifier on
    3. A set of manually classified docs to test on
    In some cases you may also have a third set of docs for validation

  • So how do we test?
    Run the test docs through, and compare the manual to the automatic judgements
    Here we've got a binary classification, for a spam checker
    Across the top is the manual judgement, down the side is the classifier's judgement
    The boxes will just be counts of judgements
    Some classifiers give a graded result, some give a yes/no result.
    For graded, we might take the top N judgements, or have a threshold they must achieve
    Either way, in the end we get down to a judgement

  • With that we can calculate some numbers
    Accuracy is just the percentage correct
    - not always useful, as we sometimes have a bias, e.g. preferring false negatives over false positives with spam
    Precision measures how tight our grouping is
    - how much we can trust a positive result really being positive
    Recall measures what percentage of the available positives we capture
    You can have one without the other:
    if you reject all but the ones you're most sure about, you get good precision
    if you mark everything positive, you get great recall
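    As a quick sketch of those three measures, with made-up counts for the spam checker example:

    ```php
    <?php
    // Made-up confusion matrix counts for a spam checker.
    $tp = 80;  // spam judged spam
    $fp = 10;  // ham judged spam
    $fn = 20;  // spam judged ham
    $tn = 90;  // ham judged ham

    $accuracy  = ($tp + $tn) / ($tp + $tn + $fp + $fn); // 0.85
    $precision = $tp / ($tp + $fp);                     // ~0.889
    $recall    = $tp / ($tp + $fn);                     // 0.8
    ```

    Note the bias trade-off: raising the threshold would push precision up at the cost of recall.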

  • Because of the balance between recall and precision, researchers often quote the breakeven point
    This is just where recall and precision are equal
    F is a more advanced measure, measuring the overlap between the two sets
    F-Beta just allows weighting precision more than recall, or vice versa.
    If beta = 0.5, recall is half as important as precision, such as with a spam checker
    If beta = 1, both are equally important
    There is also an E measure, which is just its complement: 1 - F

  • Before we do any classifying, we need to choose a way to represent text for the classifiers - indexing
    All this work comes from classic Information Retrieval
    Bag of Words is so called because we discard the structure, and just note down appearances of words
    Throw away the ordering, and any structure at all from web pages etc.

  • First we have to get the words
    We can use a variety of methods for extracting tokens
    About the simplest would be something like this
    We dump all punctuation - everything but basic letters - and split on whitespace.
    For email, Pear::Mail_mimeDecode is good for extracting the message body

    We then represent each document as an array, where the keys are all the terms from all docs
    And the values are whether that particular term is present in this particular document
    This is the document vector
  • Here is the collection of these two phrases as a vector.
    1 if the word is in the document, 0 if not
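    A minimal sketch of building those presence vectors - the helper name and data layout are just for illustration:

    ```php
    <?php
    // Build bag-of-words presence vectors for a small collection.
    function buildVectors(array $docs): array {
        // Collect the full vocabulary across all documents.
        $vocab = [];
        foreach ($docs as $doc) {
            foreach ($doc as $word) {
                $vocab[$word] = true;
            }
        }
        // 1 if the term appears in the document, 0 if not.
        $vectors = [];
        foreach ($docs as $id => $doc) {
            foreach (array_keys($vocab) as $term) {
                $vectors[$id][$term] = in_array($term, $doc) ? 1 : 0;
            }
        }
        return $vectors;
    }

    $vectors = buildVectors([
        'A' => ['i', 'really', 'like', 'eggs'],
        'B' => ['i', 'dont', 'like', 'cabbage', 'and', 'dont', 'like', 'stew'],
    ]);
    ```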

  • You can plot the documents on a graph
    Here the green circle is A, the red triangle B
    I've bounced 0 up on the graph just to keep it away from the value labels
    So our previous document would actually be a point in 8-dimensional space
    As we have 8 terms
    Simple enough, but what we really want is to capture a bit more information - a position on each axis
    So instead of storing just presence, we store 'weight', the value of the term

  • TFIDF is a classic and very common weighting - there are a lot of variations though
    TF is just the percentage of the document composed of the term
    IDF is the log of the number of docs divided by the number with the term
    Gives less common terms a higher weight
    So the best is an uncommon term that appears a lot
    Here are our previous documents weighted by this

  • The idf means that 'i' and 'like' actually disappear here, as they are in all docs
    Normally that wouldn't quite happen! But it shows they have no value for telling the documents apart
    'Don't' gets weighted higher.
    We'd then usually normalise this, to account a bit for doc length differences
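    The normalisation step might look like this - here each weight is divided by the total so the weights sum to one, which matches the worked values on the slides; dividing by the vector's Euclidean length is the other common choice:

    ```php
    <?php
    // Sketch: scale a tf-idf vector so its weights sum to one,
    // reducing the effect of document length differences.
    function normalise(array $vector): array {
        $total = array_sum($vector);
        if ($total == 0.0) {
            return $vector; // empty vector, nothing to scale
        }
        return array_map(fn($w) => $w / $total, $vector);
    }

    // Document B's tf-idf weights from the earlier example.
    $b = normalise(['cabbage' => 0.125, 'and' => 0.125,
                    'dont' => 0.25, 'stew' => 0.125]);
    // $b['dont'] is now 0.4, as on the slide
    ```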

  • There are unnecessary terms here though: 'i' and 'like'
    Most algorithms look at all terms, so the increased number of term dimensions can be a problem

  • The number of dimensions is the whole vocabulary - every word that's been seen in any document
    DR, or term space reduction, is all about removing terms that don't contribute much
    This can often reduce dimensions by a factor of 10 or 100!

  • You may have heard of stop words
    Common in the search engines of old
    Words like 'of', 'the', 'an' - little to no semantic value to us
    Can use a list of words, or infer them from low idf scores
    Which would also pick up 'collection' stop words that are not necessarily English stop words
    E.g. if you were classifying documents about pokemon, the word pokemon would probably appear very frequently, and be of little value
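    A sketch of inferring stop words from low idf scores - the counts and threshold here are made up:

    ```php
    <?php
    // Find collection stop words by low idf.
    // $docsWithTerm maps term => number of documents containing it.
    function stopWords(array $docsWithTerm, int $totalDocs, float $minIdf): array {
        $stops = [];
        foreach ($docsWithTerm as $term => $count) {
            $idf = log($totalDocs / $count, 2);
            if ($idf < $minIdf) {
                $stops[] = $term; // appears in too many docs to be useful
            }
        }
        return $stops;
    }

    // 'the' is in every doc (idf = 0); 'pokemon' is in 90 of 100 (idf ~ 0.15).
    $stops = stopWords(['the' => 100, 'pokemon' => 90, 'pikachu' => 5], 100, 0.5);
    ```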

  • Try to come up with a 'root' word
    Maps lots of different variations onto one term, reducing dimensions
    The result is usually not English - it's just repeatable
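    A deliberately crude suffix-stripping sketch of the idea - a real implementation would use something like the Porter stemmer:

    ```php
    <?php
    // Crude stemmer: strip a few common suffixes. Not real stemming -
    // just enough to show how variations map onto one repeatable root.
    function stem(string $word): string {
        foreach (['ations', 'ing', 'ed', 'es', 's'] as $suffix) {
            $len = strlen($suffix);
            if (strlen($word) > $len + 2 && substr($word, -$len) === $suffix) {
                return substr($word, 0, -$len);
            }
        }
        return $word;
    }
    ```

    All three forms from the slide collapse to the same term, so they share one dimension in the vector space.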

  • Chi-Square - that's the Greek letter chi
    A statistical technique - this is an example of one, but there are many: odds ratio, information gain etc.
    Keep only terms which are indicative of one class over another
    We count up the four values - like the truth table from before
    How many spam docs contain the term etc.
    Looks for the importance of a term to a class by seeing the difference between expected and actual counts
    The expected value for a cell is the row total times the column total, divided by the grand total
    Then we look at the square of the difference, divided by the expected value
    And add them all up
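    That calculation, written out with explicit expected values - it gives the same number as the one-step formula on the slide (the cell counts here are made up):

    ```php
    <?php
    // Chi-square from expected counts: for each cell,
    // expected = (row total * column total) / grand total,
    // then sum (observed - expected)^2 / expected over the four cells.
    function chiSquare(int $a, int $b, int $c, int $d): float {
        $total = $a + $b + $c + $d;
        $observed = [$a, $b, $c, $d];
        $expected = [
            ($a + $b) * ($a + $c) / $total,
            ($a + $b) * ($b + $d) / $total,
            ($c + $d) * ($a + $c) / $total,
            ($c + $d) * ($b + $d) / $total,
        ];
        $chi = 0.0;
        foreach ($observed as $i => $obs) {
            $chi += pow($obs - $expected[$i], 2) / $expected[$i];
        }
        return $chi;
    }

    // e.g. term in 40 spam / 10 ham, absent from 20 spam / 30 ham
    $score = chiSquare(40, 10, 20, 30);
    ```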

  • We plug the numbers into this formula, which is a one-step way of doing the same thing
    It comes out with a number which isn't particularly interesting in absolute terms
    But is interesting relatively
    We can calculate the probability of the events being unrelated using the area from this distribution
    It's 1 DF (degree of freedom) because there is one variable and one dependent
  • We can work out the probability from a chi-square distribution
    But for DR, we can just use a threshold and remove words that score below it
    P is the chance that the variables are independent - so for > 10.83 we are 99.9% certain the variables are dependent: one changes with the other
    OK, so we've got a good set of data - now we need a classifier

  • A series of term present/not present questions branching in a tree
    Eventually ending in leaf classification nodes - this is a 'yes or no' result; there is no grading of similarity
    Easy to classify with, and the building algorithm is pretty easy
    If the whole collection is one class, make a leaf of that class
    Else, choose the best term to split on, and recurse on each branch
    But how does it determine the best?

  • Calculate entropy
    - this section could be repeated for multiple classes
    Basically represents how many bits are needed to encode the result of a random sequence given this split
    Easier to see on a graph

  • If 0 or 1, the sequence is all the same class, so no bits
    If 0.5 it's 50/50, so you need 1 bit to encode each item
    If less than that, you can use shorter codes for the more common of spam or ham
    And longer for the less common, so the average bits per item is lower

  • Combine the two by looking for the maximum information gain
    The entropy of the current set, minus the weighted entropies of the two new sets

  • The final column is just entropy times proportion
    In this example the split looks pretty good
    The 'with' branch is very biased one way
    But because it's smaller, the information gain isn't massive

  • Easy to implement a recursive builder
    Gives us a tree in array format, which we could save by serialising
    We just need to traverse it to classify

  • A completely made-up example of an output tree.

  • There are millions of ways to do this, of course
    A simple function to return the leaf node
    Assumes the document is an array of words

  • Problem: if you go right to the end, the tree will probably be too specific to the training data
    Use a stop condition - e.g. a minimum info gain - or pruning
    Use a separate 'validation' set to test the effectiveness of the tree at different depths
    Choose the most effective
    DTs generate human-interpretable rules - very handy
    BUT they're expensive to train, don't handle lots of dimensions well, and often require rebuilding
  • KNN is much cheaper at training time - as there is no training
    It uses the fact that we can regard documents as vectors in an N-dimensional space

  • Let's consider only 2 terms; here we have documents displayed by their weights for terms X and Y
    Documents of class triangle and class circle
    They seem to cluster spatially

  • We can work out the class of a new document by looking at its nearest neighbours
    The K is how many we look at

  • In this case K is three, and the nearest three, as you can see, are all green circles.
    Choosing K is kind of hard - you might try a few different values, but it's usually in the 10-30 doc range
    The only real challenge is comparing documents
    Here we are looking at just the X and Y distance - the euclidean distance

  • Very easy - simply looking at the difference between one document and the other
    You can actually do the whole thing in the database!
    But it has some problems, so more common is...

  • An alternative measure: goes to 1 for identical, 0 for orthogonal, -1 for opposite
    Easy to do with normalised vectors - just take the dot product
    Covers some cases euclidean is less good at

  • We've got two options when classifying - we can count the most common class, as in the first loop
    But this system gives us a grading of the match: the distance
    So we can weight on how similar they are - on the assumption that the best matches are the most indicative
    Here we're just adding the similarities - the closer the match, the higher the value
    Could get much more fancy with weighting schemes, of course
    In multi-label classification we might in fact take any class that gets over a certain weight.
    But it's still a bit of a pain to do in PHP - you compare to every training document
    There are lots of ways to optimise, and search engines do a very similar job, similarity-wise
    Why not use one?

  • Search engines are usually not designed to take whole documents as queries
    So some fudging is needed, like looking at only subject lines
    Not necessarily great results, but very easy to implement
    Good for twitter, or shorter applications perhaps

  • Just implementing K using the result limit
    You'll also want to replace ? and * characters
    Or you could add terms through the API
    Still, a bit of a sketchy classifier

  • Flax is based on the open source Xapian engine - kind of like their Solr
    Has a similarity search that makes KNN ridiculously easy and very effective
    The version with the PHP client is in SVN trunk at the moment, but it's stable

  • This code creates a database, adds two fields to it, and indexes a document

  • Very similar to the lucene loop
    Except we add then remove a document, to use the similarity feature
    Gets good accuracy and is pretty fast.
    However, if we want to use this kind of technique and don't have a Flax handy,
    there is another related technique

  • Instead of taking each value and comparing it
    We take the *average* of all the documents in each class
    And compare against that
    This works surprisingly well!

  • Here we compute the centroid of each class
    By summing the weights, and multiplying by 1/the count.
    You might do this in the database - a pretty straightforward op.

    It's called a Rocchio classifier because it's based on a relevance feedback technique by Rocchio
  • A quick and easy probability-based classifier
    Very commonly used in spam checking
    The naive assumption is that words are independent - which is clearly not true
    It means we don't need an example for each combination of attributes, which is very helpful for docs!
    Bayes is good at very high dimensionality because of this

  • Take this slowly!!
    Read the pipe as 'given', Pr as 'probability of'
    All classes are using the same doc, and since we only care about the most likely, we can drop that bit
    The probability of the class is easy - we can either work it out as a likelihood or just assume 0.5 (for binary)
    So we just have to work out the probability of the document given the class, which we can treat as the product of the likelihoods of its terms occurring given the class

  • We can look at the data itself to calculate the term likelihoods
    Simply the conditional probability: the number of times the term
    occurs along with the given class, divided by the total appearances of that class

  • We can calculate it in an SQL query if we're storing the data.
    Assuming we've stored the total count in $classCount, and the class in class

  • The independence assumption lets us treat that as the product of the probabilities of each individual
    term given the class.
    Here we calculate it by looping over the terms in a doc, and multiply by the prior probability - probably 0.5.

    This is multi-bernoulli bayes; there is also a multinomial version which calculates the likelihood based on relative term frequencies. For that we'd raise the likelihood to the power of the term frequency (count), and the likelihood is the sum of the counts of that word in each doc in the class (+1), divided by the sum of the counts of all words in the class (+ the number of terms)
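    A sketch of that multinomial variant with the add-one smoothing described above - the data layout and function name are made up, and log probabilities are used to avoid underflow with long documents:

    ```php
    <?php
    // Multinomial naive bayes with add-one (Laplace) smoothing.
    // $classCounts[$class][$term] holds summed term counts over the
    // training docs of that class; $vocabSize is the number of
    // distinct terms; $doc maps term => count in the new document.
    function classifyMultinomial(array $doc, array $classCounts, int $vocabSize): string {
        $scores = [];
        foreach ($classCounts as $class => $counts) {
            $classTotal = array_sum($counts);
            $score = log(0.5); // assumed equal prior, as in the slides
            foreach ($doc as $term => $freq) {
                $count = isset($counts[$term]) ? $counts[$term] : 0;
                $likelihood = ($count + 1) / ($classTotal + $vocabSize);
                $score += $freq * log($likelihood); // power becomes multiply in log space
            }
            $scores[$class] = $score;
        }
        arsort($scores);
        return key($scores);
    }

    // Made-up counts: per-class summed term frequencies from training.
    $counts = [
        'spam' => ['viagra' => 8, 'meeting' => 1],
        'ham'  => ['viagra' => 1, 'meeting' => 8],
    ];
    $class = classifyMultinomial(['viagra' => 2], $counts, 2); // 'spam'
    ```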

  • To sum up, what we have here handles a wide variety of problems
    The first step is recognising that something is a classification problem
    - context-sensitive spelling correction
    - author identification
    - intrusion detection
    - determining genes in DNA sections
    Then you just need to extract features from the docs
    And apply a learner.
    I hope everyone keeps this in their mental toolbox for different kinds of challenges
  • Thanks to the people who put their photos on flickr under Creative Commons
    And also thanks to Lorenzo Alberton who gave me advice on this talk
  • Any questions?


  • 1. Document Classification In PHP @ianbarber -
  • 2. Document Classification - Defining The Task, Document Pre-processing, Term Selection, Algorithms
  • 3. What is Document Classification?
  • 4. Uses - Filter, Organise, Metadata
  • 5. Filtering - Binary Classification
  • 6. Organising - Single Label Classification
  • 7. Metadata - Multiple Label Classification
  • 8. Manual Rules Written By Domain Experts
  • 9. Machine Learning - Automatically Extract Rules
  • 10. Classes, Training Documents, Test Documents
  • 11. Evaluation (manual judgement across the top, classifier judgement down the side)
                  spam            ham
        spam      true positive   false positive
        ham       false negative  true negative
  • 12. Measures
        $accuracy = ($tp + $tn) / ($tp + $tn + $fp + $fn);
        $precision = $tp / ($tp + $fp);
        $recall = $tp / ($tp + $fn);
  • 13. Fβ Measure
        $beta = 0.5;
        $f = (($beta + 1) * $precision * $recall) /
             (($beta * $precision) + $recall);
  • 14. Vector Space Model - Bag Of Words
  • 15. Extract Tokens
        $doc = strtolower(strip_tags($doc));
        $regex = '/[^a-z0-9\s\']/';
        $doc = preg_replace($regex, '', $doc);
        $words = preg_split('/\s+/', $doc);
  • 16. A: I really like eggs   B: I donʼt like cabbage, and donʼt like stew
             i  really  like  eggs  cabbage  and  donʼt  stew
        A    1  1       1     1     0        0    0      0
        B    1  0       1     0     1        1    1      1
  • 17. [Graph: the document vectors plotted on axes 'really' and 'i']
  • 18. Term Weighting
        $tf = $termCount / $wordCount;
        $idf = log($totalDocs / $docsWithTerm, 2);
        $tfidf = $tf * $idf;
  • 19. A: I really like eggs   B: I donʼt like cabbage, and donʼt like stew
             i  really  like  eggs  cabbage  and    donʼt  stew
        A    0  0.25    0     0.25  0        0      0      0
        B    0  0       0     0     0.125    0.125  0.25   0.125
  • 20. A: I really like eggs   B: I donʼt like cabbage, and donʼt like stew
             i  really  like  eggs  cabbage  and    donʼt  stew
        A    0  0.5     0     0.5   0        0      0      0
        B    0  0       0     0     0.2      0.2    0.4    0.2
  • 21. Dimensionality Reduction
  • 22. Stop Words
  • 23. Stemming
        happening - happen
        happens - happen
        happened - happen
  • 24. Chi-Square
                  spam  ham
        term      $a    $b
        not term  $c    $d
  • 25. Chi-Square 1DF
        $a = $termSpam;  $b = $termHam;
        $c = $restSpam;  $d = $restHam;
        $total = $a + $b + $c + $d;
        $diff = ($a * $d) - ($c * $b);
        $chisquare = ($total * pow($diff, 2)) /
                     (($a+$c) * ($b+$d) * ($a+$b) * ($c+$d));
  • 26. p-Value
        p      chi²
        0.1    2.71
        0.05   3.84
        0.01   6.63
        0.005  7.88
        0.001  10.83
  • 27. Decision Tree - ID3 [tree diagram: term tests branching to leaf classes ✔ / ✖]
  • 28. Entropy
        $entropy = -(($spam/$total) * log($spam/$total, 2))
                   -(($ham/$total) * log($ham/$total, 2));
  • 29. [Graph: entropy (0 to 1) plotted against spam/total (0 to 1), peaking at spam/total = 0.5]
  • 30. Information Gain
        $gain = $baseEntropy
              - (($withCount / $total) * $withEntropy)
              - (($woutCount / $total) * $woutEntropy);
  • 31. Split          Entropy  Proportion  E*P
        Base 50/50     1        1           1
        With 20/5      0.722    0.25        0.1805
        Without 30/45  0.97     0.75        0.7275
        1 - With - Without = 0.092
  • 32. function build($tree, $score) {
            if(!$score[2]) {
                return 'spam';
            } else if(!$score[1]) {
                return 'ham';
            }
            list($trees, $scores, $term) = getMaxGain($tree);
            return array($term => array(
                0 => build($trees[0], $scores[0]),
                1 => build($trees[1], $scores[1])
            ));
        }
  • 33. array('hello' => array(
            0 => array('terry' => array(
                0 => 'spam',
                1 => array('everybody' => array(
                    0 => 'ham',
                    1 => 'spam'
                ))
            )),
            1 => 'spam'
        ));
  • 34. Classification
        function classify($doc, $tree) {
            if(is_string($tree)) { return $tree; }
            $key = key($tree);
            if(in_array($key, $doc)) {
                return classify($doc, $tree[$key][0]);
            } else {
                return classify($doc, $tree[$key][1]);
            }
        }
  • 35. Overfitting: Pruning or Stop Conditions
  • 36. K Nearest Neighbour
  • 37. [Scatter plot: spam and ham documents plotted by their Term X and Term Y weights]
  • 38. [Scatter plot: a new, unclassified document among the existing ones]
  • 39. [Scatter plot: the new document with its three nearest neighbours]
  • 40. Euclidean Distance
        foreach($doca as $term => $tfidf) {
            $distance += pow($tfidf - $docb[$term], 2);
        }
        $distance = sqrt($distance);
  • 41. Cosine Similarity
        foreach($doca as $term => $tfidf) {
            $similarity += floatval($tfidf) * floatval($docb[$term]);
        }
  • 42. Classifying
        // vote by count
        foreach($scores as $s) { $classes[$s['class']]++; }
        // or vote weighted by similarity
        foreach($scores as $s) { $classes[$s['class']] += $s['sim']; }
        arsort($classes);
        $class = key($classes);
  • 43. Zend_Search_Lucene
        $index = Zend_Search_Lucene::create($db);
        $doc = new Zend_Search_Lucene_Document();
        $doc->addField(Zend_Search_Lucene_Field::Text(
            'class', $class));
        $doc->addField(Zend_Search_Lucene_Field::UnStored(
            'contents', $content));
        $index->addDocument($doc);
  • 44. Classifying with ZSL
        Zend_Search_Lucene::setResultSetLimit(25);
        $results = $index->find($content);
        foreach($results as $result) {
            $classes[$result->class] += 1;
        }
        arsort($classes);
        $class = key($classes);
  • 45. Flax/Xapian Search Service
  • 46. $flax = new FlaxSearchService('ip:8080');
        $db = $flax->createDatabase('test');
        $db->addField('class', array(
            'store' => true, 'exacttext' => true));
        $db->addField('contents', array(
            'store' => false,
            'freetext' => array('language' => 'en')));
        $db->commit();
        $db->addDocument(array(
            'class' => $class, 'contents' => $document));
        $db->commit();
  • 47. $db->addDocument(array('contents' => $doc), 'foo');
        $db->commit();
        $results = $db->searchSimilar('foo', 0, 25);
        $db->deleteDocument('foo');
        $db->commit();
        foreach($results['results'] as $r) {
            if($r['docid'] != 'foo') {
                $classes[$r['data']['class'][0]] += 1;
            }
        }
        arsort($classes);
        $class = key($classes);
  • 48. [Scatter plot: spam and ham documents plotted by their Term X and Term Y weights]
  • 49. Prototypes For Rocchio
        $mul = 1 / $docsInClassCount;
        foreach($classDocs as $tid => $tfidf) {
            $prototype[$tid] += $mul * $tfidf;
        }
  • 50. Naive Bayes - Probability Based Classifier
  • 51. Bayes Theorem
        Pr(Class|Doc) = Pr(Doc|Class) * Pr(Class) / Pr(Doc)
        Pr(Class|Doc) ∝ Pr(Doc|Class) * Pr(Class)
  • 52. Likelihood Of Term Occurring Given Class
        word      spam freq  pr(word|spam)  ham freq  pr(word|ham)
        register  1757       0.11           246       0.02
        sent      487        0.03           4600      0.36
  • 53. Estimating Likelihood
        $this->db->query("
            INSERT INTO class_terms (class, term, likelihood)
            SELECT d.class, d.term, count(*) / " . $classCount . "
            FROM documents AS d
            JOIN document_terms AS dt USING (did)
            WHERE d.class = '" . $class . "'");
  • 54. Classifying A Document
        foreach($classes as $class) {
            $prob[$class] = 0.5; // assume prior
            foreach($document as $term) {
                $prob[$class] *= $likely[$term][$class];
            }
        }
        arsort($prob);
        $class = key($prob);
  • 55. Document Classification - Defining The Problem, Document Processing, Term Selection, Algorithm
  • 56. Image Credits Title What is... Filter Organise Metadata Manual Automatic Vector Space Reduction Stemming Stop words Chi-Squared ID3 Overfitting Bayes Conclusion Credits
  • 57. Questions? @ianbarber -