Note: slightly updated version of these slides are: http://www.slideshare.net/IanBarber/document-classification-in-php-slight-return
This talk discusses how PHP and open source tools can be used to group and classify data for a whole host of applications, including information retrieval, data mining and more.
19. A: I really like eggs
    B: I don't like cabbage, and don't like stew

         i    really  like  eggs  cabbage  and    don't  stew
    A    0    0.25    0     0.25  0        0      0      0
    B    0    0       0     0     0.125    0.125  0.25   0.125

20. The same two documents with normalised weights:

         i    really  like  eggs  cabbage  and    don't  stew
    A    0    0.5     0     0.5   0        0      0      0
    B    0    0       0     0     0.2      0.2    0.4    0.2
This is just a quick overview of what we’ll be talking about today
Lots of Python/Java classifiers out there, but no PHP ones
Is it too hardcore? No, the algorithms are easy. Widely applicable.
So what is it? Assigning documents labels from a predefined set.
Labels can be anything - topic words, non-topic words, metadata whatever
Documents in this case is text, web pages, emails, books
But it can be really anything as long as you can extract features from it
Classification is really the organising of information - we do it every day
Lots of uses; these are the main ones, in my view.
Might do all three with uploading photos to flickr or facebook
Filter, get rid of bad ones.
Organise, upload to album or set
Tag photos with people in them etc.
Filtering is Class OR Not Class - generally you then hide or remove one lot
Binary classification - you can break most things down into a series of binary decisions
In flickr example, what is good?
- photographer, composition, light etc.
- some people, friends look good
- some people, friends look bad
Organising is putting document in one place - one label chosen from a set of many possible
Single label only (often EXACTLY 1, 0 not allowed)
Folders, albums, libraries, handwriting recognition
Tagging, can have multiple, often 0 - many labels
Often for tagging topics in content
E.g. a US-China WTO trade story might be filed under US, China and Trade
In the 80s, people would come up with rules
Then computers would apply rules
IF this word AND this WORD then this category
Took a lot of time
Needed knowledge engineer to get knowledge out of expert into rules
Didn’t scale, needed more experts for new categories
Subjective - experts disagree
Usually result was 60%-90% accurate
Machine Learning people said - ‘look at data’ - Supervised Learning
Work out rules based on manually classified examples
Scales better, is cheaper, and about as accurate!
Only need people to make examples, don’t have to be able to explain their process
Look at the picture - it’s easy to see from the groupings what the ‘rule’ for classifying M&Ms is
So what do you need?
1. the classes to classify to
2. A set of manually classified documents to train the classifier on
3. A set of manually classified docs to test on
In some cases may have a third set of docs for validation
So how do we test?
Run the test docs through, and compare manual to automatic judgements
Here we’ve got a binary classification, for a spam checker
Across the top is the manual judgement; down the side is the classifier’s judgement
Boxes will just be counts of judgements
Some classifiers give a graded result, some give a yes/no result.
For graded, we might take the top N judgements, or have a threshold they must achieve
Either way, in the end we get down to a judgement
With that we can calculate some numbers
Accuracy is just correct percentage
- not always useful, as we sometimes prefer one kind of error over another, e.g. false negatives over false positives with spam
Precision measures how tight our grouping is
- how much can we trust a positive result being really positive
Recall measures what percentage of the available positives we capture
You can have one without the other,
if you reject all but the ones you’re most sure about, you get good precision
if you mark everything positive, you get great recall
Because of the balance between recall and precision, researchers often quote the breakeven point
This is just where recall and precision are equal
F is a more advanced measure, measuring the overlap between the two sets
F-Beta just allows weighing precision more than recall, or vice versa.
If beta = 0.5, recall is half as important as precision, such as with spam checker
If beta = 1, then both are equally important
There is also an E measure, which is just its complement: 1 - F measure
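These measures are easy to compute. A quick Python sketch (the spam-checker counts are made up for illustration):

```python
def precision(tp, fp):
    # How much can we trust a positive result being really positive?
    return tp / (tp + fp)

def recall(tp, fn):
    # What percentage of the available positives did we capture?
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    # Weighted harmonic mean of precision and recall.
    # beta < 1 favours precision, beta > 1 favours recall.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# Hypothetical spam checker: flagged 40 messages, 30 correctly,
# while missing 10 real spams.
p = precision(tp=30, fp=10)
r = recall(tp=30, fn=10)
print(p, r, f_beta(p, r, beta=1.0))
```

With beta = 1 and equal precision and recall, the F measure equals both of them, as you'd expect from the breakeven point.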
Before we do classifying, we need to choose a way to represent text for some classifiers - indexing
All this work is classic Information Retrieval
Bag of Words is so called because we discard the structure, and just note down appearances of words
Throw away the ordering, any structure at all from web pages etc.
First we have to get the words
We can use a variety of methods for extracting tokens
About the simplest would probably be something like this
We dump all punctuation, everything but basic letters, and split on whitespace.
For email, Pear::Mail_mimeDecode is good for extracting the message body
We then represent each document as an array, where keys are all terms from all docs
And values are whether that particular term present in this particular document
This is the document vector
Here is the collection of these two phrases as a vector.
1 if the word is in the document, 0 if not
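The tokenising and binary document vectors above can be sketched in Python (a simple tokeniser as described: dump punctuation, keep basic letters, split on whitespace):

```python
import re

def tokenize(text):
    # Dump all punctuation, keep only basic letters and spaces,
    # lowercase, and split on whitespace.
    return re.sub(r"[^a-z\s]", "", text.lower()).split()

def binary_vectors(docs):
    # Keys are all terms from all docs; values are 1 if that term
    # is present in this particular document, 0 if not.
    vocab = sorted({t for d in docs for t in tokenize(d)})
    vecs = [[1 if t in tokenize(d) else 0 for t in vocab] for d in docs]
    return vocab, vecs

vocab, vecs = binary_vectors(["I really like eggs",
                              "I don't like cabbage, and don't like stew"])
print(vocab)
print(vecs)
```

Note the apostrophe gets stripped, so "don't" becomes the term "dont" - crude, but repeatable.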
You can plot the documents on a graph
Here the green circle is A the red triangle B
I’ve bounced up 0 on the graph just to keep it away from the value labels
So our previous document would actually be a point in 8 dimensional space
As we have 8 terms
Simple enough, but what we really want to do is capture a bit more information - a position on each axis
So instead of storing just presence, we store ‘weight’, the value of the term
TFIDF is a classic and very common weight - there are a lot of variations though
TF is just percentage of document composed of term
IDF is (the log of) the number of docs divided by the number containing the term
Gives less common terms a higher weight
So the best score goes to an uncommon term that appears a lot in this document
If we weight our previous example this way
the IDF means that ‘i’ and ‘like’ actually disappear here, as they are in all docs
Normally that wouldn’t quite happen! But it shows they have no value for telling the documents apart
‘Don’t’ gets weighted higher, as it appears twice.
We’d then usually normalise this, to unit length, to account a bit for doc length differences
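A Python sketch of the TF-IDF weighting described above. I'm normalising here so each doc's weights sum to one, which reproduces the 0.5 / 0.4 / 0.2 figures on the earlier slide; unit-length (L2) normalisation is the other common choice:

```python
import math

def tf_idf(docs):
    # docs are lists of tokens; returns a {term: weight} dict per doc.
    n = len(docs)
    df = {}                                   # docs containing each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    out = []
    for doc in docs:
        w = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)   # fraction of doc that is this term
            idf = math.log2(n / df[term])     # rarer terms weigh more
            w[term] = tf * idf
        total = sum(w.values()) or 1.0
        out.append({t: v / total for t, v in w.items()})  # normalise for length
    return out

a = "i really like eggs".split()
b = "i dont like cabbage and dont like stew".split()
wa, wb = tf_idf([a, b])
# 'i' and 'like' appear in every doc, so idf = log2(2/2) = 0 and they vanish
print(wa, wb)
```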
There are unnecessary terms here though: ‘i’ and ‘like’
Most algorithms look at all terms, so the increase number of term dimensions can be a problem
The number of dimensions is the whole vocabulary - every word that’s been seen in any document
DR or term space reduction is all about removing terms that don’t contribute much
This can often be by a factor of 10 or 100!
May have heard of stop words
Common in search engines of old
Words like ‘of’ ‘the’ ‘an’ - little to no semantic value to us
Can use a list of words, or infer it from low idf scores
Which would also pick up ‘collection’ stop words that are not necessarily english stop words
E.g. if you were classifying documents about Pokemon, the word ‘pokemon’ would probably appear very frequently, and be of little value
Stemming: try to come up with a ‘root’ word
Maps lots of different variations onto one term, reducing dimensions
Result is usually not english, it’s just repeatable
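A toy illustration in Python - this is just a crude suffix-stripper, not a real stemmer like Porter's algorithm, but it shows the idea of mapping lots of variations onto one repeatable, not-necessarily-English root:

```python
def crude_stem(word):
    # Strip a few common English suffixes, keeping at least a
    # three-letter root so short words survive untouched.
    for suffix in ("ing", "ers", "er", "ies", "es", "s", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# 'classifiers' and 'classified' both map to the same non-English
# root 'classifi' - not a word, but repeatable, which is what matters
print(crude_stem("classifiers"), crude_stem("classified"))
```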
Chi-square - chi as in the Greek letter, not Chinese
Statistical technique - this is an example of one, but there are many, odds ratio, information gain etc.
Keep only terms which are indicative of one class over another
We count up the four values - like the truth table from before
How many spam docs contain term etc.
Looks for importance of term by class by seeing the difference between expected and actual scores
Expected value for a cell is (row total x column total) / grand total
Then we look at the square of the difference, divided by the expected value
And add all them up
We plug the numbers into this formula, which is a one step way of doing the same thing
Comes out with a number which isn’t particularly interesting absolutely
But is interesting relatively
we can calculate a probability of the events being unrelated using the area from this distribution
The statistic has one degree of freedom, because for a 2x2 table df = (rows - 1)(cols - 1) = 1
Can work out the probability number from a chi-square distribution
But for DR, can just use a threshold and remove words with less than that threshold
P is the chance that variables are independent - so for > 10.83 we are 99.9% certain the variables are dependent, one changes with the other
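The one-step formula for a 2x2 table is chi^2 = N(AD - CB)^2 / ((A+B)(C+D)(A+C)(B+D)). A Python sketch, with made-up spam counts:

```python
def chi_square(a, b, c, d):
    # One-step chi-square for the 2x2 term/class truth table:
    #   a = class docs with the term,    b = other docs with the term
    #   c = class docs without the term, d = other docs without the term
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

# Hypothetical counts: a term appears in 90 of 100 spam docs
# but only 5 of 100 ham docs.
score = chi_square(a=90, b=5, c=10, d=95)
print(score)  # well above 10.83, so term and class are dependent
```

For dimensionality reduction you'd just keep the terms whose score passes your threshold.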
OK, so we’ve got a good set of data, now we need a classifier
Series of term present/not present questions branches in tree
Eventually ending in leaf classification nodes - this is a ‘yes or no’ result; there is no grading of similarity
Easy to classify, and building algorithm pretty easy
Recursive
If the whole collection is one class, make a leaf for that class
Else, choose the best term to split on, and recurse on each branch
But how does it determine best?
Calculate entropy
- section could be repeated for multiple classes
Basically represents how many bits needed to encode the result of a random sequence given this split
Easier to see on graph
If 0 or 1, the sequence is all the same class, so no bits
If 0.5 it’s 50/50 so you need 1 bit to encode each
If the mix is less even than that, you can use shorter codes for whichever of spam or ham is more common
And longer codes for the less common one, so the average bits per item is lower
Combine by looking for maximum information gain
Entropy of current set minus the weighted entropies of the two new sets
Final col is just entropy times proportion
For example, in this example the split looks pretty good
The ‘with term’ branch is very biased one way
But because it’s smaller the information gain isn’t massive
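In Python, the entropy and information gain calculations look something like this (the counts are illustrative):

```python
import math

def entropy(pos, neg):
    # Bits needed per item to encode a sequence with this class mix.
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def info_gain(with_term, without_term):
    # Entropy of the whole set minus the weighted entropies of the
    # two subsets produced by splitting on the term.
    # Each argument is a (positive, negative) count pair.
    wp, wn = with_term
    op, on = without_term
    total = wp + wn + op + on
    before = entropy(wp + op, wn + on)
    after = ((wp + wn) / total) * entropy(wp, wn) \
          + ((op + on) / total) * entropy(op, on)
    return before - after

print(entropy(5, 5))    # 50/50 needs one bit per item
print(entropy(10, 0))   # all one class, no bits needed
print(info_gain((9, 1), (5, 5)))
```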
Easy to implement recursive builder
Gives us a tree in array format, which we could save by serialising
Just need to traverse to classify
A completely made up example of an output tree.
Millions of ways to do this, of course
Simple function to return leaf node
Assumes document is as array of words
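A minimal Python sketch of the recursive builder and traversal described above, using binary word-presence features and made-up training docs (a real version would also want the minimum-gain stop condition or pruning mentioned below):

```python
import math

def class_entropy(labels):
    # Entropy of a list of class labels.
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / len(labels)
        h -= p * math.log2(p)
    return h

def build_tree(docs, labels):
    # docs: list of word-sets; labels: their manual classifications.
    if len(set(labels)) == 1:
        return labels[0]                          # pure: leaf node
    best_term, best_gain = None, 0.0
    for term in set().union(*docs):               # pick max info gain split
        yes = [l for d, l in zip(docs, labels) if term in d]
        no = [l for d, l in zip(docs, labels) if term not in d]
        if not yes or not no:
            continue
        gain = class_entropy(labels) \
             - (len(yes) / len(labels)) * class_entropy(yes) \
             - (len(no) / len(labels)) * class_entropy(no)
        if gain > best_gain:
            best_term, best_gain = term, gain
    if best_term is None:                         # nothing splits: majority leaf
        return max(set(labels), key=labels.count)
    pairs = list(zip(docs, labels))
    yes_pairs = [(d, l) for d, l in pairs if best_term in d]
    no_pairs = [(d, l) for d, l in pairs if best_term not in d]
    return {"term": best_term,
            "yes": build_tree(*map(list, zip(*yes_pairs))),
            "no": build_tree(*map(list, zip(*no_pairs)))}

def classify(tree, doc):
    # Traverse term-present/term-absent branches until we hit a leaf.
    while isinstance(tree, dict):
        tree = tree["yes"] if tree["term"] in doc else tree["no"]
    return tree

docs = [{"buy", "pills"}, {"cheap", "pills"},
        {"meeting", "monday"}, {"lunch", "monday"}]
tree = build_tree(docs, ["spam", "spam", "ham", "ham"])
print(classify(tree, {"pills", "cheap"}))  # spam
```

The nested dict plays the role of the array-format tree, which you could save by serialising.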
Problem: if you go right to the end, the tree will probably be too specific to the training data
Stop condition - min info gain - or pruning
Use a separate ‘validation’ set to test effectiveness of tree at different depths
Choose most effective
DTs generate human interpretable rules - very handy
BUT expensive to train, don’t handle loads of dimensions well, and often require rebuilding
KNN is much cheaper at training time - as there is no training
Uses the fact that we can regard these as vectors in a N-dimensional space
Lets consider only 2 terms, here we have documents displayed with their weights in terms X and Y
Documents of class triangle and class circle
They seem to have a spatial cluster
We can work out the class of the new one by looking at its nearest neighbours
The K is how many we look for
In this case K is three, and the nearest three, as you can see, are all green circles.
Choosing K is kind of hard, you might try a few different values but it’s usually in the 10-30 doc range
Only real challenge is comparing documents
Here we can see we are looking at just the X and Y distance, this is the euclidean distance
Very easy - simply looking at the difference between one vector and the other
Can actually do the whole thing in the database!
But, has some problems, so more common...
An alternative measure, cosine similarity: it goes to 1 for identical, 0 for orthogonal, -1 for opposite
Easy to do with normalised vectors - just take dot product
Covers some cases euclidean is less good at
We’ve got two options when classifying - can count most common as in first loop
But this system gives us a grading of matching, the distance
Or we weight on how similar they are - on the assumption the best matches are most indicative
here we’re just adding the similarity, the closer the match the higher the value
Could get much more fancy with weighting schemes of course
In multi-label classification we might take any class that gets over a certain weight, in fact.
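Putting those pieces together, a small Python sketch of weighted-vote KNN with cosine similarity (the training vectors and labels are invented):

```python
import math

def cosine(a, b):
    # Dot product over the product of lengths: 1 for identical
    # direction, 0 for orthogonal.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(training, new_doc, k=3):
    # training: list of (weight_dict, label) pairs.
    # Weighted vote: add each neighbour's similarity to its class
    # score, so the closest matches count for more.
    neighbours = sorted(training,
                        key=lambda ex: cosine(new_doc, ex[0]),
                        reverse=True)[:k]
    scores = {}
    for vec, label in neighbours:
        scores[label] = scores.get(label, 0.0) + cosine(new_doc, vec)
    return max(scores, key=scores.get)

training = [({"eggs": 1.0, "like": 0.5}, "food"),
            ({"stew": 1.0, "cabbage": 0.8}, "food"),
            ({"php": 1.0, "code": 0.7}, "tech"),
            ({"code": 1.0, "java": 0.6}, "tech")]
print(knn_classify(training, {"php": 1.0, "java": 0.3}, k=3))  # tech
```

Counting the most common class instead of summing similarities is the simpler first option mentioned above.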
But it’s still a bit of a pain to do in PHP - you have to compare against every training document
Lots of ways to optimise, because search engines do a very similar job, similarity wise
Why not use one?
Search engines are usually not designed to take whole documents as queries
So, some fudging needed, like looking at only subject lines
Not necessarily great results, but very easy to implement
Good for twitter, or shorter applications perhaps
Just implementing K using the result limit
Will also want to replace ? and * characters
Or could add terms through the API
Still, a bit of a sketchy classifier
Flax is based on the open source Xapian engine, kind of like their Solr
Has a similarity search that makes KNN ridiculously easy and very effective
The version with the PHP client is in SVN trunk at the moment, but is stable
This code creates a database, adds two fields to it, and indexes a document
Very similar to the Lucene loop
Except we add then remove a document to use similarity feature
Gets good accuracy and is pretty fast.
However, if we want to use this kind of technique and don’t have a flax handy,
there is another related technique
Instead of taking each value and comparing it
We take the *average* of all the documents in each class
And compare against that
This works surprisingly well!
Here we compute the centroid of each class
By summing the weights, and multiplying by 1/the count.
You might do this in the database, pretty straightforward op.
Called a Rocchio classifier because it’s based on a relevance feedback technique by Rocchio
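A Python sketch of the centroid approach (the class labels and weights are invented; cosine similarity is used for the comparison):

```python
import math

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    # Sum the weights for each term, then multiply by 1/count -
    # the average of all the documents in the class.
    totals = {}
    for vec in vectors:
        for term, w in vec.items():
            totals[term] = totals.get(term, 0.0) + w
    return {t: w / len(vectors) for t, w in totals.items()}

def rocchio_classify(classes, doc):
    # classes: {label: [weight_dict, ...]}. Compare the new doc
    # against one centroid per class, not every training document.
    cents = {label: centroid(vecs) for label, vecs in classes.items()}
    return max(cents, key=lambda label: cosine(doc, cents[label]))

classes = {"food": [{"eggs": 1.0}, {"stew": 1.0, "eggs": 0.5}],
           "tech": [{"php": 1.0}, {"code": 1.0}]}
print(rocchio_classify(classes, {"eggs": 1.0, "code": 0.2}))  # food
```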
Quick and easy probability based classifier
Very commonly used in spam checking
Naive assumption is that words are independent - which is clearly not true
Means that we don’t need an example for each combination of attributes, which is very helpful for docs!
Bayes is good at very high dimensionality because of this
Take this slow!!
Read the pipe as ‘given’, pr as probability of
All classes are using the same doc, and since we only care about most likely, we can drop that bit
Prob of class is easy, can either work it out as a likelihood or just assume 0.5 (for binary)
So we just have to work out the probability of the document given the class, which we can treat as the product of the likelihoods of its terms occurring given the class
We can look at the data itself to calculate the term likelihoods
Simply looking at the conditional probability, the number of times that the
term occurs along with the given class divided by the total appearances of that class
We can calculate it in a SQL query if storing the data.
Assuming we’ve stored the total count in the class count, and the class in class
The independence assumptions lets us treat that as the product of the probabilities of each individual
term given class.
Here we calculate it by looping over the terms in a doc, and multiply by the prior probability - probably 0.5.
This is multi-Bernoulli Bayes. There is also a multinomial version, which calculates likelihood based on relative term frequencies. For that, we’d raise each term’s likelihood to the power of its term frequency (count), and the likelihood of a term is the sum of the counts of that word in each doc in the class (+1), divided by the sum of the counts of all words in the class (+ number of terms).
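A Python sketch of the multi-Bernoulli version, using logs to avoid underflow, a prior taken from the data, and add-one smoothing so an unseen term doesn't zero out the whole product (a common tweak, assumed here; the docs and labels are made up):

```python
import math

def train(docs, labels):
    # Multi-Bernoulli Naive Bayes. Likelihood of a term given a class
    # is (docs in class containing term) / (docs in class).
    # docs are sets of words.
    model = {}
    for label in set(labels):
        class_docs = [d for d, l in zip(docs, labels) if l == label]
        counts = {}
        for d in class_docs:
            for t in d:
                counts[t] = counts.get(t, 0) + 1
        model[label] = (len(class_docs), counts)
    return model, len(docs)

def nb_classify(model_total, doc):
    model, total = model_total
    best, best_lp = None, float("-inf")
    for label, (n, counts) in model.items():
        lp = math.log(n / total)          # prior worked out from the data
        for t in doc:                     # independence: just multiply
            lp += math.log((counts.get(t, 0) + 1) / (n + 2))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [{"buy", "pills", "cheap"}, {"cheap", "pills"},
        {"meeting", "monday"}, {"lunch", "monday", "meeting"}]
nb = train(docs, ["spam", "spam", "ham", "ham"])
print(nb_classify(nb, {"cheap", "pills", "now"}))  # spam
```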
To sum up, what we have here handles a wide variety of problems
The first step is recognising that something is a classification problem
- context-sensitive spelling correction
- author identification
- intrusion detection
- determining genes in DNA sections
Then you just need to extract features from the docs
And apply a learner.
Hope that everyone has this in their mental toolbox for different kinds of challenges
Thanks to the people who put their photos on flickr under Creative Commons
And also thanks to Lorenzo Alberton who gave me advice on this talk