Note: a slightly updated version of these slides is at http://www.slideshare.net/IanBarber/document-classification-in-php-slight-return
This talk discusses how PHP and open source tools can be used to group and classify data for a whole host of applications, including information retrieval, data mining and more.
Is it too hardcore? No, algorithms are easy. Widely applicable.
So what is it? Assigning documents labels from a predefined set.
Labels can be anything - topic words, non-topic words, metadata whatever
Documents in this case are text, web pages, emails, books
But it can be really anything as long as you can extract features from it
Lots of uses; these are the main ones, in my view.
Might do all three with uploading photos to flickr or facebook
Filter, get rid of bad ones.
Organise, upload to album or set
Tag photos with people in them etc.
Binary classification - can break down most things to a series of binary decisions
In flickr example, what is good?
- photographer, composition, light etc.
- some people, friends look good
- some people, friends look bad
Single label only (often EXACTLY 1, 0 not allowed)
Folders, albums, libraries, handwriting recognition
Often for tagging topics in content
E.g. a US-China embargo WTO talk might be filed under US, China, Trade
Then computers would apply rules
IF this word AND this WORD then this category
Took a lot of time
Needed knowledge engineer to get knowledge out of expert into rules
Didn’t scale, needed more experts for new categories
Subjective - experts disagree
Usually result was 60%-90% accurate
Work out rules based on manually classified examples
Scales better, is cheaper, and about as accurate!
Only need people to make examples, don’t have to be able to explain their process
Look at the picture, it’s easy to see by looking at the groupings what the ‘rule’ for classifying m&ms is
1. The set of classes to classify into
2. A set of manually classified documents to train the classifier on
3. A set of manually classified docs to test on
In some cases may have a third set of docs for validation
Run the test docs through, and compare manual to automatic judgements
Here we’ve got a binary classification, for a spam checker
The top axis is the manual judgement, the vertical axis is the classifier judgement
Boxes will just be counts of judgements
Some classifiers give a graded result, some give a yes/no result.
For graded, we might take the top N judgements, or have a threshold they must achieve
Either way, in the end we get down to a judgement
Accuracy is just correct percentage
- not always useful, as we sometimes prefer one kind of error, e.g. FN over FP with spam
Precision measures how tight our grouping is
- how much can we trust a positive result being really positive
Recall measures what percentage of the available positives we capture
You can have one without the other,
if you reject all but the ones you’re most sure about, you get good precision
if you mark everything positive, you get great recall
This is just where recall and precision are equal
F is a more advanced measure, measuring the overlap between the two sets
F-Beta just allows weighing precision more than recall, or vice versa.
If beta = 0.5, recall is half as important as precision, such as with spam checker
If beta = 1, then both are equally important
There is also an E measure which is just its complement, 1 - F measure
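The notes don't include the slide code, so here's a minimal sketch of the measures above in PHP, with tp/fp/fn/tn being the four confusion-matrix counts:

```php
<?php
// Accuracy: the percentage of all judgements that were correct.
function accuracy(int $tp, int $fp, int $fn, int $tn): float {
    return ($tp + $tn) / ($tp + $fp + $fn + $tn);
}
// Precision: how much we can trust a positive result.
function precision(int $tp, int $fp): float {
    return $tp / ($tp + $fp);
}
// Recall: what percentage of the available positives we captured.
function recall(int $tp, int $fn): float {
    return $tp / ($tp + $fn);
}
// F-beta: beta < 1 weighs precision more, beta > 1 weighs recall more,
// beta = 1 (the F1 measure) treats them as equally important.
function fBeta(float $p, float $r, float $beta = 1.0): float {
    $b2 = $beta * $beta;
    return (1 + $b2) * $p * $r / ($b2 * $p + $r);
}
```

For a spam checker you might use fBeta with beta = 0.5, since false positives (good mail lost) hurt more than false negatives.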
All this work is classic Information Retrieval
Bag of Words is so called because we discard the structure, and just note down appearances of words
Throw away the ordering, any structure at all from web pages etc.
We can use a variety of methods for extracting tokens
About the simplest would probably be something like this
We dump all punctuation, everything but basic letters, and split on whitespace.
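The actual slide code isn't in these notes, but a minimal tokenizer along those lines might look like this:

```php
<?php
// Simplest possible tokenizer: lowercase, dump all punctuation and
// anything that isn't a basic letter, then split on whitespace.
function tokenize(string $text): array {
    $text = strtolower($text);
    $text = preg_replace('/[^a-z\s]/', ' ', $text);
    return preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}
```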
For email, Pear::Mail_mimeDecode is good for extracting the message body
We then represent each document as an array, where keys are all terms from all docs
And values are whether that particular term present in this particular document
This is the document vector
1 if the word is in the document, 0 if not
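As a sketch, building that presence/absence document vector in PHP:

```php
<?php
// Build a binary document vector: keys are all terms across all docs,
// values are 1 if the term appears in this document, 0 if not.
function documentVector(array $docTerms, array $allTerms): array {
    $vector = [];
    foreach ($allTerms as $term) {
        $vector[$term] = in_array($term, $docTerms) ? 1 : 0;
    }
    return $vector;
}
```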
Here the green circle is class A, the red triangle class B
I’ve bounced up 0 on the graph just to keep it away from the value labels
So our previous document would actually be a point in 8 dimensional space
As we have 8 terms
Simple enough, but what we really want to do is capture a bit more information - a position on each axis
So instead of storing just presence, we store ‘weight’, the value of the term
TF is just percentage of document composed of term
IDF is number of docs divided by number with term
Gives less common terms a higher weight
So best is uncommon term that appears a lot
If we weight the terms of our previous example by this
Normally that wouldn’t quite happen! But it shows they have no value to the document, so they don’t get weighted highly.
We’d then usually normalise this, to unit length, to account a bit for doc length differences
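A sketch of TF-IDF weighting plus unit-length normalisation in PHP (one assumption here: I take the log of the docs/docs-with-term ratio, the usual refinement of the raw IDF described above):

```php
<?php
// TF-IDF weight each term of $doc against the collection $docs,
// then normalise the vector to unit length to account for doc length.
// Assumes $doc is one of the docs in $docs, so df is always >= 1.
function tfIdfVector(array $doc, array $docs): array {
    $total  = count($doc);
    $vector = [];
    foreach (array_count_values($doc) as $term => $count) {
        // Document frequency: how many docs contain the term.
        $df = 0;
        foreach ($docs as $other) {
            if (in_array($term, $other)) { $df++; }
        }
        // TF (share of the doc) times IDF (log of docs / docs-with-term).
        $vector[$term] = ($count / $total) * log(count($docs) / $df);
    }
    // Normalise to unit length.
    $lengthSq = 0.0;
    foreach ($vector as $w) { $lengthSq += $w * $w; }
    if ($lengthSq > 0) {
        $length = sqrt($lengthSq);
        foreach ($vector as $term => $w) { $vector[$term] = $w / $length; }
    }
    return $vector;
}
```

Note how a term in every document gets IDF log(1) = 0 - exactly the stop-word effect described below.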
Most algorithms look at all terms, so the increasing number of term dimensions can be a problem
DR or term space reduction is all about removing terms that don’t contribute much
This can often be by a factor of 10 or 100!
Common in search engines of old
Words like ‘of’ ‘the’ ‘an’ - little to no semantic value to us
Can use a list of words, or infer it from low idf scores
Which would also pick up ‘collection’ stop words that are not necessarily English stop words
E.g. if you were classifying documents about pokemon, the word ‘pokemon’ would probably appear very frequently, and be of little value
Maps lots of different variations onto one term, reducing dimensions
Result is usually not english, it’s just repeatable
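Just to illustrate the idea (this toy suffix-stripper is mine, not a real stemmer - in practice you'd use a Porter stemmer implementation or the PECL stem extension):

```php
<?php
// Toy stemmer: strip a few common suffixes so variations of a word
// map onto one repeatable (and not necessarily English) term.
function toyStem(string $word): string {
    foreach (['ing', 'ed', 'ies', 'es', 's'] as $suffix) {
        if (strlen($word) > strlen($suffix) + 2
                && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, -strlen($suffix));
        }
    }
    return $word;
}
```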
Statistical technique - this is an example of one, but there are many, odds ratio, information gain etc.
Keep only terms which are indicative of one class over another
We count up the four values - like the truth table from before
How many spam docs contain term etc.
Looks for importance of term by class by seeing the difference between expected and actual scores
Expected value for a cell is its row total multiplied by its column total, divided by the grand total
Then we look at the square of the difference, divided by the expected value
And add all them up
Comes out with a number which isn’t particularly interesting absolutely
But is interesting relatively
we can calculate a probability of the events being unrelated using the area from this distribution
The statistic has 1 degree of freedom, because there is one variable and one dependent
But for DR, can just use a threshold and remove words with less than that threshold
P is the chance that variables are independent - so for > 10.83 we are 99.9% certain the variables are dependent, one changes with the other
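The chi-square calculation above, sketched in PHP for one term/class pair:

```php
<?php
// Chi-square score from a 2x2 contingency table for a term and a class:
// [[term & class, term & other], [no-term & class, no-term & other]]
function chiSquare(array $table): float {
    $rowTotals = [array_sum($table[0]), array_sum($table[1])];
    $colTotals = [$table[0][0] + $table[1][0], $table[0][1] + $table[1][1]];
    $total = $rowTotals[0] + $rowTotals[1];
    $chi = 0.0;
    for ($i = 0; $i < 2; $i++) {
        for ($j = 0; $j < 2; $j++) {
            // Expected count: row total * column total / grand total.
            $expected = $rowTotals[$i] * $colTotals[$j] / $total;
            // Square of the difference, divided by the expected value.
            $chi += pow($table[$i][$j] - $expected, 2) / $expected;
        }
    }
    return $chi;
}
```

For DR you'd compute this for every term and drop those under a threshold such as 10.83.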
OK, so we’ve got a good set of data, now we need a classifier
Eventually ending in leaf classification nodes - this is a ‘yes or no’ result, there is no grading of similarity
Easy to classify, and building algorithm pretty easy
Recursive
If all the documents in the collection share one class, make a leaf with that class
Else, choose the best term to split on, and recurse on each branch
But how does it determine best?
- section could be repeated for multiple classes
Basically represents how many bits needed to encode the result of a random sequence given this split
Easier to see on graph
If 0.5 it’s 50/50 so you need 1 bit to encode each
If less than that, you can use shorter codes for more common spam or ham
And longer for less common, so average bit per item is lower
Entropy of current set minus the weighted entropies of the two new sets
For example, in this example the split looks pretty good
The ‘with term’ branch is very biased one way
But because it’s smaller the information gain isn’t massive
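A sketch of entropy and information gain for a binary (spam/ham) split, as described above:

```php
<?php
// Entropy, in bits, of a set given as [spam count, ham count].
function entropy(array $counts): float {
    $total = array_sum($counts);
    $h = 0.0;
    foreach ($counts as $count) {
        if ($count > 0) {
            $p = $count / $total;
            $h -= $p * log($p, 2);  // base-2 log: bits needed to encode
        }
    }
    return $h;
}

// Information gain: entropy of the current set minus the
// size-weighted entropies of the two new sets after the split.
function informationGain(array $parent, array $with, array $without): float {
    $total = array_sum($parent);
    return entropy($parent)
        - (array_sum($with) / $total) * entropy($with)
        - (array_sum($without) / $total) * entropy($without);
}
```

The tree builder would pick the term whose split gives the highest gain.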
Gives us a tree in array format, which we could save by serialising
Just need to traverse to classify
Simple function to return leaf node
Assumes the document is an array of words
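The traversal function itself isn't in these notes; a version of it might look like this, assuming an array format where leaves are ['class' => ...] and internal nodes are ['term' => ..., 'present' => subtree, 'absent' => subtree] (the node shape is my assumption, not necessarily the slides'):

```php
<?php
// Walk the array-format tree until we hit a leaf, following the
// 'present' branch when the document contains the node's term.
function treeClassify(array $node, array $words): string {
    while (!isset($node['class'])) {
        $branch = in_array($node['term'], $words) ? 'present' : 'absent';
        $node = $node[$branch];
    }
    return $node['class'];
}
```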
Stop condition - min info gain - or pruning
Use a separate ‘validation’ set to test effectiveness of tree at different depths
Choose most effective
DTs generate human interpretable rules - very handy
BUT expensive to train, don’t handle loads of dimensions well, and often require rebuilding
Uses the fact that we can regard these as vectors in a N-dimensional space
Documents of class triangle and class circle
They seem to have a spatial cluster
The K is how many we look for
Choosing K is kind of hard, you might try a few different values but it’s usually in the 10-30 doc range
Only real challenge is comparing documents
Here we can see we are looking at just the X and Y distance, this is the euclidean distance
Can actually do the whole thing in the database!
But, has some problems, so more common...
Easy to do with normalised vectors - just take dot product
Covers some cases euclidean is less good at
But this system gives us a grading of matching, the distance
Or we weight on how similar they are - on the assumption the best matches are most indicative
here we’re just adding the similarity, the closer the match the higher the value
Could get much more fancy with weighting schemes of course
In multi-label classification we might take any class that gets over a certain weight, in fact.
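Putting those pieces together, a minimal sketch of cosine-similarity KNN with similarity-weighted voting (assuming normalised vectors, so cosine similarity is just the dot product):

```php
<?php
// Cosine similarity of two unit-length term-weight vectors.
function dotProduct(array $a, array $b): float {
    $sum = 0.0;
    foreach ($a as $term => $weight) {
        $sum += $weight * ($b[$term] ?? 0.0);
    }
    return $sum;
}

// KNN: compare the document to every training example, take the K
// nearest, and add each neighbour's similarity to its class's vote.
function knnClassify(array $doc, array $training, int $k): string {
    $similarities = [];
    foreach ($training as $i => $example) {
        $similarities[$i] = dotProduct($doc, $example['vector']);
    }
    arsort($similarities);  // highest similarity first
    $votes = [];
    foreach (array_slice(array_keys($similarities), 0, $k) as $i) {
        $class = $training[$i]['class'];
        $votes[$class] = ($votes[$class] ?? 0.0) + $similarities[$i];
    }
    arsort($votes);
    return array_key_first($votes);
}
```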
But, still a bit of a pain to do in PHP - compare to every training document
Lots of ways to optimise, because search engines do a very similar job, similarity wise
Why not use one?
So, some fudging needed, like looking at only subject lines
Not necessarily great results, but very easy to implement
Good for twitter, or shorter applications perhaps
Will also want to replace ? and * characters
Or could add terms through the API
Still, a bit of a sketchy classifier
Has a similarity search that makes KNN ridiculously easy and very effective
The version with the PHP client is in SVN trunk at the moment, but is stable
Except we add then remove a document to use similarity feature
Gets good accuracy and is pretty fast.
However, if we want to use this kind of technique and don’t have a Flax handy,
there is another related technique
We take the *average* of all the documents in each class
And compare against that
This works surprisingly well!
By summing the weights, and multiplying by 1/the count.
You might do this in the database, pretty straightforward op.
Called a Rocchio classifier because it’s based on a relevance feedback technique by Rocchio
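The averaging step for a Rocchio-style centroid classifier, as a PHP sketch - sum the weights per term, then multiply by 1/count:

```php
<?php
// Average the term-weight vectors of all documents in a class
// to produce that class's centroid vector.
function centroid(array $vectors): array {
    $sum = [];
    foreach ($vectors as $vector) {
        foreach ($vector as $term => $weight) {
            $sum[$term] = ($sum[$term] ?? 0.0) + $weight;
        }
    }
    $count = count($vectors);
    foreach ($sum as $term => $weight) {
        $sum[$term] = $weight / $count;
    }
    return $sum;
}
```

To classify, you'd take the dot product of the new document's vector against each class centroid and pick the closest.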
Very commonly used in spam checking
Naive assumption is that words are independent - which is clearly not true
Means that we don’t need an example for each combination of attributes, which is very helpful for docs!
Bayes is good at very high dimensionality because of this
Read the pipe as ‘given’, pr as probability of
All classes are using the same doc, and since we only care about most likely, we can drop that bit
Prob of class is easy, can either work it out as a likelihood or just assume 0.5 (for binary)
So we just have to work out the probability of the document given the class, which we can treat as the product of the likelihoods of its terms occurring given the class
Simply looking at the conditional probability, the number of times that the
term occurs along with the given class divided by the total appearances of that class
Assuming we’ve stored the total count for the class, and the per-class counts for each term, that gives us term given class.
Here we calc it by looping over the terms in a doc, and multiply by the prior probability - probably 0.5.
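A sketch of that scoring loop in PHP. Two practical assumptions of mine, not from the slides: the products are done in log space to avoid floating-point underflow, and +1/+2 Laplace smoothing stops an unseen term zeroing out a class:

```php
<?php
// Score each class as log(prior) plus the summed log likelihoods of
// the doc's terms given the class; return the most likely class.
// $termCounts[class][term] = docs of that class containing the term,
// $classCounts[class] = total docs of that class.
function bayesClassify(array $doc, array $termCounts, array $classCounts, array $priors): string {
    $best = null;
    $bestScore = -INF;
    foreach ($classCounts as $class => $classCount) {
        $score = log($priors[$class]);
        foreach ($doc as $term) {
            $count = $termCounts[$class][$term] ?? 0;
            // Conditional probability with Laplace smoothing.
            $score += log(($count + 1) / ($classCount + 2));
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $class;
        }
    }
    return $best;
}
```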
This is multi-variate Bernoulli Bayes; there is also multinomial Bayes, which calculates likelihood based on relative term frequencies. For that we’d raise the likelihood to the power of the term frequency (count), and the likelihood is the sum of the counts of that word in each doc in the class (+1), divided by the sum of counts of all words in the class (+ the number of terms).
The first step is recognising that something is a classification problem
- context spelling
- author identification
- intrusion detection
- determining genes in DNA sections
Then you just need to extract features from the docs
And apply a learner.
Hope that everyone has this in their mental toolbox for different kinds of challenges
And also thanks to Lorenzo Alberton who gave me advice on this talk