Hello! PSC @ Ibuildings Twitter Email Blog - related posts
This what you need to do to implement a classifier And also our table of contents a note on PHP - Qs at end, but ask syntax qs straight away First, talk about what and why?
What is it - Assign documents to classes from a predefined set Classes can be any label - e.g. topic words, categories Documents in this case is text, web pages, emails, books But it can be really anything as long as you can extract features from it Algos not hard, applicable in all langs. Python/Java have good library versions So - Why do in PHP? Integrate into web apps - WP, Drupal, MediaWiki
Classification is really organising of information - every day Lots of uses - can group into common tasks of filter, organise, add metadata Might do all three with uploading photos to flickr or facebook Filter, get rid of bad ones. Organise, upload to album or set Tag photos with people in them etc.
Filtering is binary - Class OR Not Class - often hide or remove one lot Can break others types down into series of this binary choices often BUT: simple, not easy. In flickr example, what is good? - photographer, composition, light etc. - regular person, contains their friends etc. - SUBJECTIVE
Organising is putting document in one place - one label chosen from a set of many possible Single label only (often EXACTLY 1, 0 not allowed) Folders, albums, libraries, handwriting recognition
Tagging, can have multiple Often 0 to many labels Often for tagging topics in content E.g. a news story on us-china embargo talk might be filed under: US, China, Trade
In 80s people would come up with rules - computers would apply IF this term AND this term THEN this category Took a lot of time - Needed domain expert -Needed knowledge engineer to get knowledge out of expert Hard to scale, need more experts for new cats - Subjective - experts disagree Usually result was 60%-90% accurate
Machine Learning - &#x2018;look at examples&#x2019; - Supervised Learning Work out rules based on manually classified data People don&#x2019;t need to explain their thinking - just organise - easier Scales better, is cheaper, and about as accurate! In the picture, it&#x2019;s easy to see by looking at the groupings what the &#x2018;rule&#x2019; for classifying m&ms is
So what do we need? 1. the classes to classify to 2. A set of manually classified documents to train the classifier on 3. A set of manually classified docs to test on In some cases may have a third set of manually classified docs for validation How do we use these? We train a learner on training data to create a model
Then use the model to classify each test document Compare manual to automatic judgements Here we&#x2019;ve got a binary classification, for a spam checker Top is the manual judgement, side is classifier judgement Boxes will just be counts of judgements With that we can calculate some stats
Accuracy is just correct percentage - BUT big biased sets make &#x2018;no to all&#x2019; to all accurate. Or we desire bias e.g. FN over FP with spam Precision measures the number of positives that are true positives Recall measures what percentage of the available positives we captures Can have one w/out other: high threshold for precision, all positive for recall Researchers quote breakeven or fbeta
To compare classifiers, researchers often quote breakeven point This is just where recall and precision are equal F-Beta allows weighing precision more than recall, or vice versa. Beta = 1 is balanced If beta = 0.5, recall is half as important as precision, such as spam checker Before classifying, we need to extract features. How do we represent text
All this work is classic Information Retrieval Bag of Words is so called because we discard the structure, and just note down appearances of words Throw away the ordering, any structure at all from web pages etc. See why called vector space in a couple of slides
First we have to extract words Simplest version: Take continuous sequences of word character Ignore all punctuation including apostrophes etc. Each new token we find in each document will be added to a dictionary Each document has a vector - there is a dimension for each dictionary word Value is 1 if the document contained that token, 0 if it did not
Here is the collection of these two phrases as a vector. 1 if the word is in the document, 0 if not Note both vectors have the same dimensions In a real document collection there are lots of dimensions!
We can plot the documents on a graph - using 2 terms &#x2018;i&#x2019; and &#x2018;really&#x2019; Here the green circle is document A the red triangle B The documents on the last page are in 8 dimensional space - 8 terms But we want more information - how important a term is to a document Need to capture a position in that dimension other than 0 & 1 A weight
TFIDF is a classic and very common weighting - are a lot of variations though TF is just count of instances of that term in the doc IDF is number of docs divided by number with term Gives less common terms a higher weight So best is uncommon term that appears a lot Lets look at a similar example to before, with some term weights added
The idf means that the &#x2018;i&#x2019; and &#x2018;like&#x2019; actually disappear here In all docs - no distinguishing power - no value to doc Don&#x2019;t gets weighted higher in B Then normalise to unit length
Normalise is just each value divided by total length (sqrt of the sqrd values) I and Like still 0 though Waste of time processing Maybe there are others that are a waste of time?
DR or term space reduction is removing terms that don&#x2019;t contribute much This can often be by a factor of 10 or 100 Speeds up execution
May have heard of stop words - Common in search engines Words like &#x2018;of&#x2019; &#x2018;the&#x2019; &#x2018;an&#x2019; - or &#x2018;het&#x2019;, &#x2018;de&#x2019; in dutch Little to no semantic value to us Can use a stoplist of words, or infer it from low idf scores Collection stop words Pokemon in english, not a stop word. Pokemon on pokemon forum: stop word.
Try to come up with &#x2018;root&#x2019; word Maps lots of different variations onto one term, reducing dimensions Result is usually not a real world, it&#x2019;s just repeatable
Kai-Square, greek not chinese - Helps choose indicative terms for each class Statistical technique - Calculates how related a term is to a class Take 4 Counts from data. How many spam docs contain term etc. We look for difference between expected and actual counts For a given cell Expected is the row sum * col sum / total Square the difference, divided by expected value, and add all them up
Plug the numbers into this formula: a one step way of doing the same thing Comes out with a number - not interesting absolutely But is interesting relatively Chi-square is a distribution, so we can calculate a probability of the events being unrelated using the area from this distribution 1DF because there is one variable and one dependent (term) (class)
P is the chance that variables are independent For > 10.83 we are 99.9% certain the variables change together Can work out the probability number from a chi-square distribution But for DR, can just use a threshold and remove terms below OK, so we&#x2019;ve got a good set of data, now we need a learner
Tree of has term questions - ends in class decision Easy to classify, and recursive building algorithm pretty easy Algo is: If all collection is class, then leaf of class Else, choose the best term - Split into 2 collections, WITH and WITHOUT term Recurse on each half But how does it determine best?
First, calculate entropy Take counts for how many total docs, how many spam, how many ham minus section could be repeated for multiple classes Represents num bits needed to encode the class of a random choice of document from this set How much new information we get - Easier to see on graph
Percentage of spam on horizontal entropy on vertical If all spam or no spam no entropy - we know what will come out If 50/50 entropy is 1 - we can&#x2019;t guess ahead of time We want to reduce entropy - so that the sets are more consistent
We&#x2019;re using the entropy to calculated the maximum information gain This is the overall reduction in entropy The original entropy minus the new entropy New is weighted by the proportion of docs in each group withCount is the number of docs that have the feature woutCount is the number without, total is the total
The split is how many of each class are in the group The entropy is calculated with the formula before The proportion is just the percentage of the total documents Final col is just entropy times proportion Note that the with class is very biased with a low entropy BUT - only a small proportion, so the final information gain is low
Easy to implement recursive builder If &#x2018;spam&#x2019; or &#x2018;ham&#x2019; are empty - we say the tree is a leaf node. If not, we find the term with the highest info gain And built a subtree based on the set of terms with and without the term Just need to traverse to classify
An completely made up example of an output tree.
Millions of ways to do this, of course! Simple function to return leaf node Assumes document is an array of words
Problem: Tree gets too specific to training data - Need to generalise Stop condition - min info gain or other Pruning - test by trimming off bottom parts of tree Use validation set to test effectiveness of measures DTs generate human interpretable rules - very handy BUT expensive to train, need small N dimension, and often require rebuilding
KNN is much cheaper at training time - as there is no training Recall we can regard documents as vectors in a N-dimensional space Where N is the size of the dictionary
Lets consider only 2 terms Docs with weights for terms X and Y Documents of class triangle and class circle They seem to have a spatial cluster This is also true in higher dimension for real documents
Class of new doc = class of it&#x2019;s K nearest neighbours The K is how many we look for
In this case K is three, and the nearest three are all green circles. Choosing K is kind of hard, you might try a few different values but it&#x2019;s usually in the 11-30 doc range - uneven to num classes Only real challenge is comparing documents Here we are looking at just the X and Y distance, this is the euclidean distance
Very easy. Simply looking at the difference between one and the other Can actually do the whole thing in the database ! But, has some problems, so more common...
Similarity measure, goes to 1 for identical, 0 for orthogonal Easy to do with normalised vectors - just take dot product Multiply each dimension in Doc A with it in Doc B, and sum Provides better matching than Euclidean We could just loop over documents, find K most similar But search engines do a very similar job - why not use one?
Two options when classifying: count most common or add similarities Second helps, e.g. if 5 good matches in class A, 10 poor matches class B For multiple class tagging: use thresholds BUT: Have to compare all documents Search engines do a very similar job, use similar scoring. Why not use one?
We can use Zend Framework native PHP implementation of Lucene We add an unindexed &#x2018;class&#x2019; field, and our contents We would loop over our training data this way, adding documents
Then, we construct a query. Use the same analyser to tokenise documents the same as training data And take a count of how often each word appears We don&#x2019;t have IDF, so we&#x2019;re just filtering short words
Construct a query with the top 50 words by term frequency Results: take the most common class Works OK, not great. Java Lucene, can get a term vector - includes the true weights We aren&#x2019;t limited to using pure PHP search engines though
Flax is based on the open source Xapian engine, kind of like their Solr Has a similarity search that makes KNN ridiculously easy and very effective It works around the same lines as before, but extracting a set of relevant terms from the document or documents in question Weighing scheme is BM25 - more advanced
This code creates a database, adds two fields to it, and indexes a document Uses a restful web service - available from any language
Very similar to lucene loop Except we add then remove a document to use searchSimilar feature Gets good accuracy and is really fast. However, if we want to use this kind of technique and don&#x2019;t have a flax handy, there is another related technique
Instead of taking each value and comparing it We take the average of all the documents in each class And compare against that Very easy This works surprisingly well!
Here we compute the centroid or average of all the class By summing the weight * 1/count. You might do this in the database, pretty straightforward op. Called a rohk-key-oh classifier because it&#x2019;s based on a relevance feedback technique by Rocchio Classify by doing similarity against each - taking closest
Quick and easy probability based classifier Very commonly used in spam checking, very trendy a couple of years back Naive assumption is that words are independent One word does not influence chances of seeing another - not true! BUT: Means that we don&#x2019;t need an example for each combination of attributes Bayes is good at very high dimensionality because of this
This is the Bayes theorem. Read the pipe as &#x2018;given&#x2019;, pr as probability of Pr(Doc) is constant, can be dropped for ranking Pr(Class) is either count or assumed - e.g. 60% spam = 0.6, or just use 0.5/0.5 for binary Have to work out Pr(Doc|Class) We calculate that by looking at the probability of the features in the docs
We can look at the data itself to calculate the term likelihoods Conditional probability: Docs with term in class / Docs in class We had 1757 docs with the word register in the spam class, and about 16,000 docs in the spam class, so the probability is about 0.11. Register is more spam than ham, sent is more ham than spam
Can calculate in SQL directly ClassCount is the number of docs in that class - from earlier query Divide: Number of docs in class containing term / Number of docs in class The stored value is the likelihood of seeing that term in a doc of that class Would call once for each class
Independence assumption lets us treat probability of doc as product of probabilities of word for the given class Loop over the terms and multiply likelihood for each class Assumed prior of 0.5 Multi-bernouli - multinomial is term count in class over overall term count
To sum up, these are the steps for a wide range of problems Step 1: Recognising that something is a classification problem - context spelling, author ident, intrusion detection, find genes in DNA Then extract features from the docs Apply a learner to generate a model for classifying Something for your mental toolbox!
Thanks to the people who put their photos on flickr under Creative Commons
Document Classification In PHP - Slight Return
@ianbarber - firstname.lastname@example.org.......
Deﬁning The Task
A: I really like eggs
B: I donʼt like cabbage, and donʼt like stew
C: I really, really like stew
i really like eggs cabbage and donʼt stew
A 0 0.58 0 1.58 0 0 0 0
B 0 0 0 0 1.58 1.58 3.16 0.58
C 0 1.17 0 0 0 0 0 0.58
A: I really like eggs
B: I donʼt like cabbage, and donʼt like stew
C: I really, really like stew
i really like eggs cabbage and donʼt stew
A 0 0.35 0 0.94 0 0 0 0
B 0 0 0 0 0.31 0.31 0.63 0.11
C 0 0.89 0 0 0 0 0 0.44