Note: slightly updated version of these slides are: http://www.slideshare.net/IanBarber/document-classification-in-php-slight-return
This talk discusses how PHP and open source tools can be used to group and classify data for a whole host of applications, including information retrieval, data mining and more.
19. A: I really like eggs
    B: I don't like cabbage, and don't like stew

         i    really  like  eggs  cabbage  and    don't  stew
    A    0    0.25    0     0.25  0        0      0      0
    B    0    0       0     0     0.125    0.125  0.25   0.125

20. The same two documents with normalised weights:

         i    really  like  eggs  cabbage  and    don't  stew
    A    0    0.5     0     0.5   0        0      0      0
    B    0    0       0     0     0.2      0.2    0.4    0.2
This is just a quick overview of what we’ll be talking about today
Lots of Python/Java classifiers out there, but no PHP ones
Is it too hardcore? No, the algorithms are easy. Widely applicable.
So what is it? Assigning documents labels from a predefined set.
Labels can be anything - topic words, non-topic words, metadata whatever
Documents in this case is text, web pages, emails, books
But it can be really anything as long as you can extract features from it
Classification is really the organising of information - we do it every day
Lots of uses; these are the main ones, in my view.
Might do all three with uploading photos to flickr or facebook
Filter, get rid of bad ones.
Organise, upload to album or set
Tag photos with people in them etc.
Filtering is Class OR Not Class - generally you then hide or remove one lot
Binary classification - you can break most things down into a series of binary decisions
In flickr example, what is good?
- photographer, composition, light etc.
- some people, friends look good
- some people, friends look bad
Organising is putting document in one place - one label chosen from a set of many possible
Single label only (often EXACTLY 1, 0 not allowed)
Folders, albums, libraries, handwriting recognition
Tagging, can have multiple, often 0 - many labels
Often for tagging topics in content
E.g. a US-China WTO trade story might be filed under US, China and Trade
In the 80s, people would come up with rules
Then computers would apply rules
IF this word AND this WORD then this category
Took a lot of time
Needed knowledge engineer to get knowledge out of expert into rules
Didn’t scale, needed more experts for new categories
Subjective - experts disagree
Usually result was 60%-90% accurate
Machine Learning people said - ‘look at data’ - Supervised Learning
Work out rules based on manually classified examples
Scales better, is cheaper, and about as accurate!
Only need people to make examples, don’t have to be able to explain their process
Look at the picture - it’s easy to see from the groupings what the ‘rule’ for classifying M&Ms is
So what do you need?
1. the classes to classify to
2. A set of manually classified documents to train the classifier on
3. A set of manually classified docs to test on
In some cases may have a third set of docs for validation
So how do we test?
Run the test docs through, and compare manual to automatic judgements
Here we’ve got a binary classification, for a spam checker
Across the top is the manual judgement; down the side is the classifier’s judgement
Boxes will just be counts of judgements
Some classifiers give a graded result, some give a yes/no result.
For graded, we might take the top N judgements, or have a threshold they must achieve
Either way, in the end we get down to a judgement
With that we can calculate some numbers
Accuracy is just correct percentage
- not always useful, as we sometimes prefer one kind of error over another, e.g. false negatives over false positives with spam
Precision measures how tight our grouping is
- how much can we trust a positive result being really positive
Recall measures what percentage of the available positives we capture
You can have one without the other,
if you reject all but the ones you’re most sure about, you get good precision
if you mark everything positive, you get great recall
Because of the balance between recall and precision, researchers often quote the breakeven point
This is just where recall and precision are equal
F is a more advanced measure, measuring the overlap between the two sets
F-Beta just allows weighing precision more than recall, or vice versa.
If beta = 0.5, recall is half as important as precision, such as with spam checker
If beta = 1, then both are equally important
There is also an E measure, which is just its complement: 1 - F measure
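These measures are easy to compute. A quick Python sketch (the spam-checker counts are made up for illustration):

```python
def precision(tp, fp):
    # How much can we trust a positive result being really positive?
    return tp / (tp + fp)

def recall(tp, fn):
    # What percentage of the available positives did we capture?
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    # Weighted harmonic mean of precision and recall.
    # beta < 1 favours precision, beta > 1 favours recall.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# Hypothetical spam checker: flagged 40 messages, 30 correctly,
# while missing 10 real spams.
p = precision(tp=30, fp=10)
r = recall(tp=30, fn=10)
print(p, r, f_beta(p, r, beta=1.0))
```

With beta = 1 and equal precision and recall, the F measure equals both of them, as you'd expect from the breakeven point.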
Before we do classifying, we need to choose a way to represent text for some classifiers - indexing
All this work is classic Information Retrieval
Bag of Words is so called because we discard the structure, and just note down appearances of words
Throw away the ordering, any structure at all from web pages etc.
First we have to get the words
We can use a variety of methods for extracting tokens
About the simplest would probably be something like this
We dump all punctuation, everything but basic letters, and split on whitespace.
For email, Pear::Mail_mimeDecode is good for extracting the message body
We then represent each document as an array, where keys are all terms from all docs
And values are whether that particular term present in this particular document
This is the document vector
Here is the collection of these two phrases as a vector.
1 if the word is in the document, 0 if not
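The tokenising and binary document vectors above can be sketched in Python (a simple tokeniser as described: dump punctuation, keep basic letters, split on whitespace):

```python
import re

def tokenize(text):
    # Dump all punctuation, keep only basic letters and spaces,
    # lowercase, and split on whitespace.
    return re.sub(r"[^a-z\s]", "", text.lower()).split()

def binary_vectors(docs):
    # Keys are all terms from all docs; values are 1 if that term
    # is present in this particular document, 0 if not.
    vocab = sorted({t for d in docs for t in tokenize(d)})
    vecs = [[1 if t in tokenize(d) else 0 for t in vocab] for d in docs]
    return vocab, vecs

vocab, vecs = binary_vectors(["I really like eggs",
                              "I don't like cabbage, and don't like stew"])
print(vocab)
print(vecs)
```

Note the apostrophe gets stripped, so "don't" becomes the term "dont" - crude, but repeatable.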
You can plot the documents on a graph
Here the green circle is A the red triangle B
I’ve bounced up 0 on the graph just to keep it away from the value labels
So our previous document would actually be a point in 8 dimensional space
As we have 8 terms
Simple enough, but what we really want to do is capture a bit more information - a position on each axis
So instead of storing just presence, we store ‘weight’, the value of the term
TFIDF is a classic and very common weight - there are a lot of variations though
TF is just percentage of document composed of term
IDF is (the log of) the number of docs divided by the number containing the term
Gives less common terms a higher weight
So the best score goes to an uncommon term that appears a lot in this document
If we weight our previous example this way
the IDF means that ‘i’ and ‘like’ actually disappear here, as they are in all docs
Normally that wouldn’t quite happen! But it shows they have no value for telling the documents apart
‘Don’t’ gets weighted higher, as it appears twice.
We’d then usually normalise this, to unit length, to account a bit for doc length differences
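A Python sketch of the TF-IDF weighting described above. I'm normalising here so each doc's weights sum to one, which reproduces the 0.5 / 0.4 / 0.2 figures on the earlier slide; unit-length (L2) normalisation is the other common choice:

```python
import math

def tf_idf(docs):
    # docs are lists of tokens; returns a {term: weight} dict per doc.
    n = len(docs)
    df = {}                                   # docs containing each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    out = []
    for doc in docs:
        w = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)   # fraction of doc that is this term
            idf = math.log2(n / df[term])     # rarer terms weigh more
            w[term] = tf * idf
        total = sum(w.values()) or 1.0
        out.append({t: v / total for t, v in w.items()})  # normalise for length
    return out

a = "i really like eggs".split()
b = "i dont like cabbage and dont like stew".split()
wa, wb = tf_idf([a, b])
# 'i' and 'like' appear in every doc, so idf = log2(2/2) = 0 and they vanish
print(wa, wb)
```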
There are unnecessary terms here though: ‘i’ and ‘like’
Most algorithms look at all terms, so the increase number of term dimensions can be a problem
The number of dimensions is the whole vocabulary - every word that’s been seen in any document
DR or term space reduction is all about removing terms that don’t contribute much
This can often be by a factor of 10 or 100!
May have heard of stop words
Common in search engines of old
Words like ‘of’ ‘the’ ‘an’ - little to no semantic value to us
Can use a list of words, or infer it from low idf scores
Which would also pick up ‘collection’ stop words that are not necessarily english stop words
E.g. if you were classifying documents about Pokemon, the word ‘pokemon’ would probably appear very frequently, and be of little value
Stemming: try to come up with a ‘root’ word
Maps lots of different variations onto one term, reducing dimensions
Result is usually not english, it’s just repeatable
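A toy illustration in Python - this is just a crude suffix-stripper, not a real stemmer like Porter's algorithm, but it shows the idea of mapping lots of variations onto one repeatable, not-necessarily-English root:

```python
def crude_stem(word):
    # Strip a few common English suffixes, keeping at least a
    # three-letter root so short words survive untouched.
    for suffix in ("ing", "ers", "er", "ies", "es", "s", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# 'classifiers' and 'classified' both map to the same non-English
# root 'classifi' - not a word, but repeatable, which is what matters
print(crude_stem("classifiers"), crude_stem("classified"))
```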
Chi-square - chi as in the Greek letter, not Chinese
Statistical technique - this is an example of one, but there are many, odds ratio, information gain etc.
Keep only terms which are indicative of one class over another
We count up the four values - like the truth table from before
How many spam docs contain term etc.
Looks for importance of term by class by seeing the difference between expected and actual scores
Expected value for a cell is (row total x column total) / grand total
Then we look at the square of the difference, divided by the expected value
And add all them up
We plug the numbers into this formula, which is a one step way of doing the same thing
Comes out with a number which isn’t particularly interesting absolutely
But is interesting relatively
we can calculate a probability of the events being unrelated using the area from this distribution
The statistic has one degree of freedom, because for a 2x2 table df = (rows - 1)(cols - 1) = 1
Can work out the probability number from a chi-square distribution
But for DR, can just use a threshold and remove words with less than that threshold
P is the chance that variables are independent - so for > 10.83 we are 99.9% certain the variables are dependent, one changes with the other
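The one-step formula for a 2x2 table is chi^2 = N(AD - CB)^2 / ((A+B)(C+D)(A+C)(B+D)). A Python sketch, with made-up spam counts:

```python
def chi_square(a, b, c, d):
    # One-step chi-square for the 2x2 term/class truth table:
    #   a = class docs with the term,    b = other docs with the term
    #   c = class docs without the term, d = other docs without the term
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

# Hypothetical counts: a term appears in 90 of 100 spam docs
# but only 5 of 100 ham docs.
score = chi_square(a=90, b=5, c=10, d=95)
print(score)  # well above 10.83, so term and class are dependent
```

For dimensionality reduction you'd just keep the terms whose score passes your threshold.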
OK, so we’ve got a good set of data, now we need a classifier
Series of term present/not present questions branches in tree
Eventually ending in leaf classification nodes - this is a ‘yes or no’ result; there is no grading of similarity
Easy to classify, and building algorithm pretty easy
Recursive
If the whole collection is one class, make a leaf for that class
Else, choose the best term to split on, and recurse on each branch
But how does it determine best?
Calculate entropy
- section could be repeated for multiple classes
Basically represents how many bits needed to encode the result of a random sequence given this split
Easier to see on graph
If 0 or 1, the sequence is all the same class, so no bits
If 0.5 it’s 50/50 so you need 1 bit to encode each
If the mix is less even than that, you can use shorter codes for whichever of spam or ham is more common
And longer codes for the less common one, so the average bits per item is lower
Combine by looking for maximum information gain
Entropy of current set minus the weighted entropies of the two new sets
Final col is just entropy times proportion
For example, in this example the split looks pretty good
The ‘with term’ branch is very biased one way
But because it’s smaller the information gain isn’t massive
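In Python, the entropy and information gain calculations look something like this (the counts are illustrative):

```python
import math

def entropy(pos, neg):
    # Bits needed per item to encode a sequence with this class mix.
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def info_gain(with_term, without_term):
    # Entropy of the whole set minus the weighted entropies of the
    # two subsets produced by splitting on the term.
    # Each argument is a (positive, negative) count pair.
    wp, wn = with_term
    op, on = without_term
    total = wp + wn + op + on
    before = entropy(wp + op, wn + on)
    after = ((wp + wn) / total) * entropy(wp, wn) \
          + ((op + on) / total) * entropy(op, on)
    return before - after

print(entropy(5, 5))    # 50/50 needs one bit per item
print(entropy(10, 0))   # all one class, no bits needed
print(info_gain((9, 1), (5, 5)))
```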
Easy to implement recursive builder
Gives us a tree in array format, which we could save by serialising
Just need to traverse to classify
A completely made up example of an output tree.
Millions of ways to do this, of course
Simple function to return leaf node
Assumes document is as array of words
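A minimal Python sketch of the recursive builder and traversal described above, using binary word-presence features and made-up training docs (a real version would also want the minimum-gain stop condition or pruning mentioned below):

```python
import math

def class_entropy(labels):
    # Entropy of a list of class labels.
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / len(labels)
        h -= p * math.log2(p)
    return h

def build_tree(docs, labels):
    # docs: list of word-sets; labels: their manual classifications.
    if len(set(labels)) == 1:
        return labels[0]                          # pure: leaf node
    best_term, best_gain = None, 0.0
    for term in set().union(*docs):               # pick max info gain split
        yes = [l for d, l in zip(docs, labels) if term in d]
        no = [l for d, l in zip(docs, labels) if term not in d]
        if not yes or not no:
            continue
        gain = class_entropy(labels) \
             - (len(yes) / len(labels)) * class_entropy(yes) \
             - (len(no) / len(labels)) * class_entropy(no)
        if gain > best_gain:
            best_term, best_gain = term, gain
    if best_term is None:                         # nothing splits: majority leaf
        return max(set(labels), key=labels.count)
    pairs = list(zip(docs, labels))
    yes_pairs = [(d, l) for d, l in pairs if best_term in d]
    no_pairs = [(d, l) for d, l in pairs if best_term not in d]
    return {"term": best_term,
            "yes": build_tree(*map(list, zip(*yes_pairs))),
            "no": build_tree(*map(list, zip(*no_pairs)))}

def classify(tree, doc):
    # Traverse term-present/term-absent branches until we hit a leaf.
    while isinstance(tree, dict):
        tree = tree["yes"] if tree["term"] in doc else tree["no"]
    return tree

docs = [{"buy", "pills"}, {"cheap", "pills"},
        {"meeting", "monday"}, {"lunch", "monday"}]
tree = build_tree(docs, ["spam", "spam", "ham", "ham"])
print(classify(tree, {"pills", "cheap"}))  # spam
```

The nested dict plays the role of the array-format tree, which you could save by serialising.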
Problem: if you go right to the end, the tree will probably be too specific to the training data
Stop condition - min info gain - or pruning
Use a separate ‘validation’ set to test effectiveness of tree at different depths
Choose most effective
DTs generate human interpretable rules - very handy
BUT expensive to train, don’t handle loads of dimensions well, and often require rebuilding
KNN is much cheaper at training time - as there is no training
Uses the fact that we can regard these as vectors in a N-dimensional space
Lets consider only 2 terms, here we have documents displayed with their weights in terms X and Y
Documents of class triangle and class circle
They seem to have a spatial cluster
We can work out the class of the new one by looking at its nearest neighbours
The K is how many we look for
In this case K is three, and the nearest three, as you can see, are all green circles.
Choosing K is kind of hard, you might try a few different values but it’s usually in the 10-30 doc range
Only real challenge is comparing documents
Here we can see we are looking at just the X and Y distance, this is the euclidean distance
Very easy - simply looking at the difference between one vector and the other
Can actually do the whole thing in the database!
But, has some problems, so more common...
An alternative measure, cosine similarity: it goes to 1 for identical, 0 for orthogonal, -1 for opposite
Easy to do with normalised vectors - just take dot product
Covers some cases euclidean is less good at
We’ve got two options when classifying - can count most common as in first loop
But this system gives us a grading of matching, the distance
Or we weight on how similar they are - on the assumption the best matches are most indicative
here we’re just adding the similarity, the closer the match the higher the value
Could get much more fancy with weighting schemes of course
In multi-label classification we might take any class that gets over a certain weight, in fact.
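Putting those pieces together, a small Python sketch of weighted-vote KNN with cosine similarity (the training vectors and labels are invented):

```python
import math

def cosine(a, b):
    # Dot product over the product of lengths: 1 for identical
    # direction, 0 for orthogonal.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(training, new_doc, k=3):
    # training: list of (weight_dict, label) pairs.
    # Weighted vote: add each neighbour's similarity to its class
    # score, so the closest matches count for more.
    neighbours = sorted(training,
                        key=lambda ex: cosine(new_doc, ex[0]),
                        reverse=True)[:k]
    scores = {}
    for vec, label in neighbours:
        scores[label] = scores.get(label, 0.0) + cosine(new_doc, vec)
    return max(scores, key=scores.get)

training = [({"eggs": 1.0, "like": 0.5}, "food"),
            ({"stew": 1.0, "cabbage": 0.8}, "food"),
            ({"php": 1.0, "code": 0.7}, "tech"),
            ({"code": 1.0, "java": 0.6}, "tech")]
print(knn_classify(training, {"php": 1.0, "java": 0.3}, k=3))  # tech
```

Counting the most common class instead of summing similarities is the simpler first option mentioned above.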
But it’s still a bit of a pain to do in PHP - you have to compare against every training document
Lots of ways to optimise, because search engines do a very similar job, similarity wise
Why not use one?
Search engines are usually not designed to take whole documents as queries
So, some fudging needed, like looking at only subject lines
Not necessarily great results, but very easy to implement
Good for twitter, or shorter applications perhaps
Just implementing K using the result limit
Will also want to replace ? and * characters
Or could add terms through the API
Still, a bit of a sketchy classifier
Flax is based on the open source Xapian engine, kind of like their Solr
Has a similarity search that makes KNN ridiculously easy and very effective
The version with the PHP client is in SVN trunk at the moment, but is stable
This code creates a database, adds two fields to it, and indexes a document
Very similar to the Lucene loop
Except we add then remove a document to use similarity feature
Gets good accuracy and is pretty fast.
However, if we want to use this kind of technique and don’t have a flax handy,
there is another related technique
Instead of taking each value and comparing it
We take the *average* of all the documents in each class
And compare against that
This works surprisingly well!
Here we compute the centroid of each class
By summing the weights, and multiplying by 1/the count.
You might do this in the database, pretty straightforward op.
Called a Rocchio classifier because it’s based on a relevance feedback technique by Rocchio
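A Python sketch of the centroid approach (the class labels and weights are invented; cosine similarity is used for the comparison):

```python
import math

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    # Sum the weights for each term, then multiply by 1/count -
    # the average of all the documents in the class.
    totals = {}
    for vec in vectors:
        for term, w in vec.items():
            totals[term] = totals.get(term, 0.0) + w
    return {t: w / len(vectors) for t, w in totals.items()}

def rocchio_classify(classes, doc):
    # classes: {label: [weight_dict, ...]}. Compare the new doc
    # against one centroid per class, not every training document.
    cents = {label: centroid(vecs) for label, vecs in classes.items()}
    return max(cents, key=lambda label: cosine(doc, cents[label]))

classes = {"food": [{"eggs": 1.0}, {"stew": 1.0, "eggs": 0.5}],
           "tech": [{"php": 1.0}, {"code": 1.0}]}
print(rocchio_classify(classes, {"eggs": 1.0, "code": 0.2}))  # food
```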
Quick and easy probability based classifier
Very commonly used in spam checking
Naive assumption is that words are independent - which is clearly not true
Means that we don’t need an example for each combination of attributes, which is very helpful for docs!
Bayes is good at very high dimensionality because of this
Take this slow!!
Read the pipe as ‘given’, pr as probability of
All classes are using the same doc, and since we only care about most likely, we can drop that bit
Prob of class is easy, can either work it out as a likelihood or just assume 0.5 (for binary)
So we just have to work out the probability of the document given the class, which we can treat as the product of the likelihoods of its terms occurring given the class
We can look at the data itself to calculate the term likelihoods
Simply looking at the conditional probability, the number of times that the
term occurs along with the given class divided by the total appearances of that class
We can calculate it in a SQL query if storing the data.
Assuming we’ve stored the total count in the class count, and the class in class
The independence assumptions lets us treat that as the product of the probabilities of each individual
term given class.
Here we calculate it by looping over the terms in a doc, and multiply by the prior probability - probably 0.5.
This is multi-Bernoulli Bayes. There is also a multinomial version, which calculates likelihood based on relative term frequencies. For that, we’d raise each term’s likelihood to the power of its term frequency (count), and the likelihood of a term is the sum of the counts of that word in each doc in the class (+1), divided by the sum of the counts of all words in the class (+ number of terms).
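A Python sketch of the multi-Bernoulli version, using logs to avoid underflow, a prior taken from the data, and add-one smoothing so an unseen term doesn't zero out the whole product (a common tweak, assumed here; the docs and labels are made up):

```python
import math

def train(docs, labels):
    # Multi-Bernoulli Naive Bayes. Likelihood of a term given a class
    # is (docs in class containing term) / (docs in class).
    # docs are sets of words.
    model = {}
    for label in set(labels):
        class_docs = [d for d, l in zip(docs, labels) if l == label]
        counts = {}
        for d in class_docs:
            for t in d:
                counts[t] = counts.get(t, 0) + 1
        model[label] = (len(class_docs), counts)
    return model, len(docs)

def nb_classify(model_total, doc):
    model, total = model_total
    best, best_lp = None, float("-inf")
    for label, (n, counts) in model.items():
        lp = math.log(n / total)          # prior worked out from the data
        for t in doc:                     # independence: just multiply
            lp += math.log((counts.get(t, 0) + 1) / (n + 2))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [{"buy", "pills", "cheap"}, {"cheap", "pills"},
        {"meeting", "monday"}, {"lunch", "monday", "meeting"}]
nb = train(docs, ["spam", "spam", "ham", "ham"])
print(nb_classify(nb, {"cheap", "pills", "now"}))  # spam
```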
To sum up, what we have here handles a wide variety of problems
The first step is recognising that something is a classification problem
- context-sensitive spelling correction
- author identification
- intrusion detection
- determining genes in DNA sections
Then you just need to extract features from the docs
And apply a learner.
Hope that everyone has this in their mental toolbox for different kinds of challenges
Thanks to the people who put their photos on flickr under Creative Commons
And also thanks to Lorenzo Alberton who gave me advice on this talk