Lazy Multigram Learning
Environment for ACRS
SISY 2008
Zoran Popović
shoom013[at]gmail.com
Institute for Multidisciplinary Research
Belgrade University
Automated Content Recommendation
Systems (ACRS) and Information Retrieval
Information Retrieval (IR):
● CBR/CBF (Content Based Retrieval, does not depend on user specifications)
● CF (Collaborative Filtering, depends on the user)
An automatic content recommendation system is a
form of IR and collaborative filtering: it handles a huge
number of documents (hence the importance of lazy
methods), manages that amount of data and
meta-data, and supports queries and searching – examples ...
The user's information need drives the specifications
(no free lunch)
k-Nearest Neighbour Method
(kNN)
Important notions are: attributes (features
mapped to numbers), distance (can be defined
with kernel functions) and weights:
$$f(x_q) = \frac{\sum_{i=1}^{k} w_i y_i}{\sum_{i=1}^{k} w_i}, \qquad w_i = \frac{1}{d(x_q, x_i)^2}$$
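The weighted average above can be sketched in a few lines of Python. This is only an illustration: the Euclidean distance, the function name `knn_predict` and the handling of an exact match (distance zero) are assumptions, not part of the original slides.

```python
import math

def knn_predict(query, examples, k=3):
    """Distance-weighted kNN; examples is a list of (vector, value) pairs."""
    def dist(a, b):
        # Euclidean distance (an assumption; any distance d could be used)
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    # Keep the k training instances closest to the query point
    nearest = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    num = den = 0.0
    for x, y in nearest:
        d = dist(query, x)
        if d == 0.0:
            return float(y)  # exact match: the weight 1/d^2 would be infinite
        w = 1.0 / d ** 2     # w_i = 1 / d(x_q, x_i)^2
        num += w * y
        den += w
    return num / den         # f(x_q) = sum(w_i * y_i) / sum(w_i)
```

Because nothing is computed until a query arrives, the whole training set has to be kept and scanned per query, which is what makes the method lazy.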
One of the basic memory-based / instance-based (MBL / IBL)
supervised lazy learning methods
Applicable if close instances are classified closely
(similar inputs map to similar outputs)
Sensitive to the “curse of dimensionality”
With a larger training set a larger k can be used, but
computation becomes more expensive
Speculation: could be coupled with an object-attribute-value
model (like ontologies, concerning CF) and the Interpolative
realization of Boolean Algebra (IBA) as a fuzzy model
(a Description Logic extension is another approach)
SVM classification
One of the maximum margin classification
methods (the margin is bounded by hyperplanes)
Linear complexity of training, and much better
performance than most other methods (including
kNN), even with a large feature space
Memory is not wasted with large training sets
The “kernel trick” can be used (also with kNN) in
order to perform non-linear classification
There are some lazy implementations, but the
usual approach is batch (eager) learning
Multi-instance learning (MI) is possible with SVM
(can significantly improve performance where applicable)
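As a hedged illustration of how the kernel trick carries over to kNN: a kernel K induces a distance in its feature space through d(x, y)^2 = K(x, x) - 2 K(x, y) + K(y, y), so a nearest-neighbour search can use it without computing the feature map. The polynomial kernel and the names `poly_kernel` and `kernel_distance` below are illustrative choices, not taken from the slides.

```python
import math

def poly_kernel(x, y, degree=2, c=1.0):
    """Polynomial kernel (x . y + c)^degree."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** degree

def kernel_distance(x, y, kernel=poly_kernel):
    """Distance between x and y in the feature space induced by the kernel."""
    sq = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding errors
```

With degree 1 and c = 0 the kernel is the plain dot product and the induced distance reduces to the Euclidean one.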
N-grams
N-gram: a contiguous subsequence of N tokens from a given
sequence of tokens (over a given alphabet of
tokens)
Each document, as a sequence, is represented
by its N-grams
An N-gram's rank is determined by its frequency in the
observed sequences (training data)
Inverse frequency is defined by the number of
documents in the training data which contain the
given N-gram
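The definitions above can be sketched as follows. The function names are hypothetical, and two details are assumptions: ranks start at 1 for the most frequent N-gram, and the inverse frequency is taken as the fraction of training documents containing the N-gram, which mirrors the ratio used in the SQL shown later in these slides.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int):
    """All contiguous byte subsequences of length n."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]

def ngram_statistics(documents, n):
    """Return (ranks, inv_freq) for the N-grams of a list of byte strings."""
    total = Counter()      # occurrences over all training documents
    doc_freq = Counter()   # number of documents containing the N-gram
    for doc in documents:
        grams = byte_ngrams(doc, n)
        total.update(grams)
        doc_freq.update(set(grams))
    # Rank 1 is the most frequent N-gram; ties are broken by byte order
    ordered = sorted(total, key=lambda g: (-total[g], g))
    ranks = {g: r for r, g in enumerate(ordered, start=1)}
    inv_freq = {g: doc_freq[g] / len(documents) for g in total}
    return ranks, inv_freq
```

For example, `ngram_statistics([b"abab", b"abc"], 2)` ranks `b"ab"` first, since it occurs three times and appears in both documents.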
N-gram profile – the range or subset of ranks
whose N-grams are used as features
N-grams have a traditional role in NLP, and also
successful applications in many other areas
Proposal: use a byte alphabet (instead of the usual
characters, tags, words and other tokens);
this paper gives indications that this could be
practical for general content based file
classification with the proposed IR model
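A minimal sketch of turning one file into a profile-based feature vector, assuming the rank and inverse-frequency tables have already been computed from training data. The function name `profile_vector` is hypothetical; the default bounds only echo the defaults of the ngram.jar options listed in these slides, and the L2 normalisation is an assumption about what "normalized vectors" means there.

```python
import math
from collections import Counter

def profile_vector(doc: bytes, ranks, inv_freq, n, lmin=1, lmax=10, invf=0.34):
    """Sparse, L2-normalised feature vector {rank: value} for one document."""
    counts = Counter(doc[i:i + n] for i in range(len(doc) - n + 1))
    features = {}
    for gram, c in counts.items():
        r = ranks.get(gram)
        # Keep only N-grams inside the rank range and below the threshold
        if r is not None and lmin <= r <= lmax and inv_freq[gram] <= invf:
            features[r] = float(c)
    norm = math.sqrt(sum(v * v for v in features.values())) or 1.0
    return {r: v / norm for r, v in sorted(features.items())}
```

N-grams that occur in almost every document carry little information about the category, which is why the threshold removes them.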
ngram.jar – generating N-grams
java ngram.generator.Arff inDir outfile.arff [options]
Options: -l <Lmin> = lower rank bound (default=1)
-m <Lmax> = upper rank bound (default=10)
-i <invf> = inverse frequency threshold (default=0.34)
-N <N> = N-gram order (default=3)
-D <depth> = biggest number of N-grams (default=4294967295)
-w <url> = use database with jdbc url to write data
-r <url> = use database with jdbc url to read arff
-u = do not use normalized vectors for output
EXAMPLE:
. Arff.sh . .\out.arff -l 1 -m 500 -N 4 -i 0.5 -D 1048576
(subfolders as category names)
ngram.jar – JDBC storage
N-grams stored to database, and read from
database into ARFF:
. Arff.sh . ./out10.arff -l 0 -m 500 -N 4 -i 0.5
-w jdbc:oracle:thin:test/test@localhost:1521:xe
-r jdbc:oracle:thin:test/test@localhost:1521:xe
-N 0 with -r means that multigram output is
generated (aggregation of attributes)
MySQL and Oracle supported so far
http://users.hemo.net/shoom/n-gram.zip
http://users.hemo.net/shoom/samples.zip
http://users.hemo.net/shoom/scripts.zip
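The multigram aggregation mentioned above (attributes of several N-gram orders combined into one vector) might look like the sketch below. The slides do not say how ngram.jar lays out the combined attributes, so the offset scheme and the name `multigram_vector` are assumptions made for illustration only.

```python
def multigram_vector(per_n_features, block_sizes):
    """Merge sparse {rank: value} vectors of several N into one index space.

    per_n_features: {N: {rank: value}}, block_sizes: {N: Lmax for that N}.
    """
    merged = {}
    offset = 0
    for n in sorted(per_n_features):
        for rank, value in per_n_features[n].items():
            merged[offset + rank] = value  # shift ranks past the earlier blocks
        offset += block_sizes[n]
    return merged
```

With blocks of 800 and 500 attributes for N=3 and N=4, rank 2 of the N=4 profile ends up at index 802.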
Some good indications about
performance with multigrams
The same testing procedure always shows some
improvement of multigrams over the simple cases
with sizes like N=1 and N=2 (and others):
N      i-threshold   Lmax       % correct / incorrect   seconds
1, 2   0.34          800        92.42 / 7.58            8.73
2, 3   0.34          800, 500   94.25 / 5.75            11.95
3, 4   0.34          800, 500   97.87 / 2.13            14.44
The first multigram set performs better
than N=2 alone (92.16%), improved by the lower
order N=1 (though N=1 is trivial and many
categories have no representatives in it)
Weka, Data Mining Tool – ARFF
(Attribute-Relation File Format)
Example of multi-instance ARFF
file with sparse data
@relation rel
@attribute bag_id
{bag0,bag1,bag2,bag3,bag4,bag5,...2}
@attribute bag relational
@attribute a1 numeric
@attribute a2 numeric
....
@end bag
@attribute class {1,2,3,4,5,6,7,8}
@data
bag0,"{41 0.2148861237401014, 42
0.13430382733756338, 47 0.1074430618700507, ...,
495 0.05372153093502535}",1
bag1,"{....}",8
....
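For illustration, a data row of the shape shown above could be produced from a sparse feature dictionary like this. The function name `bag_row` is hypothetical, and only the one-instance-per-bag case from the example is covered.

```python
def bag_row(bag_id, features, label):
    """Format one bag with a single sparse instance as an ARFF data row."""
    # Sparse ARFF lists "index value" pairs in increasing index order
    pairs = ", ".join(f"{i} {v}" for i, v in sorted(features.items()))
    return f'{bag_id},"{{{pairs}}}",{label}'
```

For example, `bag_row("bag0", {42: 0.5, 41: 0.25}, 1)` gives `bag0,"{41 0.25, 42 0.5}",1`.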
Weka's SVM MI SMO classifier
Weka – JDBC
Horizontal form of data is needed - all attributes in each row
ARFF supports data given by sparse vectors
(zero values omitted – this also speeds up SVM)
Weka command line
• Training a model:
java -cp weka.jar weka.classifiers.mi.MISMO -C 1.0 \
-L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1 -K \
"weka.classifiers.mi.supportVector.MIPolyKernel \
-C 250007 -E 1.0" -t test/out4.arff -d m4.model
• Invoking classification for 3rd attribute based on
this model (unknown values can be given with "?"):
java -cp weka.jar weka.classifiers.mi.MISMO \
-p 3 -l m4.model -T test.arff
• Filters, converters and other tools:
java weka.core.converters.CSVLoader data.csv > data.arff
java weka.filters.unsupervised.instance.RemoveFolds \
-i data/soybean.arff -o soybean[-train.arff] -c last \
-N 4 -F 1 [-V]
SQL and vertical N-gram storage
PROFILES: NSHARED
Column Name   Data Type      Primary Key
N             NUMBER(1,0)    1
NGRAM         VARCHAR2(16)   2
COUNT         NUMBER         -
NGRAMS: TSHARED
Column Name   Data Type      Primary Key
N             NUMBER(1,0)    1
NGRAM         VARCHAR2(16)   2
BAG_ID        NUMBER         3
CATEGORY      VARCHAR2(16)   -
COUNT         NUMBER         -
SQL and data transformation
Data.sql – generating VIEW V_HDATA for horizontal form of data (quite slow):
create or replace view v_data as
select T.rank-1 rank, TS.count count, TS.bag_id, TS.category, TS.N
from tshared TS, (select rownum-(select count(bag_id)
    from tshared ts0 where ts0.N<T0.N) rank0, rownum rank, N, ngram
  from (select N, ngram, count from nshared order by N asc, count desc) T0) T
where TS.ngram=T.ngram and TS.N=T.N and T.rank0 between 0 and 500
and (select count(bag_id) from tshared TS2 where TS2.ngram=TS.ngram and TS2.N=TS.N)/(select count(distinct bag_id)
  from tshared) <= 0.5
order by TS.bag_id asc, TS.N asc, T.rank asc;
/
create or replace function nulify(n number) return number as
begin
  if n is null then return 0;
  else return n;
  end if;
end;
/
set pagesize 1500
set echo off
set serveroutput on
spool query.sql
declare
  CURSOR C IS
    select
      'nulify((select count from v_data vd where vd.bag_id=V.bag_id and rank='||rownum||')) A_'||rownum||', ' txt
    from dba_objects where rownum<=&HOW_MANY_COLS;
begin
  dbms_output.enable(32767);
  dbms_output.put_line('create view V_HDATA as SELECT V.BAG_ID,');
  for n IN C loop
    dbms_output.put_line(n.txt);
  end loop;
  dbms_output.put_line('(select max(V.CATEGORY) from v_data vd where vd.bag_id=V.bag_id) category ' ||
    'from (select distinct T.bag_id, T.category from tshared T order by 1) V;');
end;
/
spool off
@query.sql
set echo on
SQL and data transformation
Data2.sql – cursor function for a query returning horizontal form of data:
create or replace function fewcols(p_lmin number, p_lmax number, p_invf number, p_bagid number,
p_norm boolean default true) return sys_refcursor is
str varchar2(32000); cat varchar2(64) := null;
opt sys_refcursor; j number; n number; norm number := 1;
cursor C(lmin number, lmax number, invf number, bagid number) is select T.rank-1 rank, TS.count count, TS.bag_id, TS.category, TS.N
from tshared TS, (select rownum-(select count(bag_id) from tshared ts0 where ts0.N<T0.N) rank0, rownum rank, N, ngram
from (select N, ngram, count from nshared order by N asc, count desc) T0) T
where TS.ngram=T.ngram and TS.N=T.N and T.rank0 between lmin and lmax
and (select count(bag_id) from tshared TS2 where TS2.ngram=TS.ngram and TS2.N=TS.N)/(select count(distinct bag_id)
from tshared) <= invf and TS.bag_id=bagid order by TS.bag_id asc, TS.N asc, T.rank asc;
cursor CS(lmin number, lmax number, invf number, bagid number) is
select sqrt(sum((T.rank-1)*(T.rank-1))) norm from tshared TS, (select rownum-(select count(bag_id) from tshared ts0
where ts0.N<T0.N) rank0, rownum rank, N, ngram from (select N, ngram, count from nshared order by N asc, count desc) T0) T
where TS.ngram=T.ngram and TS.N=T.N and T.rank0 between lmin and lmax and (select count(bag_id) from tshared TS2
where TS2.ngram=TS.ngram and TS2.N=TS.N)/(select count(distinct bag_id) from tshared) <= invf and TS.bag_id=bagid;
begin
str := to_char(p_bagid)||' bag_id'; j:=p_lmin;
if p_norm then
open CS(p_lmin,p_lmax,p_invf,p_bagid);
fetch CS into norm;
if norm=0 then norm:=1; end if;
close CS;
end if;
for i in C(p_lmin,p_lmax,p_invf,p_bagid) loop
if cat is null then cat := i.category; end if;
if j<i.rank then
for n in j .. i.rank-1 loop str := str || ', 0 A' || to_char(n); end loop;
end if;
str := str || ', ' || to_char(i.count/norm,'9999.99999999') || ' A' || to_char(i.rank);
j := i.rank+1;
end loop;
if cat is null or j<=p_lmax then
select distinct category into cat from tshared where bag_id=p_bagid;
for n in j .. p_lmax loop str := str || ', 0 A' || to_char(n); end loop;
end if;
str := str || ', ''' || cat || ''' category'; open opt for 'select ' || str || ' from dual';
return opt;
end;
/
SQL and data transformation
Data3.sql – procedure generating table DATA3 in horizontal form:
create or replace procedure data(p_lmin number, p_lmax number, p_invf number, p_norm boolean default true) is
str varchar2(32000); cat varchar2(64) := null; first boolean := true; p_bagid number; j number; n number; norm number := 1;
cursor C(lmin number, lmax number, invf number, bagid number) is select T.rank-1 rank, TS.count count, TS.bag_id, TS.category, TS.N
from tshared TS, (select rownum-(select count(bag_id) from tshared ts0 where ts0.N<T0.N) rank0, rownum rank, N, ngram
from (select N, ngram, count from nshared order by N asc, count desc) T0) T where TS.ngram=T.ngram
and TS.N=T.N and T.rank0 between lmin and lmax and (select count(bag_id) from tshared TS2 where TS2.ngram=TS.ngram
and TS2.N=TS.N)/(select count(distinct bag_id) from tshared) <= invf and TS.bag_id=bagid
order by TS.bag_id asc, TS.N asc, T.rank asc;
cursor CS(lmin number, lmax number, invf number, bagid number) is select sqrt(sum((T.rank-1)*(T.rank-1))) norm
from tshared TS, (select rownum-(select count(bag_id) from tshared ts0 where ts0.N<T0.N) rank0, rownum rank, N, ngram
from (select N, ngram, count from nshared order by N asc, count desc) T0) T
where TS.ngram=T.ngram and TS.N=T.N and T.rank0 between lmin and lmax and (select count(bag_id) from tshared TS2
where TS2.ngram=TS.ngram and TS2.N=TS.N)/(select count(distinct bag_id) from tshared) <= invf and TS.bag_id=bagid;
cursor B IS select distinct bag_id from tshared;
begin
open B; loop
fetch B into p_bagid; exit when B%NOTFOUND; str := to_char(p_bagid)||' bag_id'; j:=p_lmin; cat := null;
if p_norm then
open CS(p_lmin,p_lmax,p_invf,p_bagid); fetch CS into norm;
if norm=0 then norm:=1; end if; close CS;
end if;
for i in C(p_lmin,p_lmax,p_invf,p_bagid) loop
if cat is null then cat := i.category; end if;
if j<i.rank then for n in j .. i.rank-1 loop str := str || ', 0 A' || to_char(n); end loop; end if;
str := str || ', ' || to_char(i.count/norm,'9999.99999999') || ' A' || to_char(i.rank); j := i.rank+1;
end loop;
if cat is null or j<=p_lmax then
select distinct category into cat from tshared where bag_id=p_bagid;
for n in j .. p_lmax loop str := str || ', 0 A' || to_char(n); end loop;
end if;
str := str || ', ''' || cat || ''' category';
if first then
first := false;
begin execute immediate 'drop table data3'; exception when others then null; end;
execute immediate 'create table data3 as select ' || str || ' from dual';
else
execute immediate 'insert into data3 select ' || str || ' from dual';
end if;
end loop; commit;
end;
/
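The vertical-to-horizontal transformation performed by the scripts above can be sketched outside the database as well. This is a simplified illustration with a hypothetical function name: ranks are assumed to be already computed, the inverse-frequency filter and the normalisation step are omitted, and missing ranks are filled with zeros as in the PL/SQL code.

```python
def to_horizontal(rows, lmin, lmax):
    """Pivot vertical (bag_id, category, rank, count) rows into one row per bag."""
    width = lmax - lmin + 1
    bags = {}
    for bag_id, category, rank, count in rows:
        # Every bag gets a zero-filled vector, even if no rank is in range
        cat, vec = bags.setdefault(bag_id, (category, [0] * width))
        if lmin <= rank <= lmax:
            vec[rank - lmin] = count
    # Output layout: bag_id, A<lmin> .. A<lmax>, category
    return [(bag_id, *vec, cat) for bag_id, (cat, vec) in sorted(bags.items())]
```

The dense output grows with the number of attributes, which is one reason the sparse ARFF form is attractive for larger profiles.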
Conclusions so far ...
File classification using an N-gram
representation is practically feasible
Aggregation of N-grams into multigrams is
like adding features (also done when
combining with other methods; in some
cases it is useful, not always), and it did not
degrade performance in these tests
Attribute pruning, on the other hand, is almost always
useful (e.g. random subspace, by
information gain, or any other available
method of pruning)
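As one concrete instance of the pruning mentioned above, information gain can score an attribute by how much knowing whether it is present reduces the entropy of the class labels. The sketch below treats each attribute as binary (present or absent in a document), which is a simplifying assumption; the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(present, labels):
    """Entropy reduction from splitting the documents on attribute presence."""
    gain = entropy(labels)
    for flag in (True, False):
        subset = [y for p, y in zip(present, labels) if p == flag]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain
```

Attributes would then be sorted by this score and only the highest-scoring ones kept as features.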