Lazy Multigram Learning Environment for ACRS
SISY 2008
Zoran Popović
shoom01...
Presentation about my paper at SISY 2008 about a prototype of document classification based on multigram profile models and SVM

Published in: Technology, Business
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
520
On SlideShare
0
From Embeds
0
Number of Embeds
15
Actions
Shares
0
Downloads
0
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide


  1. Lazy Multigram Learning Environment for ACRS. SISY 2008. Zoran Popović, shoom013[at]gmail.com, Institute for Multidisciplinary Research, Belgrade University
  2. Automated Content Recommendation Systems (ACRS) and Information Retrieval
     ● Information Retrieval (IR): CBR/CBF (Content-Based Retrieval, does not depend on user specifications); CF (Collaborative Filtering, depends on the user)
     ● An automated content recommendation system is a form of IR and collaborative filtering: a huge number of documents (hence the importance of lazy methods), managing such amounts of data and meta-data, queries and searching – examples ...
     ● The user's information need drives the specifications (no free lunch)
  3. k-Nearest Neighbour Method (kNN)
     ● Important notions are: attributes (features mapped to numbers), distance (can be defined with kernel functions) and weights:
       f(x_q) = \frac{\sum_{i=1}^{k} w_i y_i}{\sum_{i=1}^{k} w_i}, \qquad w_i = \frac{1}{d(x_q, x_i)^2}
  4. k-Nearest Neighbour Method (kNN)
     ● Represents one of the basic MBL / IBL and supervised lazy learning methods
     ● Applicable if close instances are classified closely (similar inputs map to similar outputs)
     ● Sensitive to the “curse of dimensionality”
     ● With a greater training set a larger k can be used, but computation becomes more expensive
     ● Speculation: could be coupled with an object-attribute-value model (like ontologies, concerning CF) and Interpolative realization of Boolean Algebra (IBA) as a fuzzy model (a Description Logic extension is another approach)
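The distance-weighted vote above can be sketched in Java. This is only a minimal illustration of the formula from slide 3; the class and method names are illustrative and are not part of ngram.jar:

```java
import java.util.Arrays;
import java.util.Comparator;

public class WeightedKnn {
    // Euclidean distance between two feature vectors
    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    // f(x_q) = sum(w_i * y_i) / sum(w_i), with w_i = 1 / d(x_q, x_i)^2
    static double predict(double[][] train, double[] labels, double[] xq, int k) {
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // sort training instances by distance to the query point
        Arrays.sort(idx, Comparator.comparingDouble(i -> distance(train[i], xq)));
        double num = 0, den = 0;
        for (int j = 0; j < k; j++) {
            int i = idx[j];
            double d = distance(train[i], xq);
            double w = (d == 0) ? 1e12 : 1.0 / (d * d); // exact match dominates the vote
            num += w * labels[i];
            den += w;
        }
        return num / den;
    }
}
```

With numeric class labels the same weighted average acts as a classifier by rounding, which is the lazy-learning setting the slide describes.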
  5. SVM classification
     ● One of the methods of maximum margin classification (margins as hyperplanes)
  6. SVM classification
     ● Linear complexity of training; much better performance than most other methods (including kNN) even with a large feature space
     ● Memory is not wasted with large training sets
     ● The “kernel trick” can be used (also with kNN) in order to make the classification non-linear
     ● There are some lazy implementations, but the usual approach is batch (eager) learning
     ● Multi-instance learning (MI) is possible with SVM (it can significantly improve performance where applicable)
  7. N-grams
     ● N-gram: defined as a subsequence of a given sequence of tokens (given an alphabet of tokens)
     ● Each document, as a sequence, is represented by its N-grams
     ● An N-gram's frequency over the observed sequences (the training data) determines its rank
     ● Inverse frequency is defined by the number of documents in the training data which contain the given N-gram
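Extracting byte N-grams from a document and counting their frequencies can be sketched as follows. This is an illustrative stand-alone class, not the actual ngram.generator code:

```java
import java.util.HashMap;
import java.util.Map;

public class ByteNgrams {
    // Count all N-grams of order n in a byte sequence.
    // Each N-gram is keyed by its hex representation.
    static Map<String, Integer> count(byte[] data, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= data.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < n; j++) {
                sb.append(String.format("%02x", data[i + j] & 0xff));
            }
            counts.merge(sb.toString(), 1, Integer::sum); // increment frequency
        }
        return counts;
    }
}
```

Sorting these counts in descending order over the whole training set yields the frequency ranks the slide refers to.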
  8. N-grams
     ● N-gram profile: a range or subset of ranks whose N-grams are observed as features
     ● Traditional role of N-grams in NLP, and also successful applications in many other areas
     ● Proposal: using a byte alphabet (instead of the usual characters, tags, words and other tokens)
     ● This paper gives indications that this could be practical for general content-based file classification and with the proposed IR model
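Building a profile from ranked N-grams, combined with the inverse-frequency threshold from the previous slide, can be sketched like this. The class name and signature are hypothetical, chosen only to mirror the -i and -m parameters of ngram.jar:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ProfileFilter {
    // Keep N-grams whose document-frequency ratio is <= invfThreshold,
    // then take the lmax most frequent ones (lowest ranks) as the profile.
    static List<String> profile(Map<String, Integer> totalCounts,
                                Map<String, Integer> docCounts,
                                int numDocs, double invfThreshold, int lmax) {
        List<String> kept = new ArrayList<>();
        for (String g : totalCounts.keySet()) {
            // fraction of training documents containing this N-gram
            double ratio = docCounts.getOrDefault(g, 0) / (double) numDocs;
            if (ratio <= invfThreshold) kept.add(g);
        }
        // rank by total frequency, descending
        kept.sort((a, b) -> totalCounts.get(b) - totalCounts.get(a));
        return kept.subList(0, Math.min(lmax, kept.size()));
    }
}
```

N-grams that occur in too many documents carry little class information, which is why the threshold prunes them before the rank cut-off is applied.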
  9. ngram.jar – generating N-grams
     java ngram.generator.Arff inDir outfile.arff [options]
     Options:
       -l <Lmin>  = lower rank bound (default=1)
       -m <Lmax>  = upper rank bound (default=10)
       -i <invf>  = inverse frequency threshold (default=0.34)
       -N <N>     = N-gram order (default=3)
       -D <depth> = biggest number of N-grams (default=4294967295)
       -w <url>   = use database with jdbc url to write data
       -r <url>   = use database with jdbc url to read arff
       -u         = do not use normalized vectors for output
     EXAMPLE: . Arff.sh . ./out.arff -l 1 -m 500 -N 4 -i 0.5 -D 1048576 (subfolders as category names)
  10. ngram.jar – JDBC storage
      ● N-grams stored to the database, and read from the database into ARFF:
        . Arff.sh . ./out10.arff -l 0 -m 500 -N 4 -i 0.5 -w jdbc:oracle:thin:test/test@localhost:1521:xe -r jdbc:oracle:thin:test/test@localhost:1521:xe
      ● -N 0 with -r means that multigram output is generated (aggregation of attributes)
      ● MySQL and Oracle supported so far
      ● http://users.hemo.net/shoom/n-gram.zip
        http://users.hemo.net/shoom/samples.zip
        http://users.hemo.net/shoom/scripts.zip
  11. Results with N-grams
      ● Classes 1-8, 210 files (cca 5 MB total): 22 exe/com, 23 text, 56 html, 17 pdf, 33 gif/jpg, 19 jar, 30 Word, 13 mail
      ● SVM SMO and 10-fold cross-validation:

        N   i-threshold   Lmax   % correct / not   N-grams   seconds
        4   0.34          500    98.48 / 1.52      2094646   6.27
        4   0.5           500    94.83 / 5.16      2094646   6.16
        4   0.5           500    94.83 / 5.16      1048576   6.28
        3   0.34          500    97.17 / 2.83      1807820   6.8
        4   0.34          800    98.10 / 1.90      2094646   8.14
        5   0.34          800    97.12 / 2.88      2247852   8.38
        4   0.34          1000   94.76 / 5.24      2094646   8.28
        2   0.34          800    92.16 / 7.84      65536     7.25
        1   0.34          800    93.10 / 6.90      256       1.73
        1   0.66          800    73.68 / 26.32     256       0.77
        4   0.25          250    96.37 / 3.63      2094646   6.47
  12. Some good indications about performance with multigrams
      ● The same testing procedure always shows some performance improvement over the simple cases with sizes like N=1 and N=2 (and others):

        N      i-threshold   Lmax       % correct / not   seconds
        1, 2   0.34          800        92.42 / 7.58      8.73
        2, 3   0.34          800, 500   94.25 / 5.75      11.95
        3, 4   0.34          800, 500   97.87 / 2.13      14.44

      ● The first multigram set has better performance than just N=2 (92.16%), improved by the lower-sized N=1 (though it is trivial and many categories had no representatives in it)
  13. Weka, Data Mining Tool – ARFF (Attribute-Relation File Format)
  14. Example of multi-instance ARFF file with sparse data
      @relation rel
      @attribute bag_id {bag0,bag1,bag2,bag3,bag4,bag5,...2}
      @attribute bag relational
      @attribute a1 numeric
      @attribute a2 numeric
      ....
      @end bag
      @attribute class {1,2,3,4,5,6,7,8}
      @data
      bag0,"{41 0.2148861237401014, 42 0.13430382733756338, 47 0.1074430618700507, ..., 495 0.05372153093502535}",1
      bag1,"{....}",8
      ....
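Serializing one bag of sparse features into a data row of this shape can be sketched as below; the helper is hypothetical, not part of ngram.jar or Weka, and only shows the index-value pair layout inside the quoted braces:

```java
import java.util.Map;
import java.util.SortedMap;

public class SparseArffRow {
    // Format one multi-instance sparse ARFF data row: bagId,"{idx val, ...}",class
    // The SortedMap keeps attribute indices in ascending order, as ARFF requires.
    static String row(String bagId, SortedMap<Integer, Double> features, String cls) {
        StringBuilder sb = new StringBuilder(bagId).append(",\"{");
        boolean first = true;
        for (Map.Entry<Integer, Double> e : features.entrySet()) {
            if (!first) sb.append(", ");
            sb.append(e.getKey()).append(' ').append(e.getValue());
            first = false;
        }
        return sb.append("}\",").append(cls).toString();
    }
}
```

Only non-zero attributes are written, which is what makes the sparse representation compact for the very wide N-gram feature space.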
  15. Weka's SVM MI SMO classifier
  16. Weka – JDBC
      ● A horizontal form of data is needed: all attributes in each row
      ● ARFF supports data given by sparse vectors (zero values omitted; this also speeds up SVM)
  17. Weka – JDBC
      <WEKA_HOME>/DatabaseUtils.props:
      ...
      jdbcDriver=...org.gjt.mm.mysql.Driver,oracle.jdbc.driver.OracleDriver
      ...
      CHAR=0
      ...
      VARCHAR=0
      VARCHAR2=0
      ...
      NUMBER=7
      ....
  18. Weka command line
      ● Training a model:
        java -cp weka.jar weka.classifiers.mi.MISMO -C 1.0 -L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.mi.supportVector.MIPolyKernel -C 250007 -E 1.0" -t test/out4.arff -d m4.model
      ● Invoking classification for the 3rd attribute based on this model (unknown values can be given with "?"):
        java -cp weka.jar weka.classifiers.mi.MISMO -p 3 -l m4.model -T test.arff
      ● Filters, converters and other tools:
        java weka.core.converters.CSVLoader data.csv > data.arff
        java weka.filters.unsupervised.instance.RemoveFolds -i data/soybean.arff -o soybean[-train.arff] -c last -N 4 -F 1 [-V]
  19. SQL and vertical N-gram storage

      PROFILES: NSHARED
        Column Name   Data Type      Primary Key
        N             NUMBER(1,0)    1
        NGRAM         VARCHAR2(16)   2
        COUNT         NUMBER         -

      NGRAMS: TSHARED
        Column Name   Data Type      Primary Key
        N             NUMBER(1,0)    1
        NGRAM         VARCHAR2(16)   2
        BAG_ID        NUMBER         3
        CATEGORY      VARCHAR2(16)   -
        COUNT         NUMBER         -
  20. SQL and data transformation
      Data.sql – generating VIEW V_HDATA for horizontal form of data (quite slow):

      create or replace view v_data as
      select T.rank-1 rank, TS.count count, TS.bag_id, TS.category, TS.N
      from tshared TS, (select rownum - (select count(bag_id)
        from tshared ts0 where ts0.N<T0.N) rank0, rownum rank, N, ngram
           from (select N, ngram, count from nshared order by N asc, count desc) T0) T
      where TS.ngram=T.ngram and TS.N=T.N and T.rank0 between 0 and 500
      and (select count(bag_id) from tshared TS2 where TS2.ngram=TS.ngram and TS2.N=TS.N)/(select count(distinct bag_id)
      from tshared) <= 0.5
      order by TS.bag_id asc, TS.N asc, T.rank asc;
      /
      create or replace function nulify(n number) return number as
      begin
        if n is null then return 0;
        else return n;
        end if;
      end;
      /
      set pagesize 1500
      set echo off
      set serveroutput on
      spool query.sql
      declare
        CURSOR C IS
          select
           'nulify((select count from v_data vd where vd.bag_id=V.bag_id and rank='||rownum||')) A_'||rownum||', ' txt
          from dba_objects where rownum<=&HOW_MANY_COLS;
      begin
        dbms_output.enable(32767);
        dbms_output.put_line('create view V_HDATA as SELECT V.BAG_ID,');
        for n IN C loop
           dbms_output.put_line(n.txt);
        end loop;
        dbms_output.put_line('(select max(V.CATEGORY) from v_data vd where vd.bag_id=V.bag_id) category ' ||
                     'from (select distinct T.bag_id, T.category from tshared T order by 1) V;');
      end;
      /
      spool off
      @query.sql
      set echo on
  21. SQL and data transformation
      Data2.sql – cursor function for a query returning horizontal form of data:

      create or replace function fewcols(p_lmin number, p_lmax number, p_invf number, p_bagid number,
         p_norm boolean default true) return sys_refcursor is
      str varchar2(32000); cat varchar2(64) := null; opt sys_refcursor;
      j number; n number; norm number := 1;
      cursor C(lmin number, lmax number, invf number, bagid number) is
      select T.rank-1 rank, TS.count count, TS.bag_id, TS.category, TS.N
      from tshared TS, (select rownum - (select count(bag_id) from tshared ts0 where ts0.N<T0.N) rank0, rownum rank, N, ngram
            from (select N, ngram, count from nshared order by N asc, count desc) T0) T
      where TS.ngram=T.ngram and TS.N=T.N and T.rank0 between lmin and lmax
      and (select count(bag_id) from tshared TS2 where TS2.ngram=TS.ngram and TS2.N=TS.N)/(select count(distinct bag_id) from tshared) <= invf
      and TS.bag_id=bagid
      order by TS.bag_id asc, TS.N asc, T.rank asc;
      cursor CS(lmin number, lmax number, invf number, bagid number) is
      select sqrt(sum((T.rank-1)*(T.rank-1))) norm
      from tshared TS, (select rownum - (select count(bag_id) from tshared ts0 where ts0.N<T0.N) rank0, rownum rank, N, ngram
       from (select N, ngram, count from nshared order by N asc, count desc) T0) T
      where TS.ngram=T.ngram and TS.N=T.N and T.rank0 between lmin and lmax
      and (select count(bag_id) from tshared TS2 where TS2.ngram=TS.ngram and TS2.N=TS.N)/(select count(distinct bag_id) from tshared) <= invf
      and TS.bag_id=bagid;
      begin
        str := to_char(p_bagid)||' bag_id'; j:=p_lmin;
        if p_norm then
          open CS(p_lmin,p_lmax,p_invf,p_bagid);
          fetch CS into norm;
          if norm=0 then norm:=1; end if;
          close CS;
        end if;
        for i in C(p_lmin,p_lmax,p_invf,p_bagid) loop
          if cat is null then cat := i.category; end if;
          if j<i.rank then
            for n in j .. i.rank-1 loop str := str || ', 0 A' || to_char(n); end loop;
          end if;
          str := str || ', ' || to_char(i.count/norm,'9999.99999999') || ' A' || to_char(i.rank);
          j := i.rank+1;
        end loop;
        if cat is null or j<=p_lmax then
          select distinct category into cat from tshared where bag_id=p_bagid;
          for n in j .. p_lmax loop str := str || ', 0 A' || to_char(n); end loop;
        end if;
        str := str || ', ''' || cat || ''' category';
        open opt for 'select ' || str || ' from dual';
        return opt;
      end;
      /
  22. SQL and data transformation
      Data3.sql – procedure generating table DATA3 in horizontal form:

      create or replace procedure data(p_lmin number, p_lmax number, p_invf number, p_norm boolean default true) is
      str varchar2(32000); cat varchar2(64) := null; first boolean := true;
      p_bagid number; j number; n number; norm number := 1;
      cursor C(lmin number, lmax number, invf number, bagid number) is
      select T.rank-1 rank, TS.count count, TS.bag_id, TS.category, TS.N
      from tshared TS, (select rownum - (select count(bag_id) from tshared ts0 where ts0.N<T0.N) rank0, rownum rank, N, ngram
          from (select N, ngram, count from nshared order by N asc, count desc) T0) T
      where TS.ngram=T.ngram and TS.N=T.N and T.rank0 between lmin and lmax
      and (select count(bag_id) from tshared TS2 where TS2.ngram=TS.ngram and TS2.N=TS.N)/(select count(distinct bag_id) from tshared) <= invf
      and TS.bag_id=bagid
      order by TS.bag_id asc, TS.N asc, T.rank asc;
      cursor CS(lmin number, lmax number, invf number, bagid number) is
      select sqrt(sum((T.rank-1)*(T.rank-1))) norm
      from tshared TS, (select rownum - (select count(bag_id) from tshared ts0 where ts0.N<T0.N) rank0, rownum rank, N, ngram
          from (select N, ngram, count from nshared order by N asc, count desc) T0) T
      where TS.ngram=T.ngram and TS.N=T.N and T.rank0 between lmin and lmax
      and (select count(bag_id) from tshared TS2
         where TS2.ngram=TS.ngram and TS2.N=TS.N)/(select count(distinct bag_id) from tshared) <= invf
      and TS.bag_id=bagid;
      cursor B IS select distinct bag_id from tshared;
      begin
        open B; loop
          fetch B into p_bagid; exit when B%NOTFOUND;
          str := to_char(p_bagid)||' bag_id'; j:=p_lmin;
          if p_norm then
            open CS(p_lmin,p_lmax,p_invf,p_bagid); fetch CS into norm;
            if norm=0 then norm:=1; end if; close CS;
          end if;
          for i in C(p_lmin,p_lmax,p_invf,p_bagid) loop
            if cat is null then cat := i.category; end if;
            if j<i.rank then for n in j .. i.rank-1 loop str := str || ', 0 A' || to_char(n); end loop; end if;
            str := str || ', ' || to_char(i.count/norm,'9999.99999999') || ' A' || to_char(i.rank); j := i.rank+1;
          end loop;
          if cat is null or j<=p_lmax then
            select distinct category into cat from tshared where bag_id=p_bagid;
            for n in j .. p_lmax loop str := str || ', 0 A' || to_char(n); end loop;
          end if;
          str := str || ', ''' || cat || ''' category';
          if first then
             first := false;
             begin execute immediate 'drop table data3'; exception when others then null; end;
             execute immediate 'create table data3 as select ' || str || ' from dual';
          else
             execute immediate 'insert into data3 select ' || str || ' from dual';
          end if;
        end loop; commit;
      end;
      /
  23. Conclusions so far ...
      ● File classification using an N-gram representation is practically possible
      ● Aggregation of N-grams into multigrams is like adding features (also used when combining with other methods; in some cases it is useful, not always), but it does not deteriorate performance
      ● Attribute pruning is almost always useful (e.g. random subspace performance, pruning by information gain, or any other available method of pruning)
  24. Questions ... ?
