SISY 2008

    SISY 2008 - Presentation Transcript

    1. Lazy Multigram Learning Environment for ACRS SISY 2008 Zoran Popović shoom013[at]gmail.com Institute for Multidisciplinary Research Belgrade University
    2. Automated Content Recommendation Systems (ACRS) and Information Retrieval
        • Information Retrieval (IR):
            ● CBR/CBF (Content Based Retrieval / Filtering; does not depend on user specifications)
            ● CF (Collaborative Filtering; depends on the user)
        • An automated content recommendation system is a form of IR and collaborative filtering: it handles a huge number of documents (hence the importance of lazy methods), manages that amount of data and metadata, and serves queries and searches – examples ...
        • The user's information need drives the specification (no free lunch)
    3. k-Nearest Neighbour Method (kNN)
        • Important notions are: attributes (features mapped to numbers), distance (can be defined with kernel functions) and weights:
          $$f(x_q) = \frac{\sum_{i=1}^{k} w_i \, y_i}{\sum_{i=1}^{k} w_i}, \qquad w_i = \frac{1}{d(x_q, x_i)^2}$$
    4. k-Nearest Neighbour Method (kNN)
        • One of the basic memory-based / instance-based learning (MBL / IBL) methods; a supervised, lazy learning method
        • Applicable if close instances are classified closely (similar inputs map to similar outputs)
        • Sensitive to the “curse of dimensionality”
        • With a larger training set a larger k can be used, but computation becomes more expensive
        • Speculation: could be coupled with an object-attribute-value model (like ontologies, concerning CF) and the Interpolative realization of Boolean Algebra (IBA) as a fuzzy model (a Description Logic extension is another approach)
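The weighted estimate from slide 3 can be illustrated with a minimal Python sketch. It assumes Euclidean distance and numeric targets; the function name `knn_predict` and the `(features, target)` tuple layout are illustrative choices, not something taken from the presentation.

```python
import math

def knn_predict(query, examples, k=3):
    """Distance-weighted kNN estimate: sum(w_i * y_i) / sum(w_i), with w_i = 1 / d(x_q, x_i)^2.

    examples is a list of (feature_vector, numeric_target) pairs (assumed layout).
    """
    # Lazy learning: nothing is precomputed, neighbours are found at query time.
    nearest = sorted(examples, key=lambda e: math.dist(query, e[0]))[:k]
    num = 0.0
    den = 0.0
    for x, y in nearest:
        d = math.dist(query, x)
        if d == 0.0:
            # The weight would be infinite, so return the stored target directly.
            return float(y)
        w = 1.0 / (d * d)
        num += w * y
        den += w
    return num / den
```

For example, `knn_predict((0.5, 0.0), [((0.0, 0.0), 0.0), ((2.0, 0.0), 1.0)], k=2)` weights the closer neighbour nine times more heavily than the farther one.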
    5. SVM classification
        • One of the maximum-margin classification methods (classes are separated by a hyperplane with the largest margin)
    6. SVM classification
        • Linear complexity of training, much better performance than most other methods (including kNN) even with a large feature space
        • Memory is not wasted with large training sets
        • The “kernel trick” can be used (also with kNN) to perform non-linear classification
        • There are some lazy implementations, but the usual approach is batch (eager) learning
        • Multi-instance learning (MI) is possible with SVM (it can significantly improve performance where applicable)
    7. N-grams
        • N-gram: defined as a subsequence of a given sequence of tokens (given an alphabet of tokens)
        • Each document, as a sequence, is represented by its N-grams
        • The frequency of an N-gram in the observed sequences (training data) determines its rank
        • Inverse frequency is defined by the number of documents in the training data which contain the given N-gram
    8. N-grams
        • N-gram profile: the range or subset of ranks whose N-grams are used as features
        • N-grams have a traditional role in NLP, and also successful applications in many other areas
        • Proposal: using a byte alphabet (instead of the usual characters, tags, words and other tokens)
        • This paper gives indications that this could be practical for general content-based file classification with the proposed IR model
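The notions from slides 7 and 8 (byte N-grams, rank by frequency, inverse-frequency threshold, profile as a rank range) can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, ties in frequency are broken by byte order, and the inverse frequency is taken as the fraction of documents containing the N-gram. It is not the implementation inside ngram.jar.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int):
    """All contiguous byte subsequences of length n."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]

def build_profile(docs, n=3, lmin=1, lmax=10, invf=0.34):
    """Select N-grams whose rank lies in [lmin, lmax] and whose
    document fraction does not exceed the threshold invf."""
    total = Counter()     # overall frequency of each N-gram
    doc_freq = Counter()  # number of documents containing each N-gram
    for data in docs:
        grams = byte_ngrams(data, n)
        total.update(grams)
        doc_freq.update(set(grams))
    # Rank 1 is the most frequent N-gram; ties are broken by byte order (assumption).
    ranked = sorted(total, key=lambda g: (-total[g], g))
    profile = []
    for rank, gram in enumerate(ranked, start=1):
        if lmin <= rank <= lmax and doc_freq[gram] / len(docs) <= invf:
            profile.append(gram)
    return profile
```

With three small documents `[b"abab", b"abcd", b"xyz"]` and `n=2`, the bigram `b"ab"` has the highest rank but occurs in two of three documents, so a threshold of 0.34 drops it from the profile.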
    9. ngram.jar – generating N-grams
        java ngram.generator.Arff inDir outfile.arff [options]
        Options:
            -l <Lmin>   lower rank bound (default=1)
            -m <Lmax>   upper rank bound (default=10)
            -i <invf>   inverse frequency threshold (default=0.34)
            -N <N>      N-gram order (default=3)
            -D <depth>  biggest number of N-grams (default=4294967295)
            -w <url>    use database with jdbc url to write data
            -r <url>    use database with jdbc url to read arff
            -u          do not use normalized vectors for output
        EXAMPLE (subfolders as category names):
            . Arff.sh . .\out.arff -l 1 -m 500 -N 4 -i 0.5 -D 1048576
    10. ngram.jar – JDBC storage
        • N-grams are stored to a database and read back from the database into ARFF:
            . Arff.sh . ./out10.arff -l 0 -m 500 -N 4 -i 0.5 -w jdbc:oracle:thin:test/test@localhost:1521:xe -r jdbc:oracle:thin:test/test@localhost:1521:xe
        • -N 0 with -r means that multigram output is generated (aggregation of attributes)
        • MySQL and Oracle are supported so far
        • http://users.hemo.net/shoom/n-gram.zip
          http://users.hemo.net/shoom/samples.zip
          http://users.hemo.net/shoom/scripts.zip
    11. Results with N-grams
        • Classes 1-8, 210 files (approx. 5 MB in total): 22 exe/com, 23 text, 56 html, 17 pdf, 33 gif/jpg, 19 jar, 30 Word, 13 mail
        • SVM SMO and 10-fold cross-validation

          N   i-threshold   Lmax   % correct / not   N-grams   seconds
          4   0.34          500    98.48 / 1.52      2094646   6.27
          4   0.5           500    94.83 / 5.16      2094646   6.16
          4   0.5           500    94.83 / 5.16      1048576   6.28
          3   0.34          500    97.17 / 2.83      1807820   6.8
          4   0.34          800    98.10 / 1.90      2094646   8.14
          5   0.34          800    97.12 / 2.88      2247852   8.38
          4   0.34          1000   94.76 / 5.24      2094646   8.28
          2   0.34          800    92.16 / 7.84      65536     7.25
          1   0.34          800    93.10 / 6.90      256       1.73
          1   0.66          800    73.68 / 26.32     256       0.77
          4   0.25          250    96.37 / 3.63      2094646   6.47
    12. Some good indications about performance with multigrams
        • The same testing procedure always shows some performance improvement over simple cases with sizes like N=1 and N=2 (and others):

          N      i-threshold   Lmax       % correct / not   seconds
          1, 2   0.34          800        92.42 / 7.58      8.73
          2, 3   0.34          800, 500   94.25 / 5.75      11.95
          3, 4   0.34          800, 500   97.87 / 2.13      14.44

        • The first multigram set performs better than N=2 alone (92.16%); it is improved by the lower-order N=1 (though that case is trivial, and many categories had no representatives in it)
    13. Weka, Data Mining Tool – ARFF (Attribute-Relation File Format)
    14. Example of multi-instance ARFF file with sparse data
        @relation rel
        @attribute bag_id {bag0,bag1,bag2,bag3,bag4,bag5,...2}
        @attribute bag relational
          @attribute a1 numeric
          @attribute a2 numeric
          ....
        @end bag
        @attribute class {1,2,3,4,5,6,7,8}
        @data
        bag0,"{41 0.2148861237401014, 42 0.13430382733756338, 47 0.1074430618700507, ..., 495 0.05372153093502535}",1
        bag1,"{....}",8
        ....
    15. Weka's SVM MI SMO classifier
    16. Weka – JDBC
        • Horizontal form of data is needed: all attributes in each row
        • ARFF supports data given by sparse vectors (zero values are omitted, which also speeds up SVM)
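How one sparse data row of the multi-instance form on slide 14 could be produced is sketched below. The helper name `mi_sparse_row` and the dict input are hypothetical, and the L2 normalization is an assumption about what "normalized vectors" means for the tool's default output.

```python
import math

def mi_sparse_row(bag_id, counts, label, normalize=True):
    """Format one bag as: bag_id,"{index value, index value, ...}",label

    counts maps attribute index -> N-gram count; zero entries are omitted,
    which is what makes the representation sparse.
    """
    items = {i: float(v) for i, v in counts.items() if v != 0}
    if normalize:
        # L2 normalization (assumption); an all-zero vector is left unchanged.
        norm = math.sqrt(sum(v * v for v in items.values())) or 1.0
        items = {i: v / norm for i, v in items.items()}
    body = ", ".join(f"{i} {items[i]}" for i in sorted(items))
    return f'{bag_id},"{{{body}}}",{label}'
```

Calling `mi_sparse_row("bag0", {41: 3, 42: 0, 47: 4}, 1)` drops attribute 42 and scales the remaining two values to unit length.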
    17. Weka – JDBC
        <WEKA_HOME>/DatabaseUtils.props:
            ...
            jdbcDriver=...org.gjt.mm.mysql.Driver,oracle.jdbc.driver.OracleDriver
            ...
            CHAR=0
            ...
            VARCHAR=0
            VARCHAR2=0
            ...
            NUMBER=7
            ....
    18. Weka command line
        • Training a model:
            java -cp weka.jar weka.classifiers.mi.MISMO -C 1.0 \
              -L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1 -K \
              "weka.classifiers.mi.supportVector.MIPolyKernel \
              -C 250007 -E 1.0" -t test/out4.arff -d m4.model
        • Invoking classification for 3rd attribute based on this model (unknown values can be given with "?"):
            java -cp weka.jar weka.classifiers.mi.MISMO \
              -p 3 -l m4.model -T test.arff
        • Filters, converters and other tools:
            java weka.core.converters.CSVLoader data.csv > data.arff
            java weka.filters.unsupervised.instance.RemoveFolds \
              -i data/soybean.arff -o soybean[-train.arff] -c last \
              -N 4 -F 1 [-V]
    19. SQL and vertical N-gram storage
        PROFILES: NSHARED

          Column Name   Data Type      Primary Key
          N             NUMBER(1,0)    1
          NGRAM         VARCHAR2(16)   2
          COUNT         NUMBER         -

        NGRAMS: TSHARED

          Column Name   Data Type      Primary Key
          N             NUMBER(1,0)    1
          NGRAM         VARCHAR2(16)   2
          BAG_ID        NUMBER         3
          CATEGORY      VARCHAR2(16)   -
          COUNT         NUMBER         -
    20. SQL and data transformation
        Data.sql – generating VIEW V_HDATA for horizontal form of data (quite slow):

        create or replace view v_data as
        select T.rank-1 rank, TS.count count, TS.bag_id, TS.category, TS.N
        from tshared TS,
             (select rownum - (select count(bag_id) from tshared ts0 where ts0.N<T0.N) rank0,
                     rownum rank, N, ngram
              from (select N, ngram, count from nshared order by N asc, count desc) T0) T
        where TS.ngram=T.ngram and TS.N=T.N and T.rank0 between 0 and 500
          and (select count(bag_id) from tshared TS2 where TS2.ngram=TS.ngram and TS2.N=TS.N)
              /(select count(distinct bag_id) from tshared) <= 0.5
        order by TS.bag_id asc, TS.N asc, T.rank asc;
        /
        create or replace function nulify(n number) return number as
        begin
          if n is null then return 0;
          else return n;
          end if;
        end;
        /
        set pagesize 1500
        set echo off
        set serveroutput on
        spool query.sql
        declare
          CURSOR C IS
            select
             'nulify((select count from v_data vd where vd.bag_id=V.bag_id and rank='||rownum||')) A_'||rownum||', ' txt
            from dba_objects where rownum<=&HOW_MANY_COLS;
        begin
          dbms_output.enable(32767);
          dbms_output.put_line('create view V_HDATA as SELECT V.BAG_ID,');
          for n IN C loop
             dbms_output.put_line(n.txt);
          end loop;
          dbms_output.put_line('(select max(V.CATEGORY) from v_data vd where vd.bag_id=V.bag_id) category ' ||
                       'from (select distinct T.bag_id, T.category from tshared T order by 1) V;');
        end;
        /
        spool off
        @query.sql
        set echo on
    21. SQL and data transformation
        Data2.sql – cursor function for a query returning horizontal form of data:

        create or replace function fewcols(p_lmin number, p_lmax number, p_invf number, p_bagid number,
            p_norm boolean default true) return sys_refcursor is
          str varchar2(32000);
          cat varchar2(64) := null;
          opt sys_refcursor;
          j number;
          n number;
          norm number := 1;
          cursor C(lmin number, lmax number, invf number, bagid number) is
            select T.rank-1 rank, TS.count count, TS.bag_id, TS.category, TS.N
            from tshared TS,
                 (select rownum - (select count(bag_id) from tshared ts0 where ts0.N<T0.N) rank0,
                         rownum rank, N, ngram
                  from (select N, ngram, count from nshared order by N asc, count desc) T0) T
            where TS.ngram=T.ngram and TS.N=T.N and T.rank0 between lmin and lmax
              and (select count(bag_id) from tshared TS2 where TS2.ngram=TS.ngram and TS2.N=TS.N)
                  /(select count(distinct bag_id) from tshared) <= invf
              and TS.bag_id=bagid
            order by TS.bag_id asc, TS.N asc, T.rank asc;
          cursor CS(lmin number, lmax number, invf number, bagid number) is
            select sqrt(sum((T.rank-1)*(T.rank-1))) norm
            from tshared TS,
                 (select rownum - (select count(bag_id) from tshared ts0 where ts0.N<T0.N) rank0,
                         rownum rank, N, ngram
                  from (select N, ngram, count from nshared order by N asc, count desc) T0) T
            where TS.ngram=T.ngram and TS.N=T.N and T.rank0 between lmin and lmax
              and (select count(bag_id) from tshared TS2 where TS2.ngram=TS.ngram and TS2.N=TS.N)
                  /(select count(distinct bag_id) from tshared) <= invf
              and TS.bag_id=bagid;
        begin
          str := to_char(p_bagid)||' bag_id';
          j:=p_lmin;
          if p_norm then
            open CS(p_lmin,p_lmax,p_invf,p_bagid);
            fetch CS into norm;
            if norm=0 then norm:=1; end if;
            close CS;
          end if;
          for i in C(p_lmin,p_lmax,p_invf,p_bagid) loop
            if cat is null then cat := i.category; end if;
            if j<i.rank then
              for n in j .. i.rank-1 loop str := str || ', 0 A' || to_char(n); end loop;
            end if;
            str := str || ', ' || to_char(i.count/norm,'9999.99999999') || ' A' || to_char(i.rank);
            j := i.rank+1;
          end loop;
          if cat is null or j<=p_lmax then
            select distinct category into cat from tshared where bag_id=p_bagid;
            for n in j .. p_lmax loop str := str || ', 0 A' || to_char(n); end loop;
          end if;
          str := str || ', ''' || cat || ''' category';
          open opt for 'select ' || str || ' from dual';
          return opt;
        end;
        /
    22. SQL and data transformation
        Data3.sql – procedure generating table DATA3 in horizontal form:

        create or replace procedure data(p_lmin number, p_lmax number, p_invf number,
            p_norm boolean default true) is
          str varchar2(32000);
          cat varchar2(64) := null;
          first boolean := true;
          p_bagid number;
          j number;
          n number;
          norm number := 1;
          cursor C(lmin number, lmax number, invf number, bagid number) is
            select T.rank-1 rank, TS.count count, TS.bag_id, TS.category, TS.N
            from tshared TS,
                 (select rownum - (select count(bag_id) from tshared ts0 where ts0.N<T0.N) rank0,
                         rownum rank, N, ngram
                  from (select N, ngram, count from nshared order by N asc, count desc) T0) T
            where TS.ngram=T.ngram and TS.N=T.N and T.rank0 between lmin and lmax
              and (select count(bag_id) from tshared TS2 where TS2.ngram=TS.ngram and TS2.N=TS.N)
                  /(select count(distinct bag_id) from tshared) <= invf
              and TS.bag_id=bagid
            order by TS.bag_id asc, TS.N asc, T.rank asc;
          cursor CS(lmin number, lmax number, invf number, bagid number) is
            select sqrt(sum((T.rank-1)*(T.rank-1))) norm
            from tshared TS,
                 (select rownum - (select count(bag_id) from tshared ts0 where ts0.N<T0.N) rank0,
                         rownum rank, N, ngram
                  from (select N, ngram, count from nshared order by N asc, count desc) T0) T
            where TS.ngram=T.ngram and TS.N=T.N and T.rank0 between lmin and lmax
              and (select count(bag_id) from tshared TS2 where TS2.ngram=TS.ngram and TS2.N=TS.N)
                  /(select count(distinct bag_id) from tshared) <= invf
              and TS.bag_id=bagid;
          cursor B IS select distinct bag_id from tshared;
        begin
          open B;
          loop
            fetch B into p_bagid;
            exit when B%NOTFOUND;
            cat := null;  -- reset per bag so a category is never carried over from the previous bag
            str := to_char(p_bagid)||' bag_id';
            j:=p_lmin;
            if p_norm then
              open CS(p_lmin,p_lmax,p_invf,p_bagid);
              fetch CS into norm;
              if norm=0 then norm:=1; end if;
              close CS;
            end if;
            for i in C(p_lmin,p_lmax,p_invf,p_bagid) loop
              if cat is null then cat := i.category; end if;
              if j<i.rank then
                for n in j .. i.rank-1 loop str := str || ', 0 A' || to_char(n); end loop;
              end if;
              str := str || ', ' || to_char(i.count/norm,'9999.99999999') || ' A' || to_char(i.rank);
              j := i.rank+1;
            end loop;
            if cat is null or j<=p_lmax then
              select distinct category into cat from tshared where bag_id=p_bagid;
              for n in j .. p_lmax loop str := str || ', 0 A' || to_char(n); end loop;
            end if;
            str := str || ', ''' || cat || ''' category';
            if first then
              first := false;
              begin execute immediate 'drop table data3'; exception when others then null; end;
              execute immediate 'create table data3 as select ' || str || ' from dual';
            else
              execute immediate 'insert into data3 select ' || str || ' from dual';
            end if;
          end loop;
          close B;  -- release the bag cursor once all bags are processed
          commit;
        end;
        /
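The vertical-to-horizontal transformation performed by the SQL on slides 20-22 can also be sketched outside the database. The Python below is only an illustration under assumptions: the function name `to_horizontal` and the `(bag_id, rank, count, category)` row layout are hypothetical, and the normalization uses the L2 norm of the counts, which may differ from what the CS cursor in the slides computes.

```python
import math

def to_horizontal(rows, lmin, lmax, normalize=True):
    """Pivot vertical rows (bag_id, rank, count, category) into one row per bag.

    Each output row is (bag_id, A_lmin, ..., A_lmax, category); ranks that a bag
    does not contain are filled with zeros, as in the PL/SQL loops.
    """
    width = lmax - lmin + 1
    table = {}
    for bag_id, rank, count, category in rows:
        # Register every bag, even if none of its ranks fall inside the range.
        vec, _ = table.setdefault(bag_id, ([0.0] * width, category))
        if lmin <= rank <= lmax:
            vec[rank - lmin] = float(count)
    result = []
    for bag_id in sorted(table):
        vec, category = table[bag_id]
        if normalize:
            # L2 norm of the counts (assumption); zero vectors stay unchanged.
            norm = math.sqrt(sum(v * v for v in vec)) or 1.0
            vec = [v / norm for v in vec]
        result.append((bag_id, *vec, category))
    return result
```

Doing the pivot in application code avoids generating one correlated subquery per column, which is the part the slides describe as quite slow.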
    23. Conclusions so far ...
        • File classification using N-gram representation is practically feasible
        • Aggregating N-grams into multigrams is like adding features (as is also done when combining with other methods; useful in some cases, not always), but it did not degrade performance
        • Attribute pruning, by contrast, is almost always useful (e.g. random subspace, information gain, or any other available pruning method)
    24. Questions ... ?