ETRAN 2008
A Java-based n-gram generator that writes ARFF files for Weka, and how to use it with SVM for document classification

Transcript

  • 1. ETRAN 2008. Modified kNN Method, SVM, and N-grams in Automatic Recommendation Systems. Zoran Popović, Center for Multidisciplinary Studies, University of Belgrade, shoom013@gmail.com
  • 2. Automatic recommendation systems (ACRS and Information Retrieval)
      • automatic content recommendation systems (searching large numbers of documents, managing such data and metadata, searches and queries)
      • CBR/CBF (Content Based Retrieval, independent of the user), CF (Collaborative Filtering, dependent on the user)
      • user information need
  • 3. Nearest neighbor method (kNN)
      • The category is estimated from the given instances (the training set), weighted according to a given metric and/or other requirements (an extension toward CF characteristics)
      • A-O-V with numeric values; alternatively, IBA fuzzy attribute values
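
The weighted voting described on this slide can be sketched as follows. This is a minimal illustration, not the modified kNN method from the talk: the class name `WeightedKnn`, the Euclidean metric, and the weight `1 / (1 + distance)` are all assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Distance-weighted k-nearest-neighbour voting (illustrative sketch, hypothetical names). */
final class WeightedKnn {

    /** One training instance: a numeric feature vector and its category label. */
    static final class Instance {
        final double[] features;
        final String label;

        Instance(double[] features, String label) {
            this.features = features;
            this.label = label;
        }
    }

    /** Euclidean distance; any other metric could be substituted here. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns the label with the largest total weight among the k nearest instances. */
    static String classify(List<Instance> training, double[] query, int k) {
        List<Instance> sorted = new ArrayList<>(training);
        sorted.sort(Comparator.comparingDouble((Instance inst) -> distance(inst.features, query)));

        Map<String, Double> votes = new HashMap<>();
        for (Instance inst : sorted.subList(0, Math.min(k, sorted.size()))) {
            // Assumed weighting: closer neighbours count more.
            double weight = 1.0 / (1.0 + distance(inst.features, query));
            votes.merge(inst.label, weight, Double::sum);
        }

        String best = null;
        double bestWeight = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : votes.entrySet()) {
            if (e.getValue() > bestWeight) {
                bestWeight = e.getValue();
                best = e.getKey();
            }
        }
        return best; // null when the training set is empty
    }
}
```

With this weighting a single very close neighbour can outvote two distant ones, which is the practical difference from plain majority voting.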
  • 4. N-grams
      • an N-gram is a substring of a given string over a given alphabet of tokens
      • N-gram profiles
      • frequencies and inverse frequencies
      • NLP and N-grams, multigrams
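
As a rough illustration of an N-gram profile, the sketch below counts character N-grams, turns the counts into relative frequencies, and computes an inverse document frequency. It assumes characters as tokens and the common `log(N / df)` definition of inverse frequency; the slides do not say which definitions ngram.jar uses, and the class name `NGramProfile` is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Character N-gram profile helpers (illustrative sketch, hypothetical names). */
final class NGramProfile {

    /** Counts every contiguous window of n characters in the text. */
    static Map<String, Integer> counts(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    /** Relative frequency of each N-gram: its count divided by the number of windows. */
    static Map<String, Double> frequencies(String text, int n) {
        int total = text.length() - n + 1; // number of windows; not used when the text is too short
        Map<String, Double> freq = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts(text, n).entrySet()) {
            freq.put(e.getKey(), e.getValue() / (double) total);
        }
        return freq;
    }

    /** Assumed inverse frequency: log(N / df), or 0 when no document contains the N-gram. */
    static double inverseDocumentFrequency(String gram, List<String> documents) {
        int df = 0;
        for (String doc : documents) {
            if (doc.contains(gram)) {
                df++;
            }
        }
        return df == 0 ? 0.0 : Math.log(documents.size() / (double) df);
    }
}
```

For binary files the same idea applies with bytes as tokens instead of characters.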
  • 5. SVM, MIL, SMO
      • classification methods based on margin maximization
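
For reference, margin maximization is usually written as the soft-margin problem below. This is the standard textbook form, not a formula taken from the slides; the symbols $w$, $b$, $\xi_i$, $C$ and the labels $y_i \in \{-1, +1\}$ follow the usual convention.

```latex
\min_{w,\,b,\,\xi} \quad \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n
```

The margin between the two classes has width $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert^{2}$ maximizes it. SMO is an algorithm for solving the dual of this problem.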
  • 6. ngram.jar
      • java -Xms1500M -Xmx1500M -cp .\ngram.jar ngram.generator.Arff %1 %2 %3 %4 %5 %6 %7 %8 %9 %10 %11 %12 %13 %14
      • arff.cmd . .\out.arff -l 1 -m 500 -N 4 -i 0.5 -D 1048576 (subdirectories containing files are treated as classes)
      • http://users.hemo.net/shoom/n-gram.zip http://users.hemo.net/shoom/samples.zip
      • http://users.hemo.net/shoom/mustAgent.zip
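
The "subdirectories as classes" convention can be sketched as below. This is only an illustration of the idea and not the code inside ngram.jar: the class name `ClassDirectories` is hypothetical, and it assumes that files directly under the root are ignored and that nested subdirectories are searched recursively.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Maps class labels to files, one label per immediate subdirectory (illustrative sketch). */
final class ClassDirectories {

    /** Returns subdirectory name -> regular files found beneath that subdirectory. */
    static Map<String, List<Path>> scan(Path root) throws IOException {
        List<Path> classDirs;
        try (Stream<Path> children = Files.list(root)) {
            // Assumption: only directories define classes; loose files in the root are skipped.
            classDirs = children.filter(Files::isDirectory).sorted().collect(Collectors.toList());
        }

        Map<String, List<Path>> byClass = new TreeMap<>();
        for (Path dir : classDirs) {
            try (Stream<Path> files = Files.walk(dir)) {
                List<Path> regular = files.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
                byClass.put(dir.getFileName().toString(), regular);
            }
        }
        return byClass;
    }
}
```

Calling `ClassDirectories.scan(Paths.get("."))` on a directory laid out like the sample set would yield one map entry per file type.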
  • 7. Results
      • classes 1-8 from 210 files (about 5 MB total): 22 exe/com, 23 text, 56 html, 17 pdf, 33 gif/jpg, 19 jar, 30 Word, 13 mail

        N   i-threshold   Lmax   % correct / not   N-grams   seconds
        4   0.34           500   98.48 / 1.52      2094646   6.27
        4   0.5            500   94.83 / 5.16      2094646   6.16
        4   0.5            500   94.83 / 5.16      1048576   6.28
        3   0.34           500   97.17 / 2.83      1807820   6.8
        4   0.34           800   98.10 / 1.90      2094646   8.14
        5   0.34           800   97.12 / 2.88      2247852   8.38
        4   0.34          1000   94.76 / 5.24      2094646   8.28
        2   0.34           800   92.16 / 7.84        65536   7.25
  • 8. Weka, Eclipse
  • 9. Example ARFF file
        @relation rel
        @attribute bag_id {bag0,bag1,bag2,bag3,bag4,bag5,...2}
        @attribute bag relational
        @attribute a1 numeric
        @attribute a2 numeric
        ....
        @end bag
        @attribute class {1,2,3,4,5,6,7,8}
        @data
        bag0,"{41 0.2148861237401014, 42 0.13430382733756338, 47 0.1074430618700507, ..., 495 0.05372153093502535}",1
        bag1,"{....}",8
        ....
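
A file with this layout could be produced along the following lines. This is a sketch that only mirrors the text layout shown on the slide: the class `BagArffWriter` and its methods are hypothetical, it is not the writer used by ngram.jar, and whether Weka accepts every detail of this output has not been verified here.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.StringJoiner;

/** Formats a multi-instance ARFF header and sparse bag rows (illustrative sketch). */
final class BagArffWriter {

    /** Builds the header: bag ids, a relational "bag" attribute with numeric columns, class labels. */
    static String header(String relation, List<String> bagIds, int attributeCount, List<String> classes) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append('\n');
        sb.append("@attribute bag_id {").append(String.join(",", bagIds)).append("}\n");
        sb.append("@attribute bag relational\n");
        for (int i = 1; i <= attributeCount; i++) {
            sb.append("@attribute a").append(i).append(" numeric\n");
        }
        sb.append("@end bag\n");
        sb.append("@attribute class {").append(String.join(",", classes)).append("}\n");
        sb.append("@data\n");
        return sb.toString();
    }

    /** Builds one data row: bag id, quoted sparse "{index value, ...}" list, class label. */
    static String row(String bagId, SortedMap<Integer, Double> values, String classLabel) {
        StringJoiner sparse = new StringJoiner(", ", "{", "}");
        for (Map.Entry<Integer, Double> e : values.entrySet()) {
            sparse.add(e.getKey() + " " + e.getValue());
        }
        return bagId + ",\"" + sparse + "\"," + classLabel;
    }
}
```

Only the non-zero N-gram weights are written, keyed by attribute index, which keeps rows short when most of the attributes are zero for a given file.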