Weighted Naïve Bayes Model for Semi-Structured Document Categorization

    Weighted Naïve Bayes Model for Semi-Structured Document Categorization - Presentation Transcript

    1. Weighted Naive Model for Semi-structured Document Categorization
       Pierre-Francois Marteau, Gildas Menier, Eugen Popovici
       {pierre-francois.marteau, gildas.menier, eugen.popovici}@univ-ubs.fr
       VALORIA, Université de Bretagne Sud, Vannes, France
    2. Principles
       • Supervised classification of semi-structured data, i.e. XML documents and information retrieval
       • Extension of Bayesian classification
       • Weighting strategy
       • Tested on the Reuters-21578 data
    3. Introduction
       • Text classification: Bayes, k-nearest neighbours, SVM, decision trees, rules, etc.
       • Semi-structured text classification: XML, Bayes?
    4. XML Context Modeling
       • Document $d$
       • Document Object Model: a tree
       • A set of paths from the root to a leaf (leaf = XML element or text)
       • Context $c(n)$ = sequence of elements from the root to element $n$:
         $c(n) = \langle e(n_0), a(n_0) \rangle \langle e(n_1), a(n_1) \rangle \dots \langle e(n_p), a(n_p) \rangle$
       • $a()$: attributes, $e()$: XML element
    5. $e(l)$ = leaf = a set of word elements $v_i$ (lemma, stem, string, etc.).
       Each $v_i$ is attached to the context $c(l)$.
       $l$ is a set of $v_i$ sub-elements: $e(l) = \{v_i\}$.
       Using this definition of context: the Structural Context Augmented Naïve Bayesian Classification (SCANB) model.
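The context model above can be sketched in a few lines. This is only an illustration, not the authors' implementation: it uses Python's standard `xml.etree.ElementTree`, the helper name `leaf_contexts` and the sample text are hypothetical, and lower-casing plus whitespace splitting stands in for whatever lemma or stem extraction is actually used. Text that follows a child element (tail text) is ignored for simplicity.

```python
import xml.etree.ElementTree as ET


def leaf_contexts(xml_text):
    """Return (context, words) pairs, one per element that carries text.

    The context is a tuple of (tag, attributes) pairs along the path from
    the root down to the element holding the text.
    """
    root = ET.fromstring(xml_text)
    pairs = []

    def walk(node, path):
        # Extend the path with <e(n), a(n)> for the current node.
        here = path + ((node.tag, tuple(sorted(node.attrib.items()))),)
        text = (node.text or "").strip()
        if text:
            # Assumption: words are lower-cased, whitespace-separated tokens.
            pairs.append((here, text.lower().split()))
        for child in node:
            walk(child, here)

    walk(root, ())
    return pairs


# Hypothetical sample shaped like a Reuters entry.
sample = (
    '<FILE><REUTERS ID="21">'
    "<TITLE>grain exports rise</TITLE>"
    "<BODY>wheat and corn exports rise</BODY>"
    "</REUTERS></FILE>"
)
pairs = leaf_contexts(sample)
for context, words in pairs:
    print("/".join(tag for tag, _ in context), words)
```

Each word is then counted under its own context, so the same word in a title and in a body is treated as two different events.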
    6. SCANB Model
       • A posteriori probability of choosing class $\omega$ given a test document $d$ (Bayes):
         $P(\omega \mid d) = \frac{P(d \mid \omega)\, P(\omega)}{P(d)}$
       • Assimilation of $P(d \mid \omega)$ to $P(T_d \mid \omega)$, where $T_d$ is the document object model tree.
       • Assumptions: SMDC, SVM / splitting approaches
    7. SCANB Model
       • For SCANB: two hypotheses
       • First, H1: $r$ = root of the tree and $\{T_i\}$ = set of sibling subtrees accessible from $r$:
         $P(T_d \mid \omega) = P(r, T_1, T_2, \dots, T_k \mid \omega) = P(T_1, T_2, \dots, T_k \mid r, \omega)\, P(r \mid \omega)$
       • Assumption: the order of occurrence of the subtrees has no importance
    8. SCANB Model
       • For SCANB: two hypotheses
       • Second, H2:
         $P(T_d \mid \omega) = K_{w,r}\, P(T_1 \mid r, \omega)^{w_1}\, P(T_2 \mid r, \omega)^{w_2} \cdots P(T_k \mid r, \omega)^{w_k}\, P(r \mid \omega) = K_{w,r}\, P(r \mid \omega) \prod_{i=1}^{k} P(T_i \mid r, \omega)^{w_i}$
       • Assumption: $\{w_i\}$ is a set of positive weighting factors; $K_w$ is a normalizing factor.
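In log space, the weighted product of H2 becomes a weighted sum, which is easier to compute without underflow. A minimal sketch, assuming the subtree probabilities are already known; the function name and the numbers are hypothetical, and the normalizing factor is passed in as a plain log value defaulting to 0.

```python
import math


def weighted_log_likelihood(log_p_root, subtree_log_probs, weights, log_k=0.0):
    """log P(T_d | w) = log K + log P(r | w) + sum_i w_i * log P(T_i | r, w)."""
    if len(subtree_log_probs) != len(weights):
        raise ValueError("one weight is needed per subtree")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    weighted = sum(w * lp for w, lp in zip(weights, subtree_log_probs))
    return log_k + log_p_root + weighted


# Hypothetical values: P(r | w) = 0.5, P(T_1 | r, w) = 0.2, P(T_2 | r, w) = 0.1.
log_likelihood = weighted_log_likelihood(
    math.log(0.5), [math.log(0.2), math.log(0.1)], weights=[2.0, 0.5]
)
# Same quantity computed directly as a product of powers.
direct = 0.5 * 0.2 ** 2.0 * 0.1 ** 0.5
print(math.exp(log_likelihood), direct)
```

A weight above 1 sharpens the influence of a subtree on the decision, and a weight below 1 dampens it.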
    9. SCANB Model
       • From the two assumptions, $P(T_i \mid r, \omega)$ is decomposable into:
         $P(T_i \mid r, \omega) = K_{w,r_i}\, P(r_i \mid r, \omega) \prod_j P(T_{i,j} \mid r, r_i, \omega)^{w_j}$
         with $r_i$ the root of subtree $T_i$ and $T_{i,j}$ the subtrees accessible from $r_i$.
       • Rewritten with contexts ($r_{i,j}$ being the root of $T_{i,j}$):
         $K_{w,r_i}\, P(r_i \mid r, \omega) \prod_j P(T_{i,j} \mid r, r_i, \omega)^{w_j} = K_{w,r_i}\, P(r_i \mid c(r_i), \omega) \prod_j P(T_{i,j} \mid c(r_{i,j}), \omega)^{w_j}$
       • End of recursion:
         $P(T_d \mid \omega) = \prod_{n_i \in S_d} K_{w,n_i}\, P(n_i \mid c(n_i), \omega)^{w_{n_i}}$
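    10. Example
        XML document:
          <FILE>
            <REUTERS ID="21">
              <TITLE> texte1… </TITLE>
              <BODY> texte2… </BODY>
            </REUTERS>
          </FILE>
        Likelihood of its tree:
          $$
          \begin{aligned}
          P(T_d \mid \omega) = {} & P(\langle \text{FILE}, \emptyset \rangle \mid \omega) \\
          & \cdot K_{w,\text{reuters}}\, P(\langle \text{REUTERS}, \{\text{ID}=\text{"21"}\} \rangle \mid \langle \text{FILE}, \emptyset \rangle, \omega) \\
          & \cdot P(\langle \text{TITLE}, \emptyset \rangle \mid \langle \text{FILE}, \emptyset \rangle \langle \text{REUTERS}, \{\text{ID}=\text{"21"}\} \rangle, \omega)^{w_1} \\
          & \cdot P(\langle \text{"texte1..."}, \emptyset \rangle \mid \langle \text{FILE}, \emptyset \rangle \langle \text{REUTERS}, \{\text{ID}=\text{"21"}\} \rangle \langle \text{TITLE}, \emptyset \rangle, \omega)^{w_1} \\
          & \cdot P(\langle \text{BODY}, \emptyset \rangle \mid \langle \text{FILE}, \emptyset \rangle \langle \text{REUTERS}, \{\text{ID}=\text{"21"}\} \rangle, \omega)^{w_2} \\
          & \cdot P(\langle \text{"texte2..."}, \emptyset \rangle \mid \langle \text{FILE}, \emptyset \rangle \langle \text{REUTERS}, \{\text{ID}=\text{"21"}\} \rangle \langle \text{BODY}, \emptyset \rangle, \omega)^{w_2}
          \end{aligned}
          $$
        Last assumption, H3: as
          $P(l \mid c(l), \omega) = P(\langle e(l), a(l) \rangle \mid c(l), \omega) = P(e(l) \mid a(l), c(l), \omega)\, P(a(l) \mid c(l), \omega) = P(\{v_i\} \mid a(l), c(l), \omega)\, P(a(l) \mid c(l), \omega)$
        the leaf probability is factorized over the $v_i$:
          $P(l \mid c(l), \omega) = P(a(l) \mid c(l), \omega) \prod_i P(v_i \mid a(l), c(l), \omega)$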
    11. SCANB (with no attributes)
        • 3 hypotheses, H1, H2 and H3, and:
          $\omega^* = \arg\max_{\omega \in \Omega} P(\omega \mid T_d) = \arg\max_{\omega \in \Omega} P(T_d \mid \omega)\, P(\omega)$
          $P(T_d \mid \omega) = \prod_{n_i \in S_d} K_{w,n_i}\, P(n_i \mid c(n_i), \omega)^{w_{n_i}}$
          $P(l \mid c(l), \omega) = \prod_i P(v_i \mid c(l), \omega)$
        • $T_d$ is the tree for $d$; $n_i$ is a node in the tree $T_d$; $l$ is a leaf that decomposes into a set of $v_i$; $c(n_i)$ is the path from node $n_i$ to the root of $T_d$; $w_{n_i}$ is the weight for node $n_i$.
    12. SCANB (with no attributes)
        • Estimation of the probabilities $P(l \mid c(l), \omega)$:
          $\prod_i P(v_i \mid c(l), \omega) = \frac{N^{l,d}!\, \prod_i (\theta_i^{l,\omega})^{N_i^{l,d}}}{\prod_i N_i^{l,d}!}$
        • $N_i^{l,d}$: number of times $v_i$ occurs in the XML element attached to $l$, and
          $\theta_i^{l,\omega} = \frac{N_i^{l,\omega} + 1}{N^{l,\omega} + |V_l|}$
          (probability that $v_i$ occurs in $l$ for class $\omega$)
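The estimate on this slide has the shape of a multinomial likelihood with add-one smoothing, and can be sketched as follows. The function names and counts are hypothetical. Two further assumptions are made here: the total count is taken to be the sum of the per-word counts, and words outside the vocabulary of the leaf context are simply skipped.

```python
import math
from collections import Counter


def smoothed_theta(class_counts, vocabulary):
    """theta_i = (N_i + 1) / (N + |V|) for one leaf context and one class."""
    total = sum(class_counts[v] for v in vocabulary)
    return {v: (class_counts[v] + 1) / (total + len(vocabulary)) for v in vocabulary}


def leaf_log_likelihood(doc_words, theta):
    """Log of N! * prod_i theta_i^{N_i} / prod_i N_i! for one leaf of a document."""
    # Assumption: out-of-vocabulary words are ignored.
    counts = Counter(w for w in doc_words if w in theta)
    n = sum(counts.values())
    log_p = math.lgamma(n + 1)  # log N!
    for word, c in counts.items():
        log_p += c * math.log(theta[word]) - math.lgamma(c + 1)
    return log_p


# Hypothetical training counts for one context and one class.
vocabulary = ["wheat", "corn", "oil"]
class_counts = Counter({"wheat": 3, "corn": 1})
theta = smoothed_theta(class_counts, vocabulary)
log_p = leaf_log_likelihood(["wheat", "wheat", "oil"], theta)
print(theta, math.exp(log_p))
```

The added 1 in the numerator keeps a word that was never seen in training (here "oil") from driving the whole product to zero.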
    13. SCANB
        • In practice:
          $\omega^* = \arg\max_{\omega \in \Omega} \left\{ \log P(\omega) + LK_w + \sum_{n_i \in S_d} w_{n_i} \log P(n_i \mid c(n_i), \omega) \right\}$
          with $LK_w = \sum_{n_i} \log K_{w,n_i}$
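The decision rule in log form can be sketched as a small scoring function. This is an illustration under stated assumptions rather than the authors' code: the name `scanb_decide`, the class labels and all probabilities are hypothetical, weights are supplied per class, and the $LK_w$ term is an optional per-class value that defaults to 0.

```python
import math


def scanb_decide(log_priors, node_log_probs, node_weights, log_k=None):
    """Return (best_class, scores) for the weighted log-space decision rule.

    node_log_probs[cls][i] is log P(n_i | c(n_i), cls) and
    node_weights[cls][i] is the weight of node n_i for that class.
    """
    log_k = log_k or {}
    scores = {}
    for cls, log_prior in log_priors.items():
        weighted_sum = sum(
            w * lp for w, lp in zip(node_weights[cls], node_log_probs[cls])
        )
        scores[cls] = log_prior + log_k.get(cls, 0.0) + weighted_sum
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical two-node document (title, body) and two classes.
log_priors = {"grain": math.log(0.5), "earn": math.log(0.5)}
node_log_probs = {
    "grain": [math.log(0.2), math.log(0.05)],
    "earn": [math.log(0.1), math.log(0.1)],
}
title_heavy = {"grain": [2.0, 1.0], "earn": [2.0, 1.0]}
body_heavy = {"grain": [1.0, 2.0], "earn": [1.0, 2.0]}

best_title, scores_title = scanb_decide(log_priors, node_log_probs, title_heavy)
best_body, scores_body = scanb_decide(log_priors, node_log_probs, body_heavy)
print(best_title, best_body)
```

With these made-up numbers the two weightings pick different classes, which shows how the structural weights can change the outcome even when the underlying probabilities stay the same.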
    14. Experimentation
        • Dataset: Reuters-21578, 10 most frequent categories, 3964 doc … 286 documents
          $\Omega$ = {Acq, Corn, Crude, Earn, Interest, Ship, Trade, Grain, Money-fx, Wheat}
        • Weights:
          $w_n = \begin{cases} \dfrac{|V_{n,\omega}|}{|d_n|}, & \text{if } n \text{ has a textual element attached} \\ 1, & \text{otherwise} \end{cases}$
          $|V_{n,\omega}|$: cardinal of the vocabulary attached to $\omega$; $|d_n|$: size of the textual element attached to node $n$ in $d$.
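The weighting rule translates directly into a small helper. The interpretation used here is an assumption: the vocabulary is taken as the set of distinct words known for the node's context in a given class, and the size of the textual element as its number of word tokens. The function name and the sample values are hypothetical.

```python
def node_weight(class_vocabulary, node_words):
    """w_n = |V_{n,class}| / |d_n| if the node carries text, else 1."""
    if not node_words:
        # Structural node without text: neutral weight.
        return 1.0
    return len(class_vocabulary) / len(node_words)


# Hypothetical vocabulary of 4 words and a 2-token title.
title_weight = node_weight({"wheat", "corn", "grain", "export"}, ["wheat", "export"])
structural_weight = node_weight(set(), [])
print(title_weight, structural_weight)
```

Under this reading, a short textual element such as a title receives a larger weight per word than a long body.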
    15. Experimentation
        • Precision & Recall:
          $R = \frac{\sum_{\omega_i \in \Omega} TP(\omega_i)}{\sum_{\omega_i \in \Omega} TP(\omega_i) + \sum_{\omega_i \in \Omega} FN(\omega_i)}$
          $P = \frac{\sum_{\omega_i \in \Omega} TP(\omega_i)}{\sum_{\omega_i \in \Omega} TP(\omega_i) + \sum_{\omega_i \in \Omega} FP(\omega_i)}$
        • TP (true positives), FP (false positives), FN (false negatives)
        • Finally:
          $F_1(P, R) = \frac{2 \cdot P \cdot R}{P + R}$
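These measures pool the counts over all categories before dividing (micro-averaging). A minimal sketch with hypothetical per-category counts; the function name is illustrative, and returning 0.0 for an empty denominator is a convention chosen here, not something stated on the slide.

```python
def micro_scores(per_class_counts):
    """Micro-averaged precision, recall and F1 from per-class TP/FP/FN counts."""
    tp = sum(c["tp"] for c in per_class_counts.values())
    fp = sum(c["fp"] for c in per_class_counts.values())
    fn = sum(c["fn"] for c in per_class_counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for two categories.
counts = {
    "acq": {"tp": 90, "fp": 10, "fn": 5},
    "earn": {"tp": 80, "fp": 20, "fn": 15},
}
precision, recall, f1_score = micro_scores(counts)
print(precision, recall, f1_score)
```

Because the counts are pooled, frequent categories dominate the result, which matters on a skewed collection such as Reuters-21578.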
    16. Experimentation
        Measures for the NB, NBS, SVM and SVMS models, from Bratko and Filipic's work on the Reuters-21578 database:

        Measure     NB      NBS     SVM     SVMS
        Recall      0.9623  0.9548  0.9214  0.9660
        Precision   0.8280  0.8485  0.9658  0.9178
        F1          0.8901  0.8985  0.9431  0.9413

        Measures for the NB, NBS and SCANB models on the Reuters-21578 database, according to our tests:

        Measure     NB      NBS     SCANB
        Recall      0.9185  0.9241  0.9540
        Precision   0.8540  0.8586  0.9294
        F1          0.8851  0.8902  0.9415
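As a quick consistency check, the F1 rows can be recomputed from the reported precision and recall with the harmonic-mean formula. The sketch below only re-uses the figures from the two tables; since those inputs are rounded to four decimals, agreement is expected only to within about one unit in the last digit.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as given in the two tables.
reported = {
    "NB (Bratko and Filipic)": (0.8280, 0.9623, 0.8901),
    "NBS (Bratko and Filipic)": (0.8485, 0.9548, 0.8985),
    "SVM (Bratko and Filipic)": (0.9658, 0.9214, 0.9431),
    "SVMS (Bratko and Filipic)": (0.9178, 0.9660, 0.9413),
    "NB (our tests)": (0.8540, 0.9185, 0.8851),
    "NBS (our tests)": (0.8586, 0.9241, 0.8902),
    "SCANB (our tests)": (0.9294, 0.9540, 0.9415),
}

max_gap = 0.0
for name, (p, r, f1_reported) in reported.items():
    recomputed = f1(p, r)
    max_gap = max(max_gap, abs(recomputed - f1_reported))
    print(f"{name}: recomputed {recomputed:.4f}, reported {f1_reported:.4f}")

# F1 gaps between SCANB and the two Bayesian baselines in the authors' tests.
gap_vs_nb = 0.9415 - 0.8851
gap_vs_nbs = 0.9415 - 0.8902
print(f"gap vs NB: {gap_vs_nb:.4f}, gap vs NBS: {gap_vs_nbs:.4f}")
```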
    17. Conclusion
        • Improvement of 5.6% / 5.1% in F1 against the NB / NBS models (0.9415 versus 0.8851 / 0.8902 in our tests)
        • Better classification accuracy
        • Performs well compared with the SVM model
        • SVM seems insensitive to the structural organization of documents
        • SCANB extends naïve Bayesian classification while integrating knowledge of document structure
