Machine Learning Methods
              for
    Text / Web Data Mining

                    Byoung-Tak Zhang
       School of Computer Science and Engineering
                 Seoul National University
              E-mail: btzhang@cse.snu.ac.kr

               This material is available at
            http://scai.snu.ac.kr./~btzhang/




Overview
 Introduction
  • Web Information Retrieval
  • Machine Learning (ML)
  • ML Methods for Text/Web Data Mining
 Text/Web Data Analysis
  • Text Mining Using Helmholtz Machines
  • Web Mining Using Bayesian Networks
 Summary
  • Current and Future Work
Web Information Retrieval

 Text data passes through preprocessing and indexing, then feeds three kinds of systems:

  • Classification system: text classification
  • Information extraction system: DB template filling and information
    extraction, producing DB records (e.g. Location, Date fields) stored in a DB
  • Information filtering system: keeps a user profile, delivers filtered data,
    answers user questions, and takes user feedback




      Machine Learning
         Supervised Learning
          • Estimate an unknown mapping from known input-output pairs
          • Learn f_w from a training set D = {(x, y)} s.t. f_w(x) = y = f(x)
          • Classification: y is discrete
          • Regression: y is continuous
         Unsupervised Learning
          • Only input values are provided
          • Learn f_w from D = {(x)} s.t. f_w(x) = x
          • Density estimation
          • Compression, clustering
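To make the two settings concrete, here is a minimal sketch on invented toy data: a 1-nearest-neighbour rule stands in for the learned mapping f_w, and a simple mean stands in for unsupervised density estimation. Both the data and the choice of learner are illustrative assumptions, not from the talk.

```python
# Supervised: learn a mapping from known (x, y) pairs; a 1-nearest-neighbour
# classifier stands in for f_w (toy data, for illustration only).
train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((4.0, 4.1), "B"), ((3.9, 4.3), "B")]

def classify(x):
    sq_dist = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q))
    return min(train, key=lambda pair: sq_dist(pair[0], x))[1]

# Unsupervised: only inputs are given; here we just summarize their density
# by the sample mean.
inputs = [p for p, _ in train]
mean = tuple(sum(c) / len(inputs) for c in zip(*inputs))
```

A query near the "A" cluster, e.g. `classify((1.1, 1.0))`, returns `"A"`.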
Machine Learning Methods
 Neural Networks
  • Multilayer Perceptrons (MLPs)
  • Self-Organizing Maps (SOMs)
  • Support Vector Machines (SVMs)
 Probabilistic Models
  • Bayesian Networks (BNs)
  • Helmholtz Machines (HMs)
  • Latent Variable Models (LVMs)
 Other Machine Learning Methods
  • Evolutionary Algorithms (EAs)
  • Reinforcement Learning (RL)
  • Boosting Algorithms
  • Decision Trees (DTs)




ML for Text/Web Data Mining
 Bayesian Networks for Text Classification
 Helmholtz Machines for Text Clustering/Categorization
 Latent Variable Models for Topic Word Extraction
 Boosted Learning for TREC Filtering Task
 Evolutionary Learning for Web Document Retrieval
 Reinforcement Learning for Web Filtering Agents
 Bayesian Networks for Web Customer Data Mining




Preprocessing for Text Learning

 Sample newsgroup message:

   From: xxx@sciences.sdsu.edu
   Newsgroups: comp.graphics
   Subject: Need specs on Apple QT

   I need to the specs, or at least a very verbose interpretation of the
   specs, for QuickTime. Technical articles from magazines and references
   to books would be nice, too.

   I also need the specs in a format usable on a Unix or MS-Dos system.
   I can't do much with the QuickTime stuff they have on..

 Resulting word-count vector:

   0 baseball
   0 car
   0 clinton
   0 computer
   0 graphics
   0 hockey
   2 quicktime
   .
   .
   1 references
   0 space
   3 specs
   1 unix
   .
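This preprocessing step can be sketched in a few lines: tokenize the message body and count occurrences of each word in a fixed vocabulary. The vocabulary below is just the subset shown on the slide, not the full indexing vocabulary.

```python
from collections import Counter
import re

# Vocabulary from the slide (a small subset of a real indexing vocabulary).
VOCAB = ["baseball", "car", "clinton", "computer", "graphics",
         "hockey", "quicktime", "references", "space", "specs", "unix"]

def bag_of_words(text, vocab=VOCAB):
    """Count how often each vocabulary word occurs in the lowercased text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {w: counts[w] for w in vocab}

body = ("I need to the specs, or at least a very verbose interpretation of the "
        "specs, for QuickTime. Technical articles from magazines and references "
        "to books would be nice, too. I also need the specs in a format usable "
        "on a Unix or MS-Dos system. I can't do much with the QuickTime stuff "
        "they have on..")
vec = bag_of_words(body)  # e.g. vec["specs"] == 3, vec["quicktime"] == 2
```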




 Text Mining: Data Sets
     Usenet Newsgroup Data
      • 20 categories
      • 1000 documents for each category
      • 20000 documents in total
     TDT2 Corpus
      • Topic detection and tracking (TDT) corpus from NIST
      • 6,169 documents used in experiments
Text Mining:
     Helmholtz Machine Architecture

 A layer of latent nodes h_1, h_2, ..., h_m is fully connected to a layer of
 input nodes d_1, d_2, ..., d_n, with recognition weights (input to latent)
 and generative weights (latent to input):

   P(h_i = 1) = 1 / (1 + exp(-b_i - Σ_{j=1}^{n} w_ij d_j))

   P(d_i = 1) = 1 / (1 + exp(-b_i - Σ_{j=1}^{m} w_ij h_j))

 [Chang and Zhang, 2000]
 • Input nodes
   - Binary values
   - Represent the existence or absence of words in documents
 • Latent nodes
   - Binary values
   - Extract the underlying causal structure in the document set
   - Capture correlations of the words in documents
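Both conditional probabilities above are logistic units over binary states, so one helper covers the recognition and generative passes. This is a direct transcription of the formulas, with invented example weights:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_unit_on(bias, weights, states):
    """P(s_i = 1) = 1 / (1 + exp(-b_i - sum_j w_ij s_j)) for binary states s_j.
    The same form serves the recognition pass (inputs -> latent) and the
    generative pass (latent -> inputs)."""
    return sigmoid(bias + sum(w * s for w, s in zip(weights, states)))

# With zero bias and zero weights the unit is maximally uncertain:
p = p_unit_on(0.0, [0.0, 0.0, 0.0], [1, 1, 1])  # 0.5
```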




     Text Mining:
     Learning Helmholtz Machines

      • Introduce a recognition network for estimation of a generative network.

        log P(D | θ) = Σ_{t=1}^{T} log Σ_{α^(t)} P(d^(t), α^(t) | θ)
                     = Σ_{t=1}^{T} log Σ_{α^(t)} Q(α^(t)) [P(d^(t), α^(t) | θ) / Q(α^(t))]
                     ≥ Σ_{t=1}^{T} Σ_{α^(t)} Q(α^(t)) log [P(d^(t), α^(t) | θ) / Q(α^(t))]

      • Wake-Sleep Algorithm
        - Train the recognition and generative models alternately.
        - Update the weights in the network iteratively by a simple local delta rule:

          w_ij^new = w_ij^old + Δw_ij,   Δw_ij = γ s_i (s_j − p(s_j = 1))
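The local delta rule above can be written as a single weight-matrix update, given the sampled binary states of the pre- and post-synaptic layers and the predicted activation probabilities (example numbers are invented):

```python
def delta_rule_update(W, gamma, s_pre, s_post, p_post):
    """One local delta-rule step: w_ij <- w_ij + gamma * s_i * (s_j - p(s_j=1)),
    where s_pre/s_post are sampled binary states and p_post[j] = p(s_j = 1)."""
    return [[W[i][j] + gamma * s_pre[i] * (s_post[j] - p_post[j])
             for j in range(len(s_post))]
            for i in range(len(s_pre))]

W = [[0.0, 0.0]]
W = delta_rule_update(W, gamma=0.1, s_pre=[1], s_post=[1, 0], p_post=[0.5, 0.5])
# The weight to the "on" unit rises by 0.05, to the "off" unit falls by 0.05.
```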
Text Mining: Methods
  Text Categorization
   • Train a Helmholtz machine for each category.
   • Total N machines for N categories.
   • Once the N machines have been estimated, classification of a test
     document proceeds by estimating the likelihood of the document
     for each machine:

       ĉ = argmax_{c∈C} [log P(d | c)]

  Topic Words Extraction
   • For the entire document set, train a single Helmholtz machine.
   • After training, examine the weights of connections from a latent
     node to the input nodes, that is, the words.
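The topic-word extraction step reduces to ranking the vocabulary by the generative weight from one latent node. A sketch with invented weights and words:

```python
def topic_words(gen_weights, vocab, top=5):
    """Rank words by the generative weight from one latent node to each input
    node; large positive weights mark words the latent 'topic' turns on."""
    ranked = sorted(zip(gen_weights, vocab), reverse=True)
    return [word for _, word in ranked[:top]]

words = topic_words([0.1, 2.0, -0.3, 1.5],
                    ["bank", "olympic", "tax", "medal"], top=2)
# -> ["olympic", "medal"]
```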




Text Mining: Categorization Results
  • Usenet Newsgroup Data
    - 20 categories, 1000 documents for each category, 20000 documents in total.
Text Mining: Topic Words Extraction Results
  • TDT2 Corpus
  • 6,169 documents
  1    tabacco, smoking, gingrich, newt, trent, republicans, congressional, republicans, attorney,
       smokers, lawsuit, senate, cigarette, morris, nicotine

  2    warplane, airline, saudi, gulf, wright, soldiers, yitzhak, tanks, stealth, sabah, stations, kurds,
       mordechai, separatist, governor

  3    olympics, nagano, olympic, winter, medal, hockey, atheletes, cup, games, slalom, medals, bronze,
       skating, lillehammer, downhill

  4    netanyahu, palestinian, arafat, israeli, yasser, kofi, annan, benjamin, palestinians, mideast, gaza,
       jerusalem, eu, paris, israel

  5    India, pakistan, pakistani, delhi, hindu, vajpayee, nuclear, tests, atal, kashmir, indian, janata,
       bharatiya, islamabad, bihari

  6    Suharto, habibie, demonstrators, riots, indonesians, demonstrations, soeharto, resignation,
       jakarta, rioting, electoral, rallies, wiranto, unrest, megawati

  7    imf, monetary, currencies, currency, rupiah, singapore, bailout, traders, markets, thailand,
       inflation, investors, fund, banks, baht

  8    pope, cuba, cuban, embargo, castro, lifting, cubans, havana, alan, invasion, reserve, paul, output,
       vatican, freedom




Web Mining: Customer Analysis
      KDD-2000 Web Mining Competition
       • Data: 465 features over 1700 customers
          - Features include friend promotion rate, date visited,
            weight of items, price of house, discount rate, ...
          - Data was collected during Jan. 30 - March 30, 2000
          - Friend promotion was started from Feb. 29 with TV advertisement.
       • Aims: description of heavy/low spenders
Web Mining: Feature Selection
       Features selected in various ways [Yang & Zhang, 2000]

 Decision Tree + Factor Analysis:
  • V368 (Weight Average)
  • V243 (OrderLineQuantitySum)
  • V245 (OrderLineQuantityMaximum)
  • F1 = 0.94*V324 + 0.868*V374 + 0.898*V412
  • F2 = 0.829*V234 + 0.857*V240
  • F3 = -0.795*V237 + 0.778*V304

 Decision Tree:
  • V13 (SendEmail)
  • V234 (OrderItemQuantitySum% HavingDiscountRange(5 . 10))
  • V237 (OrderItemQuantitySum% HavingDiscountRange(10.))
  • V240 (Friend)
  • V243 (OrderLineQuantitySum)
  • V245 (OrderLineQuantityMaximum)
  • V304 (OrderShippingAmtMin)
  • V324 (NumLegwearProductViews)
  • V368 (Weight Average)
  • V374 (NumMainTemplateViews)
  • V412 (NumReplenishableStockViews)

 Discriminant Model:
  • V240 (Friend)
  • V229 (Order-Average)
  • V304 (OrderShippingAmtMin)
  • V368 (Weight Average)
  • V43 (Home Market Value)
  • V377 (NumAcountTemplateViews)
  • V11 (WhichDoYouWearMostFrequent)
  • V13 (SendEmail)
  • V17 (USState)
  • V45 (VehicleLifeStyle)
  • V68 (RetailActivity)
  • V19 (Date)




    Web Mining: Bayesian Nets
        Bayesian network
          • DAG (Directed Acyclic Graph)
          • Expresses dependence relations between variables
          • Can use prior knowledge on the data (parameters)

          Example with nodes A, B, C, D, E:

            P(A,B,C,D,E) = P(A)P(B|A)P(C|B)P(D|A,B)P(E|B,C,D)

          • Examples of conjugate priors:
            Dirichlet for multinomial data, Normal-Wishart for normal data
Web Mining: Results

 A Bayesian net learned from the KDD web data:

 • V229 (Order-Average) and V240 (Friend) directly influence V312 (Target).
 • V19 (Date) was influenced by V240 (Friend), reflecting the TV advertisement.




   Summary
      We study machine learning methods, such as
       • Probabilistic neural networks
       • Evolutionary algorithms
       • Reinforcement learning
      Application areas include
       • Text mining
       • Web mining
       • Bioinformatics (not addressed in this talk)
      Recent work focuses on probabilistic graphical models for
      web/text/bio data mining, including
       • Bayesian networks
       • Helmholtz machines
       • Latent variable models




Bayesian Networks:
Architecture

 For the four-variable network shown (B, L, G, M), the chain rule and the
 conditional independences encoded by the graph give:

   P(L, B, G, M) = P(L) P(B|L) P(G|L, B) P(M|L, B, G)
                 = P(L) P(B) P(G|B) P(M|B, L)

 A Bayesian network represents the probabilistic
 relationships between the variables:

   P(X) = Π_{i=1}^{n} P(X_i | pa_i),   where pa_i is the set of parent nodes of X_i.
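The factorization can be checked numerically: multiplying the local conditional tables gives a properly normalized joint. The CPT numbers below are invented for illustration; only the structure P(L)P(B)P(G|B)P(M|B,L) comes from the slide.

```python
from itertools import product

P_L = {True: 0.3, False: 0.7}
P_B = {True: 0.6, False: 0.4}

def P_G(g, b):                       # P(G = g | B = b)
    p = 0.9 if b else 0.2
    return p if g else 1 - p

def P_M(m, b, l):                    # P(M = m | B = b, L = l)
    p = 0.7 if (b and l) else 0.1
    return p if m else 1 - p

def joint(l, b, g, m):
    """P(L,B,G,M) = P(L) P(B) P(G|B) P(M|B,L)."""
    return P_L[l] * P_B[b] * P_G(g, b) * P_M(m, b, l)

# Summing the joint over all 16 assignments yields 1.
total = sum(joint(*vals) for vals in product([True, False], repeat=4))
```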
Bayesian Networks:
Applications in IR – A Simple BN for Text Classification

        C  →  t1, t2, ..., t8754

 • C: document class
 • ti: i-th term

  The network structure represents the naïve Bayes assumption.
  All nodes are binary.
  [Hwang & Zhang, 2000]




  Bayesian Networks:
  Experimental Results
     Dataset
      • The acq dataset from Reuters-21578
      • 8754 terms were selected by TFIDF.
      • Training data: 8762 documents
      • Test data: 3009 documents
     Parametric Learning
      • Dirichlet prior assumptions for the network parameter distributions:

          p(θ_ij | S^h) = Dir(θ_ij | α_ij1, ..., α_ijr_i)

      • Parameter distributions are updated with training data:

          p(θ_ij | D, S^h) = Dir(θ_ij | α_ij1 + N_ij1, ..., α_ijr_i + N_ijr_i)
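The Dirichlet update above is just elementwise addition of observed counts to the prior pseudo-counts; the posterior mean then gives the estimated multinomial parameters. A minimal sketch with invented counts:

```python
def dirichlet_posterior(alphas, counts):
    """Dir(a_1,...,a_r) prior plus multinomial counts N gives
    Dir(a_1 + N_1, ..., a_r + N_r)."""
    return [a + n for a, n in zip(alphas, counts)]

def posterior_mean(alphas):
    """Expected multinomial parameters under Dir(alphas)."""
    total = sum(alphas)
    return [a / total for a in alphas]

post = dirichlet_posterior([1, 1], [8, 2])   # uniform prior, 8 vs 2 observations
# post == [9, 3]; posterior_mean(post) == [0.75, 0.25]
```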
Bayesian Networks:
 Experimental Results
        For training data
         • Accuracy: 94.28%

                                Recall (%)   Precision (%)
           Positive examples        96.83          75.98
           Negative examples        93.76          99.32

        For test data
         • Accuracy: 96.51%

                                Recall (%)   Precision (%)
           Positive examples        95.16          89.17
           Negative examples        96.88          98.67
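For reference, the precision and recall figures in these tables follow the usual definitions over true/false positives and false negatives. The counts below are invented round numbers, not the experiment's confusion matrix (which the slide does not give):

```python
def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    return tp / (tp + fp), tp / (tp + fn)

p, r = precision_recall(tp=8, fp=2, fn=2)  # -> (0.8, 0.8)
```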




Latent Variable Models:
Architecture
 [Shin & Zhang, 2000]
 Maximize the log-likelihood:

   L = Σ_{n=1}^{N} Σ_{m=1}^{M} n(d_n, w_m) log P(d_n, w_m)
     = Σ_{n=1}^{N} Σ_{m=1}^{M} n(d_n, w_m) log Σ_{k=1}^{K} P(z_k) P(w_m | z_k) P(d_n | z_k)

 • Update P(z_k), P(w_m | z_k), P(d_n | z_k) with the EM algorithm.
 • The latent variable model is used for topic-word extraction (via P(w_m | z_k))
   and document clustering (via P(d_n | z_k)).
Latent Variable Models:
 Learning
    EM (Expectation-Maximization) Algorithm
     • Algorithm to maximize the pre-defined log-likelihood
     • Iterates the E-step and the M-step

    E-step:

      P(z_k | d_n, w_m) = P(z_k) P(d_n | z_k) P(w_m | z_k)
                          / Σ_{k=1}^{K} P(z_k) P(d_n | z_k) P(w_m | z_k)

    M-step:

      P(w_m | z_k) = Σ_{n=1}^{N} n(d_n, w_m) P(z_k | d_n, w_m)
                     / Σ_{m=1}^{M} Σ_{n=1}^{N} n(d_n, w_m) P(z_k | d_n, w_m)

      P(d_n | z_k) = Σ_{m=1}^{M} n(d_n, w_m) P(z_k | d_n, w_m)
                     / Σ_{m=1}^{M} Σ_{n=1}^{N} n(d_n, w_m) P(z_k | d_n, w_m)

      P(z_k) = (1/R) Σ_{m=1}^{M} Σ_{n=1}^{N} n(d_n, w_m) P(z_k | d_n, w_m),
               where R ≡ Σ_{m=1}^{M} Σ_{n=1}^{N} n(d_n, w_m)
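The E-step and M-step above translate almost line-for-line into code. This is a compact sketch of the update loop on a tiny count matrix, not the authors' implementation; the small floor constants guard against zero denominators and are an implementation assumption.

```python
import random

def lvm_em(n, K, iters=30, seed=0):
    """EM for the latent variable model. n[d][w] is the count of word w in
    document d; returns (P(z), P(w|z), P(d|z))."""
    rng = random.Random(seed)
    N, M = len(n), len(n[0])
    R = sum(sum(row) for row in n)
    norm = lambda v: [x / sum(v) for x in v]
    Pz = [1.0 / K] * K
    Pw = [norm([rng.random() + 1e-3 for _ in range(M)]) for _ in range(K)]
    Pd = [norm([rng.random() + 1e-3 for _ in range(N)]) for _ in range(K)]
    for _ in range(iters):
        # E-step: P(z|d,w) proportional to P(z) P(d|z) P(w|z)
        post = [[norm([Pz[k] * Pd[k][d] * Pw[k][w] for k in range(K)])
                 for w in range(M)] for d in range(N)]
        # M-step: re-estimate the three distributions from expected counts
        for k in range(K):
            Pw[k] = norm([sum(n[d][w] * post[d][w][k] for d in range(N)) + 1e-12
                          for w in range(M)])
            Pd[k] = norm([sum(n[d][w] * post[d][w][k] for w in range(M)) + 1e-12
                          for d in range(N)])
            Pz[k] = sum(n[d][w] * post[d][w][k]
                        for d in range(N) for w in range(M)) / R
    return Pz, Pw, Pd

Pz, Pw, Pd = lvm_em([[4, 0, 1], [0, 5, 1]], K=2)
```

After training, ranking words by `Pw[k]` gives the topic words of cluster k, and assigning each document to `argmax_k Pd[k][d]` gives the clustering.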




Latent Variable Models:
Applications in IR – Experimental Results

 Topic Words Extraction and Document Clustering with a
 subset of TREC-8 data
 TREC-8 adhoc task data
  • Documents: DTDS, FR94, FT, FBIS, LATIMES
  • Topics: 401-450 (401, 434, 439, 450 used)
    - 401: Foreign Minorities, Germany
    - 434: Estonia, Economy
    - 439: Inventions, Scientific discovery
    - 450: King Hussein, Peace
Latent Variable Models:
    Applications in IR – Experimental Results

    Label assigned to the z_k with maximum P(d_i | z_k):

     Topic (#Docs)   z2    z4    z3    z1    Precision   Recall
     401 (300)       279   1     0     20    0.902       0.930
     434 (347)       20    238   10    79    0.996       0.686
     439 (219)       7     0     203   9     0.953       0.927
     450 (293)       3     0     0     290   0.729       0.990

    Extracted topic words (top 35 words with highest P(wj | zk)):

    Cluster 2 (z2): german, germani, mr, parti, year, foreign, people, countri, govern,
      asylum, polit, nation, law, minist, europ, state, immigr, democrat, wing, social,
      turkish, west, east, member, attack, ...
    Cluster 4 (z4): percent, estonia, bank, state, privat, russian, year, enterprise, trade,
      million, trade, estonian, econom, countri, govern, compani, foreign, baltic, polish,
      loan, invest, fund, product, ...
    Cluster 3 (z3): research, technology, develop, mar, materi, system, nuclear, environment,
      electr, process, product, power, energi, countrol, japan, pollution, structur, chemic,
      plant, ...
    Cluster 1 (z1): jordan, peac, isreal, palestinian, king, isra, arab, meet, talk, husayn,
      agreem, presid, majesti, negoti, minist, visit, region, arafat, secur, peopl, east,
      washington, econom, sign, relat, jerusalem, rabin, syria, iraq, ...




        Boosting:
        Algorithms
            A general method of converting rough rules into a highly accurate
            prediction rule
            Learning procedure
             • Examine the training set
             • Derive a rough rule (weak learner)
             • Re-weight the examples in the training set, concentrating on the
               hard cases for previous rules
             • Repeat T times

            Each round trains a learner on the importance-weighted training
            documents, producing hypotheses h1, h2, h3, h4 that are combined
            into a final rule f(h1, h2, h3, h4).
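The re-weighting loop can be sketched with AdaBoost using one-dimensional threshold "stumps" as the rough rules. The weak learner and data are stand-ins chosen for brevity (the talk's weak learners are naïve Bayes classifiers); the weight update and weighted vote are the standard AdaBoost steps.

```python
import math

def adaboost(X, y, T=3):
    """AdaBoost with threshold stumps as weak learners; y in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n                       # importance weights of examples
    ensemble = []                           # (alpha, threshold, sign) triples
    for _ in range(T):
        best = None
        for thr in sorted(set(X)):          # derive the best rough rule
            for sign in (1, -1):
                pred = [sign if x >= thr else -sign for x in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, thr, sign, pred)
        err, thr, sign, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, sign))
        # Re-weight: concentrate on the examples this rule got wrong
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, pred)]
        s = sum(w)
        w = [wi / s for wi in w]
    def f(x):                               # weighted vote of the T rules
        score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
        return 1 if score >= 0 else -1
    return f

f = adaboost([1, 2, 3, 4, 5, 6], [-1, -1, -1, 1, 1, 1])
```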
Boosting:
  Applied to Text Filtering
    Naïve Bayes
     • Traditional algorithm for text filtering:

       c_NB = argmax_{c_j ∈ {relevant, irrelevant}} P(c_j) P(d_i | c_j)
            = argmax_{c_j} P(c_j) Π_{k=1}^{n} P(w_ik | c_j)      (assume independence among terms)
            = argmax_{c_j} P(c_j) P(w_i1 = "our" | c_j) P(w_i2 = "approach" | c_j) ...
              P(w_in = "trouble" | c_j)

    Boosting naïve Bayes
     • Using naïve Bayes classifiers as weak learners
     • [Kim & Zhang, SIGIR-2000]
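The naïve Bayes decision rule above, computed in log space to avoid underflow, looks as follows. The priors, conditional probabilities, and the floor for unseen words are invented illustration values, not estimates from the TREC data:

```python
import math

def nb_classify(words, priors, cond, floor=1e-6):
    """argmax over classes of P(c) * prod_k P(w_k | c), in log space; unseen
    words get a small floor probability (a common smoothing shortcut)."""
    def score(c):
        return math.log(priors[c]) + sum(math.log(cond[c].get(w, floor))
                                         for w in words)
    return max(priors, key=score)

priors = {"relevant": 0.5, "irrelevant": 0.5}
cond = {"relevant":   {"our": 0.2,  "approach": 0.3,  "trouble": 0.01},
        "irrelevant": {"our": 0.01, "approach": 0.01, "trouble": 0.3}}

label = nb_classify(["our", "approach"], priors, cond)  # "relevant"
```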




Boosting:
Applied to Text Filtering – Experimental Results
    TREC (Text Retrieval Conference)
     • Sponsored by NIST
    TREC-7 filtering datasets
     • Training documents: AP articles (1988), 237 MB, 79919 documents
     • Test documents: AP articles (1989~1990), 471 MB, 162999 documents
     • No. of topics: 50
    TREC-8 filtering datasets
     • Training documents: Financial Times (1991~1992), 167 MB, 64139 documents
     • Test documents: Financial Times (1993~1994), 382 MB, 140651 documents
     • No. of topics: 50

      Example of a document
Boosting:
Applied to Text Filtering – Experimental Results

 Compared with the state-of-the-art text filtering systems:

    TREC-7
      Averaged Scaled F1:  Boosting 0.474, ATT 0.461, NTT 0.452, PIRC 0.500
      Averaged Scaled F3:  Boosting 0.467, ATT 0.460, NTT 0.505, PIRC 0.509

    TREC-8
      Averaged Scaled LF1: Boosting 0.717, PLT1 0.712, PLT2 0.713, PIRC 0.714
      Averaged Scaled LF2: Boosting 0.722, CL 0.721, PIRC 0.734, Mer 0.720




  Evolutionary Learning:
  Applications in IR - Web-Document Retrieval
  [Kim & Zhang, 2000]

  Each chromosome encodes one weight per HTML tag:

     <TITLE>  <H>  <B>  ...  <A>
       w1     w2   w3   ...  wn

  Link information and HTML tag information are used for retrieval; the
  retrieval result determines the fitness of each chromosome in the population.
Evolutionary Learning:
      Applications in IR – Tag Weighting

          Crossover (chromosomes X and Y produce offspring Z):

            z_i = (x_i + y_i) / 2   with probability P_c

          Mutation (chromosome X becomes X'):

            change the value x_i with probability P_m

          Truncation selection
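These three operators can be sketched directly. The Gaussian perturbation used for mutation is an assumption (the slide only says "change value with probability P_m"), as are the default probabilities:

```python
import random

def crossover(x, y, pc=0.7, rng=random):
    """Intermediate crossover: z_i = (x_i + y_i) / 2 with probability pc,
    otherwise keep x_i."""
    return [(xi + yi) / 2 if rng.random() < pc else xi for xi, yi in zip(x, y)]

def mutate(x, pm=0.05, scale=0.1, rng=random):
    """Change each tag weight with probability pm (Gaussian perturbation here,
    an illustrative choice)."""
    return [xi + rng.gauss(0, scale) if rng.random() < pm else xi for xi in x]

def truncation_selection(pop, fitness, keep=0.5):
    """Keep the best fraction of chromosomes by fitness."""
    ranked = sorted(pop, key=fitness, reverse=True)
    return ranked[:max(1, int(len(ranked) * keep))]

child = crossover([0.0, 2.0], [2.0, 4.0], pc=1.0)   # always mix -> [1.0, 3.0]
```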




      Evolutionary Learning:
      Applications in IR - Experimental Results
          Datasets
           • TREC-8 Web Track Data
           • 2GB, 247491 web documents (WT2g)
           • No. of training topics: 10, No. of test topics: 10
          Results
Reinforcement Learning:
      Basic Concept

      The agent-environment loop:
       1. The agent observes state s_t (with reward r_t).
       2. The agent takes action a_t.
       3. The environment returns reward r_{t+1}.
       4. The environment moves to state s_{t+1}.
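The loop above is a few lines of code once the environment's step function and the agent's policy are given. Everything concrete below (the parity environment and policy) is an invented toy, just to make the loop executable:

```python
def run_episode(env_step, policy, s0, steps=3):
    """The loop on the slide: observe state s_t, choose action a_t, then
    receive reward r_{t+1} and the next state s_{t+1}."""
    s, total = s0, 0.0
    for _ in range(steps):
        a = policy(s)              # 2. action a_t
        s, r = env_step(s, a)      # 4. next state s_{t+1}, 3. reward r_{t+1}
        total += r
    return total

# Toy environment: reward 1 when the action matches the state's parity.
env_step = lambda s, a: (s + 1, 1.0 if a == s % 2 else 0.0)
policy = lambda s: s % 2
```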




      Reinforcement Learning:
      Applications in IR - Information Filtering
     [Seo & Zhang, 2000]

      WAIR retrieves documents and calculates their similarity, then interacts
      with the user in the same loop:
       1. State_i: the user profile
       2. Action_i: modify the profile; document filtering produces the
          filtered documents
       3. Reward_{i+1}: relevance feedback from the user
       4. State_{i+1}: the updated user profile
Reinforcement Learning:
Experimental Results (Explicit Feedback)

 [Chart: filtering performance (%)]




Reinforcement Learning:
Experimental Results (Implicit Feedback)

 [Chart: filtering performance (%)]

Neo4j MeetUp - Graph Exploration with MetaExpNeo4j MeetUp - Graph Exploration with MetaExp
Neo4j MeetUp - Graph Exploration with MetaExpAdrian Ziegler
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonRalf Gommers
 
Machine Learning ebook.pdf
Machine Learning ebook.pdfMachine Learning ebook.pdf
Machine Learning ebook.pdfHODIT12
 
1_5_AI_edx_ml_51intro_240204_104838machine learning lecture 1
1_5_AI_edx_ml_51intro_240204_104838machine learning lecture 11_5_AI_edx_ml_51intro_240204_104838machine learning lecture 1
1_5_AI_edx_ml_51intro_240204_104838machine learning lecture 1MostafaHazemMostafaa
 
Improving search with neural ranking methods
Improving search with neural ranking methodsImproving search with neural ranking methods
Improving search with neural ranking methodsvoginip
 
Presentation on Machine Learning and Data Mining
Presentation on Machine Learning and Data MiningPresentation on Machine Learning and Data Mining
Presentation on Machine Learning and Data Miningbutest
 
Machine Learning
Machine LearningMachine Learning
Machine Learningbutest
 
ML Basic Concepts.pdf
ML Basic Concepts.pdfML Basic Concepts.pdf
ML Basic Concepts.pdfManishaS49
 
Coursebreakup
CoursebreakupCoursebreakup
CoursebreakupPCTE
 
know Machine Learning Basic Concepts.pdf
know Machine Learning Basic Concepts.pdfknow Machine Learning Basic Concepts.pdf
know Machine Learning Basic Concepts.pdfhemangppatel
 
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010Paolo Missier
 
Searching for Interestingness in Wikipedia and Yahoo! Answers
Searching for Interestingness in Wikipedia and Yahoo! AnswersSearching for Interestingness in Wikipedia and Yahoo! Answers
Searching for Interestingness in Wikipedia and Yahoo! AnswersGabriela Agustini
 
Rules for inducing hierarchies from social tagging data
Rules for inducing hierarchies from social tagging dataRules for inducing hierarchies from social tagging data
Rules for inducing hierarchies from social tagging dataHang Dong
 

Similar to Microsoft PowerPoint - ml4textweb00 (20)

A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
 
London useR Meeting 21-Jul-09
London useR Meeting 21-Jul-09London useR Meeting 21-Jul-09
London useR Meeting 21-Jul-09
 
Lise Getoor, "
Lise Getoor, "Lise Getoor, "
Lise Getoor, "
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
 
Neo4j MeetUp - Graph Exploration with MetaExp
Neo4j MeetUp - Graph Exploration with MetaExpNeo4j MeetUp - Graph Exploration with MetaExp
Neo4j MeetUp - Graph Exploration with MetaExp
 
Standardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for PythonStandardizing on a single N-dimensional array API for Python
Standardizing on a single N-dimensional array API for Python
 
Machine Learning ebook.pdf
Machine Learning ebook.pdfMachine Learning ebook.pdf
Machine Learning ebook.pdf
 
1_5_AI_edx_ml_51intro_240204_104838machine learning lecture 1
1_5_AI_edx_ml_51intro_240204_104838machine learning lecture 11_5_AI_edx_ml_51intro_240204_104838machine learning lecture 1
1_5_AI_edx_ml_51intro_240204_104838machine learning lecture 1
 
Improving search with neural ranking methods
Improving search with neural ranking methodsImproving search with neural ranking methods
Improving search with neural ranking methods
 
Presentation on Machine Learning and Data Mining
Presentation on Machine Learning and Data MiningPresentation on Machine Learning and Data Mining
Presentation on Machine Learning and Data Mining
 
Predicting the relevance of search results for e-commerce systems
Predicting the relevance of search results for e-commerce systemsPredicting the relevance of search results for e-commerce systems
Predicting the relevance of search results for e-commerce systems
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
ML Basic Concepts.pdf
ML Basic Concepts.pdfML Basic Concepts.pdf
ML Basic Concepts.pdf
 
Coursebreakup
CoursebreakupCoursebreakup
Coursebreakup
 
know Machine Learning Basic Concepts.pdf
know Machine Learning Basic Concepts.pdfknow Machine Learning Basic Concepts.pdf
know Machine Learning Basic Concepts.pdf
 
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010
 
Icpp power ai-workshop 2018
Icpp power ai-workshop 2018Icpp power ai-workshop 2018
Icpp power ai-workshop 2018
 
Searching for Interestingness in Wikipedia and Yahoo! Answers
Searching for Interestingness in Wikipedia and Yahoo! AnswersSearching for Interestingness in Wikipedia and Yahoo! Answers
Searching for Interestingness in Wikipedia and Yahoo! Answers
 
Rules for inducing hierarchies from social tagging data
Rules for inducing hierarchies from social tagging dataRules for inducing hierarchies from social tagging data
Rules for inducing hierarchies from social tagging data
 

More from butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

More from butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Microsoft PowerPoint - ml4textweb00

  • 1. Machine Learning Methods for Text / Web Data Mining
  Byoung-Tak Zhang
  School of Computer Science and Engineering, Seoul National University
  E-mail: btzhang@cse.snu.ac.kr
  This material is available at http://scai.snu.ac.kr./~btzhang/

  Overview
  - Introduction: Web information retrieval; machine learning (ML); ML methods for text/web data mining
  - Text/web data analysis: text mining using Helmholtz machines; web mining using Bayesian networks
  - Summary: current and future work
  • 2. Web Information Retrieval
  [Diagram: text data passes through preprocessing and indexing into three downstream systems: a classification system (text classification), an information filtering system (user profile, filtered data, feedback), and a DB template filling & information extraction system that fills DB records (e.g. location, date) and answers questions against the DB.]

  Machine Learning
  - Supervised learning: estimate an unknown mapping from known input-output pairs; learn f_w from a training set D = {(x, y)} such that f_w(x) = y = f(x). Classification: y is discrete. Regression: y is continuous.
  - Unsupervised learning: only input values are provided; learn f_w from D = {(x)} such that f_w(x) = x. Density estimation; compression, clustering.
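As a minimal illustration of the supervised setting above (learning f_w from pairs D = {(x, y)}), the following toy fits a one-parameter model f_w(x) = w·x by gradient descent on squared error. The target function, data, and learning rate are invented for illustration, not taken from the talk.

```python
# Toy supervised learning: recover the unknown mapping f(x) = 2x from
# input-output pairs D = {(x, y)} with a one-parameter model f_w(x) = w * x.

def train(data, lr=0.01, epochs=200):
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            # gradient of the squared error (w*x - y)^2 with respect to w
            w -= lr * 2 * (w * x - y) * x
    return w

D = [(x, 2 * x) for x in range(1, 6)]   # known input-output pairs
w = train(D)
print(round(w, 3))   # w approaches 2, so f_w(x) matches f(x)
```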
  • 3. Machine Learning Methods
  - Neural networks: Multilayer Perceptrons (MLPs), Self-Organizing Maps (SOMs), Support Vector Machines (SVMs)
  - Probabilistic models: Bayesian Networks (BNs), Helmholtz Machines (HMs), Latent Variable Models (LVMs)
  - Other machine learning methods: Evolutionary Algorithms (EAs), Reinforcement Learning (RL), boosting algorithms, Decision Trees (DTs)

  ML for Text/Web Data Mining
  - Bayesian networks for text classification
  - Helmholtz machines for text clustering/categorization
  - Latent variable models for topic word extraction
  - Boosted learning for the TREC filtering task
  - Evolutionary learning for web document retrieval
  - Reinforcement learning for web filtering agents
  - Bayesian networks for web customer data mining
  • 4. Preprocessing for Text Learning
  [Example: a comp.graphics newsgroup post ("I need the specs, or at least a very verbose interpretation of the specs, for QuickTime. Technical articles from magazines and references to books would be nice, too. I also need the specs in a format usable on a Unix or MS-Dos system. I can't do much with the QuickTime stuff they have on...") is converted to a word-count vector: baseball 0, car 0, clinton 0, computer 0, graphics 0, hockey 0, quicktime 2, references 1, space 0, specs 3, unix 1, ...]

  Text Mining: Data Sets
  - Usenet newsgroup data: 20 categories, 1000 documents for each category, 20000 documents in total
  - TDT2 corpus: target detection and tracking (TDT), NIST; 6,169 documents used in experiments
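The preprocessing step above, turning raw text into a vector of word counts over a fixed vocabulary, can be sketched as follows. The tokenizer, vocabulary, and sample post below are simplified stand-ins for illustration.

```python
# Sketch of the bag-of-words preprocessing shown above: map a document to
# per-term counts over a fixed vocabulary.
import re

def to_count_vector(text, vocabulary):
    tokens = re.findall(r"[a-z]+", text.lower())
    return {term: tokens.count(term) for term in vocabulary}

vocab = ["baseball", "graphics", "quicktime", "specs", "unix"]
post = ("I need the specs for QuickTime. I also need the specs in a format "
        "usable on a Unix system. Technical specs would be nice.")
vector = to_count_vector(post, vocab)
print(vector)
```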
  • 5. Text Mining: Helmholtz Machine Architecture [Chang and Zhang, 2000]
  - Two layers: latent nodes h_1, ..., h_m over input nodes d_1, ..., d_n, connected by recognition weights (bottom-up) and generative weights (top-down).
  - Latent nodes (binary values) extract the underlying causal structure in the document set and capture correlations of the words in documents.
  - Input nodes (binary values) represent the existence or absence of words in documents.
  - P(h_i = 1) = 1 / (1 + exp(-b_i - Σ_{j=1}^n w_ij d_j)),  P(d_i = 1) = 1 / (1 + exp(-b_i - Σ_{j=1}^m w_ij h_j))

  Text Mining: Learning Helmholtz Machines
  - Introduce a recognition network for estimation of the generative network. The log-likelihood is bounded from below:
    log P(D | θ) = Σ_{t=1}^T log Σ_{α^(t)} P(d^(t), α^(t) | θ)
                 = Σ_{t=1}^T log Σ_{α^(t)} Q(α^(t)) [P(d^(t), α^(t) | θ) / Q(α^(t))]
                 ≥ Σ_{t=1}^T Σ_{α^(t)} Q(α^(t)) log [P(d^(t), α^(t) | θ) / Q(α^(t))]
  - Wake-sleep algorithm: train the recognition and generative models alternately; update the weights iteratively by a simple local delta rule:
    w_ij^new = w_ij^old + Δw_ij,  Δw_ij = γ s_i (s_j − p(s_j = 1))
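The wake-sleep procedure above can be sketched in a pure-Python toy with one latent layer. Layer sizes, the training "documents", and the learning rate are all illustrative; this is not the authors' implementation, only the local delta rule applied in alternating wake and sleep phases.

```python
# Toy one-latent-layer Helmholtz machine trained by wake-sleep with the
# local delta rule Delta_w = gamma * s_i * (s_j - p(s_j = 1)).
import math, random

random.seed(0)
N, M, GAMMA = 6, 2, 0.2           # visible (word) units, latent units, learning rate

def sigmoid(x): return 1.0 / (1.0 + math.exp(-x))
def sample(p): return 1.0 if random.random() < p else 0.0

R = [[0.0] * N for _ in range(M)]   # recognition weights: words -> latent causes
G = [[0.0] * M for _ in range(N)]   # generative weights: latent causes -> words
br = [0.0] * M                      # recognition biases (latent units)
bg = [0.0] * M                      # generative biases (latent units)
bv = [0.0] * N                      # generative biases (visible units)

def wake(d):
    # recognize latent causes for real data, then improve the generative model
    h = [sample(sigmoid(br[i] + sum(R[i][j] * d[j] for j in range(N)))) for i in range(M)]
    for j in range(M):
        bg[j] += GAMMA * (h[j] - sigmoid(bg[j]))
    for i in range(N):
        p = sigmoid(bv[i] + sum(G[i][j] * h[j] for j in range(M)))
        for j in range(M):
            G[i][j] += GAMMA * h[j] * (d[i] - p)    # local delta rule
        bv[i] += GAMMA * (d[i] - p)

def sleep():
    # dream a document from the generative model, then improve recognition
    h = [sample(sigmoid(bg[i])) for i in range(M)]
    d = [sample(sigmoid(bv[i] + sum(G[i][j] * h[j] for j in range(M)))) for i in range(N)]
    for i in range(M):
        q = sigmoid(br[i] + sum(R[i][j] * d[j] for j in range(N)))
        for j in range(N):
            R[i][j] += GAMMA * d[j] * (h[i] - q)
        br[i] += GAMMA * (h[i] - q)

docs = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]   # two disjoint word-usage patterns
for _ in range(300):
    wake(random.choice(docs))
    sleep()
```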
  • 6. Text Mining: Methods
  - Text categorization: train a Helmholtz machine for each category (N machines for N categories). Once the N machines have been estimated, classification of a test document proceeds by estimating the likelihood of the document under each machine:
    ĉ = argmax_{c ∈ C} [log P(d | c)]
  - Topic words extraction: train a single Helmholtz machine on the entire document set; after training, examine the weights of the connections from a latent node to the input nodes, i.e. the words.

  Text Mining: Categorization Results
  - Usenet newsgroup data: 20 categories, 1000 documents for each category, 20000 documents in total.
  [Results figure not reproduced in this transcript.]
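The classification rule ĉ = argmax_c log P(d | c) above can be sketched as follows. For compactness, each per-category "machine" is replaced here by a simple Bernoulli model over word presence; the category names, vocabulary, and probabilities are made up for illustration.

```python
# One generative model per category; classify a document by the category
# whose model gives it the highest log-likelihood.
import math

def log_likelihood(doc, model):
    # doc: word -> 0/1 presence; model: word -> P(word present | class)
    return sum(math.log(model[w] if doc[w] else 1.0 - model[w]) for w in doc)

def classify(doc, models):
    return max(models, key=lambda c: log_likelihood(doc, models[c]))

models = {
    "comp.graphics":     {"graphics": 0.8,  "hockey": 0.05, "specs": 0.6},
    "rec.sport.hockey":  {"graphics": 0.05, "hockey": 0.9,  "specs": 0.1},
}
doc = {"graphics": 1, "hockey": 0, "specs": 1}
print(classify(doc, models))
```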
  • 7. Text Mining: Topic Words Extraction Results
  TDT2 corpus, 6,169 documents. Top words per latent node:
  1. tabacco, smoking, gingrich, newt, trent, republicans, congressional, republicans, attorney, smokers, lawsuit, senate, cigarette, morris, nicotine
  2. warplane, airline, saudi, gulf, wright, soldiers, yitzhak, tanks, stealth, sabah, stations, kurds, mordechai, separatist, governor
  3. olympics, nagano, olympic, winter, medal, hockey, atheletes, cup, games, slalom, medals, bronze, skating, lillehammer, downhill
  4. netanyahu, palestinian, arafat, israeli, yasser, kofi, annan, benjamin, palestinians, mideast, gaza, jerusalem, eu, paris, israel
  5. india, pakistan, pakistani, delhi, hindu, vajpayee, nuclear, tests, atal, kashmir, indian, janata, bharatiya, islamabad, bihari
  6. suharto, habibie, demonstrators, riots, indonesians, demonstrations, soeharto, resignation, jakarta, rioting, electoral, rallies, wiranto, unrest, megawati
  7. imf, monetary, currencies, currency, rupiah, singapore, bailout, traders, markets, thailand, inflation, investors, fund, banks, baht
  8. pope, cuba, cuban, embargo, castro, lifting, cubans, havana, alan, invasion, reserve, paul, output, vatican, freedom

  Web Mining: Customer Analysis
  - KDD-2000 Web Mining Competition
  - Data: 465 features over 1700 customers; features include friend promotion rate, date visited, weight of items, price of house, discount rate, ...
  - Data was collected during Jan. 30 – March 30, 2000; the friend promotion was started from Feb. 29 with a TV advertisement.
  - Aim: description of heavy/low spenders
  • 9. Web Mining: Feature Selection
  [Table: features selected in various ways (Decision Tree, Decision Tree + Factor Analysis, Discriminant Model) [Yang & Zhang, 2000]. The column structure is not fully recoverable from this transcript; selected features include V368 (Weight Average), V13 (SendEmail), V240 (Friend), V243 (OrderLineQuantitySum), V229 (Order-Average), V234 (OrderItemQuantitySum% HavingDiscountRange(5..10)), V237 (OrderItemQuantitySum% HavingDiscountRange(10..)), V245 (OrderLineQuantityMaximum), V304 (OrderShippingAmtMin), V43 (Home Market Value), V377 (NumAccountTemplateViews), V11 (WhichDoYouWearMostFrequent), V17 (USState), V45 (VehicleLifeStyle), V68 (RetailActivity), V19 (Date), V324 (NumLegwearProductViews), V374 (NumMainTemplateViews), V412 (NumReplenishableStockViews); the discriminant factors are F1 = 0.94·V324 + 0.868·V374 + 0.898·V412, F2 = 0.829·V234 + 0.857·V240, F3 = −0.795·V237 + 0.778·V304.]

  Web Mining: Bayesian Nets
  - A Bayesian network is a DAG (directed acyclic graph) that expresses dependence relations between variables and can use prior knowledge on the data (parameters).
  - Example (nodes A, B, C, D, E): P(A,B,C,D,E) = P(A) P(B|A) P(C|B) P(D|A,B) P(E|B,C,D)
  - Examples of conjugate priors: Dirichlet for multinomial data, Normal-Wishart for normal data
  • 10. Web Mining: Results
  - A Bayesian net for the KDD web data: V229 (Order-Average) and V240 (Friend) directly influence V312 (Target); V19 (Date) was influenced by V240 (Friend), reflecting the TV advertisement.

  Summary
  - We study machine learning methods such as probabilistic neural networks, evolutionary algorithms, and reinforcement learning.
  - Application areas include text mining, web mining, and bioinformatics (not addressed in this talk).
  - Recent work focuses on probabilistic graphical models for web/text/bio data mining, including Bayesian networks, Helmholtz machines, and latent variable models.
  • 11. Bayesian Networks: Architecture
  [Diagram: nodes B, L, G, M with directed edges.]
  P(L, B, G, M) = P(L) P(B | L) P(G | L, B) P(M | L, B, G) = P(L) P(B) P(G | B) P(M | B, L)
  A Bayesian network represents the probabilistic relationships between the variables:
  P(X) = Π_{i=1}^n P(X_i | pa_i), where pa_i is the set of parent nodes of X_i.
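The factorization P(X) = Π_i P(X_i | pa_i) above can be sketched numerically. The conditional probability tables below are made-up illustrative numbers, not from the talk; the point is that the product of the local tables defines a valid joint distribution.

```python
# Chain-rule factorization for the four-variable network above:
# P(L, B, G, M) = P(L) P(B) P(G | B) P(M | B, L). All numbers are illustrative.

p_L = {True: 0.3, False: 0.7}
p_B = {True: 0.5, False: 0.5}
p_G_given_B = {True: {True: 0.8, False: 0.2},    # P(G = g | B = b), indexed [b][g]
               False: {True: 0.1, False: 0.9}}
p_M_given_BL = {(True, True): 0.9, (True, False): 0.6,   # P(M = True | b, l)
                (False, True): 0.4, (False, False): 0.05}

def joint(l, b, g, m):
    pm = p_M_given_BL[(b, l)]
    return p_L[l] * p_B[b] * p_G_given_B[b][g] * (pm if m else 1.0 - pm)

# the joint distribution sums to 1 over all 16 assignments
total = sum(joint(l, b, g, m)
            for l in (True, False) for b in (True, False)
            for g in (True, False) for m in (True, False))
print(round(total, 10))
```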
  • 12. Bayesian Networks: Applications in IR – A Simple BN for Text Classification [Hwang & Zhang, 2000]
  - A class node C is connected to term nodes t_1, t_2, ..., t_8754 (t_i: the ith term).
  - The network structure represents the naïve Bayes assumption; all nodes are binary.

  Bayesian Networks: Experimental Setup
  - Dataset: the acq dataset from Reuters-21578; 8754 terms were selected by TFIDF.
  - Training data: 8762 documents; test data: 3009 documents.
  - Parametric learning: Dirichlet prior assumptions for the network parameter distributions,
    p(θ_ij | S^h) = Dir(θ_ij | α_ij1, ..., α_ijr_i)
    The parameter distributions are updated with the training data:
    p(θ_ij | D, S^h) = Dir(θ_ij | α_ij1 + N_ij1, ..., α_ijr_i + N_ijr_i)
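The Dirichlet update above, adding observed counts N to the pseudo-counts α, amounts to smoothed parameter estimation. A minimal sketch (the counts and hyperparameters below are illustrative, not from the experiments):

```python
# Posterior over a binary term's parameter: Dir(alpha_0 + N_0, alpha_1 + N_1).
# The posterior-mean estimate of each value's probability is
# (alpha_v + N_v) / (sum of all pseudo-counts and counts).

def posterior_mean(alpha, counts):
    """alpha, counts: per-value pseudo-counts and observed counts."""
    total = sum(alpha) + sum(counts)
    return [(a + n) / total for a, n in zip(alpha, counts)]

alpha = (1, 1)        # uniform Dirichlet prior over {term absent, term present}
counts = (30, 10)     # term absent in 30 documents of the class, present in 10
theta = posterior_mean(alpha, counts)
print(theta)
```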
  • 13. Bayesian Networks: Experimental Results
  - For training data (accuracy 94.28%): positive examples: recall 96.83%, precision 75.98%; negative examples: recall 93.76%, precision 99.32%.
  - For test data (accuracy 96.51%): positive examples: recall 95.16%, precision 89.17%; negative examples: recall 96.88%, precision 98.67%.

  Latent Variable Models: Architecture [Shin & Zhang, 2000]
  A latent variable model for topic word extraction and document clustering. Maximize the log-likelihood
  L = Σ_{n=1}^N Σ_{m=1}^M n(d_n, w_m) log P(d_n, w_m)
    = Σ_{n=1}^N Σ_{m=1}^M n(d_n, w_m) log Σ_{k=1}^K P(z_k) P(w_m | z_k) P(d_n | z_k)
  - Update P(z_k), P(w_m | z_k), P(d_n | z_k) with the EM algorithm.
  • 14. Latent Variable Models: Learning
  EM (Expectation-Maximization) algorithm: an algorithm to maximize the pre-defined log-likelihood by iterating an E-step and an M-step.
  - E-step:
    P(z_k | d_n, w_m) = P(z_k) P(d_n | z_k) P(w_m | z_k) / Σ_{k=1}^K P(z_k) P(d_n | z_k) P(w_m | z_k)
  - M-step:
    P(w_m | z_k) = Σ_{n=1}^N n(d_n, w_m) P(z_k | d_n, w_m) / Σ_{m=1}^M Σ_{n=1}^N n(d_n, w_m) P(z_k | d_n, w_m)
    P(d_n | z_k) = Σ_{m=1}^M n(d_n, w_m) P(z_k | d_n, w_m) / Σ_{m=1}^M Σ_{n=1}^N n(d_n, w_m) P(z_k | d_n, w_m)
    P(z_k) = (1/R) Σ_{m=1}^M Σ_{n=1}^N n(d_n, w_m) P(z_k | d_n, w_m), where R ≡ Σ_{m=1}^M Σ_{n=1}^N n(d_n, w_m)

  Latent Variable Models: Applications in IR – Experimental Setup
  Topic words extraction and document clustering with a subset of the TREC-8 adhoc task data:
  - Documents: DTDS, FR94, FT, FBIS, LATIMES
  - Topics: 401-450 (401, 434, 439, 450 used)
  - 401: Foreign Minorities, Germany; 434: Estonia, Economy; 439: Inventions, Scientific discovery; 450: King Hussein, Peace
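The E- and M-steps above can be sketched on a tiny document-word count matrix. The matrix, number of topics, and initialization below are illustrative; the updates are the standard aspect-model EM iteration described on the slide.

```python
# EM for the latent variable model: E-step computes P(z_k | d_n, w_m);
# M-step re-estimates P(w_m | z_k), P(d_n | z_k), P(z_k) from the
# posterior-weighted counts n(d_n, w_m).
import random

random.seed(1)
n_dw = [[4, 3, 0, 0], [3, 4, 0, 1], [0, 0, 5, 4], [1, 0, 4, 5]]   # n(d_n, w_m)
N, M, K = 4, 4, 2

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

p_z = normalize([random.random() for _ in range(K)])
p_w_z = [normalize([random.random() for _ in range(M)]) for _ in range(K)]
p_d_z = [normalize([random.random() for _ in range(N)]) for _ in range(K)]

def em_step():
    global p_z, p_w_z, p_d_z
    # E-step: posterior over latent topics for every (document, word) pair
    post = [[normalize([p_z[k] * p_d_z[k][n] * p_w_z[k][m] for k in range(K)])
             for m in range(M)] for n in range(N)]
    # M-step: posterior-weighted relative frequencies
    p_w_z = [normalize([sum(n_dw[n][m] * post[n][m][k] for n in range(N))
                        for m in range(M)]) for k in range(K)]
    p_d_z = [normalize([sum(n_dw[n][m] * post[n][m][k] for m in range(M))
                        for n in range(N)]) for k in range(K)]
    p_z = normalize([sum(n_dw[n][m] * post[n][m][k]
                         for n in range(N) for m in range(M)) for k in range(K)])

for _ in range(50):
    em_step()
```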
  • 15. Latent Variable Models: Applications in IR – Experimental Results
  Cluster labels were assigned to the z_k with maximum P(d_i | z_k). Documents per cluster, precision, and recall:
  - Topic 401 (300 docs): z2 = 279, z4 = 1, z3 = 0, z1 = 20; precision 0.902, recall 0.930
  - Topic 434 (347 docs): z2 = 20, z4 = 238, z3 = 10, z1 = 79; precision 0.996, recall 0.686
  - Topic 439 (219 docs): z2 = 7, z4 = 0, z3 = 203, z1 = 9; precision 0.953, recall 0.927
  - Topic 450 (293 docs): z2 = 3, z4 = 0, z3 = 0, z1 = 290; precision 0.729, recall 0.990
  Topic words (top 35 words with highest P(w_j | z_k); excerpts):
  - Cluster 2 (z2): german, germani, mr, parti, year, foreign, people, countri, govern, asylum, polit, nation, law, minist, europ, state, immigr, democrat, wing, social, turkish, west, east, member, attack, ...
  - Cluster 4 (z4): percent, estonia, bank, state, privat, russian, year, enterprise, trade, million, trade, estonian, econom, countri, govern, compani, foreign, baltic, polish, loan, invest, fund, product, ...
  - Cluster 3 (z3): research, technology, develop, mar, materi, system, nuclear, environment, electr, process, product, power, energi, countrol, japan, pollution, structur, chemic, plant, ...
  - Cluster 1 (z1): jordan, peac, isreal, palestinian, king, isra, arab, meet, talk, husayn, agreem, presid, majesti, negoti, minist, visit, region, arafat, secur, peopl, east, washington, econom, sign, relat, jerusalem, rabin, syria, iraq, ...

  Boosting: Algorithms
  - A general method of converting rough rules into a highly accurate prediction rule.
  - Learning procedure: examine the training set; derive a rough rule (weak learner); re-weight the examples in the training set, concentrating on the hard cases for previous rules; repeat T times.
  [Diagram: importance weights of the training documents feed successive learners producing h1, h2, h3, h4, combined as f(h1, h2, h3, h4).]
  • 16. Boosting: Applied to Text Filtering
  - Naïve Bayes, a traditional algorithm for text filtering, assumes independence among terms:
    c_NB = argmax_{c_j ∈ {relevant, irrelevant}} P(c_j) P(d_i | c_j)
         = argmax_{c_j} P(c_j) Π_{k=1}^n P(w_ik | c_j)
         = argmax_{c_j} P(c_j) P(w_i1 = "our" | c_j) P(w_i2 = "approach" | c_j) ... P(w_in = "trouble" | c_j)
  - Boosting naïve Bayes: use naïve Bayes classifiers as the weak learners [Kim & Zhang, SIGIR-2000].

  Boosting: Experimental Setup
  - TREC (Text Retrieval Conference), sponsored by NIST; 50 topics per dataset.
  - TREC-7 filtering dataset: training documents: AP articles (1988), 237 MB, 79919 documents; test documents: AP articles (1989-1990), 471 MB, 162999 documents.
  - TREC-8 filtering dataset: training documents: Financial Times (1991-1992), 167 MB, 64139 documents; test documents: Financial Times (1993-1994), 382 MB, 140651 documents.
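The boosting loop described above (fit a rough rule on re-weighted examples, concentrate weight on the hard cases, combine by weighted vote) can be sketched in an AdaBoost-style toy. For brevity the weak learner here is a one-feature threshold stump rather than a naïve Bayes classifier, and the data is invented.

```python
# AdaBoost-style boosting of a rough rule (threshold stump) on weighted data.
import math

def fit_stump(xs, ys, w):
    # pick the (threshold, sign) with minimum weighted error
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if (sign if xi >= thr else -sign) != yi)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(xs, ys, rounds=5):
    n = len(xs)
    w = [1.0 / n] * n
    rules = []
    for _ in range(rounds):
        err, thr, sign = fit_stump(xs, ys, w)
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        rules.append((alpha, thr, sign))
        # re-weight: mistakes get heavier, correct examples lighter
        w = [wi * math.exp(-alpha * yi * (sign if xi >= thr else -sign))
             for xi, yi, wi in zip(xs, ys, w)]
        s = sum(w)
        w = [wi / s for wi in w]
    return rules

def predict(rules, x):
    score = sum(a * (s if x >= t else -s) for a, t, s in rules)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [-1, -1, -1, 1, -1, 1, 1, 1]    # one hard case at x = 5
rules = adaboost(xs, ys)
print([predict(rules, x) for x in xs])
```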
  • 17. Boosting: Applied to Text Filtering – Experimental Results
  Compared with the state-of-the-art text filtering systems:
  - TREC-7, averaged scaled F1: Boosting 0.474, ATT 0.461, NTT 0.452, PIRC 0.500; averaged scaled F3: Boosting 0.467, ATT 0.460, NTT 0.505, PIRC 0.509.
  - TREC-8, averaged scaled LF1: Boosting 0.717, PLT1 0.712, PLT2 0.713, PIRC 0.714; averaged scaled LF2: Boosting 0.722, CL 0.721, PIRC 0.734, Mer 0.720.

  Evolutionary Learning: Applications in IR – Web-Document Retrieval [Kim & Zhang, 2000]
  [Diagram: chromosomes encode weight vectors w_1 ... w_n over link information and HTML tags (<TITLE>, <H>, <B>, ..., <A>); retrieval performance supplies the fitness of each chromosome.]
  • 18. Evolutionary Learning: Applications in IR – Tag Weighting
  - Crossover: offspring chromosome Z from parents X and Y with z_i = (x_i + y_i) / 2, applied with probability P_c.
  - Mutation: a weight changes its value with probability P_m.
  - Truncation selection.

  Evolutionary Learning: Experimental Setup
  - Datasets: TREC-8 Web Track data (WT2g): 2 GB, 247491 web documents.
  - No. of training topics: 10; no. of test topics: 10.
  [Results figure not reproduced in this transcript.]
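The genetic operators above can be sketched as follows. The fitness function here is a made-up stand-in (distance to a fixed "ideal" tag-weight vector) rather than a retrieval run, crossover is applied per gene, and the population size and rates are illustrative.

```python
# Averaging crossover z_i = (x_i + y_i) / 2 (w.p. PC per gene), random-reset
# mutation (w.p. PM per gene), and truncation selection over tag-weight
# chromosomes. Fitness is a hypothetical stand-in for retrieval performance.
import random

random.seed(42)
N_TAGS, PC, PM, POP = 5, 0.7, 0.1, 20
IDEAL = [0.9, 0.5, 0.3, 0.2, 0.8]      # hypothetical best tag weights

def fitness(ch):
    return -sum((w - t) ** 2 for w, t in zip(ch, IDEAL))

def crossover(x, y):
    return [(xi + yi) / 2 if random.random() < PC else xi
            for xi, yi in zip(x, y)]

def mutate(ch):
    return [random.random() if random.random() < PM else w for w in ch]

pop = [[random.random() for _ in range(N_TAGS)] for _ in range(POP)]
for _ in range(40):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:POP // 2]           # truncation selection (elitist)
    pop = parents + [mutate(crossover(random.choice(parents), random.choice(parents)))
                     for _ in range(POP - len(parents))]

best = max(pop, key=fitness)
```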
  • 19. Reinforcement Learning: Basic Concept
  [Diagram: agent-environment loop: 1. the agent observes state s_t and reward r_t; 2. it takes action a_t; 3. the environment returns reward r_{t+1}; 4. and the next state s_{t+1}.]

  Reinforcement Learning: Applications in IR – Information Filtering [Seo & Zhang, 2000]
  [Diagram: the WAIR system casts filtering as this loop: 1. state_i is the user profile; 2. action_i modifies the profile (WAIR retrieves documents, calculates similarity, and filters documents for the user); 3. reward_{i+1} is the user's relevance feedback on the filtered documents; 4. state_{i+1} is the updated profile.]
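The state-action-reward loop above can be sketched with tabular Q-learning on a trivial chain environment. The environment, parameters, and learning rule below are illustrative stand-ins, not the WAIR system.

```python
# Tabular Q-learning on a 5-state chain: reward 1 only at the last state,
# so the learned greedy policy should move right.
import random

random.seed(0)
N_STATES, ALPHA, GAMMA, EPS = 5, 0.5, 0.9, 0.3
Q = {(s, a): 0.0 for s in range(N_STATES) for a in (0, 1)}   # 0 = left, 1 = right

def step(s, a):
    """Environment transition: returns (next state, reward)."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

for _ in range(500):                    # episodes, started in random states
    s = random.randrange(N_STATES)
    for _ in range(30):
        # 1. observe state; 2. choose action (epsilon-greedy);
        # 3. receive reward; 4. observe next state
        if random.random() < EPS:
            a = random.choice((0, 1))
        else:
            a = max((0, 1), key=lambda act: Q[(s, act)])
        s2, r = step(s, a)
        Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
        s = s2

policy = [max((0, 1), key=lambda act: Q[(s, act)]) for s in range(N_STATES)]
print(policy)
```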
  • 20. Reinforcement Learning: Experimental Results
  [Figures: filtering performance (%) with explicit relevance feedback and with implicit relevance feedback; plots not reproduced in this transcript.]