Transcript

  • 1. Machine Learning Methods for Text / Web Data Mining
Byoung-Tak Zhang
School of Computer Science and Engineering, Seoul National University
E-mail: btzhang@cse.snu.ac.kr
This material is available at http://scai.snu.ac.kr./~btzhang/

Overview
- Introduction
  - Web Information Retrieval
  - Machine Learning (ML)
  - ML Methods for Text/Web Data Mining
- Text/Web Data Analysis
  - Text Mining Using Helmholtz Machines
  - Web Mining Using Bayesian Networks
- Summary
  - Current and Future Work
  • 2. Web Information Retrieval
[Diagram: a Web IR architecture in which text data is preprocessed and indexed, then served by three systems: a text classification system, an information extraction system (template filling into DB records such as location and date), and an information filtering system driven by a user profile, with question/answer interaction and relevance feedback on the filtered data.]

Machine Learning
- Supervised Learning
  - Estimate an unknown mapping from known input-output pairs.
  - Learn $f_w$ from a training set $D = \{(x, y)\}$ such that $f_w(x) = y = f(x)$.
  - Classification: $y$ is discrete.
  - Regression: $y$ is continuous.
- Unsupervised Learning
  - Only input values are provided.
  - Learn $f_w$ from $D = \{(x)\}$ such that $f_w(x) = x$.
  - Density estimation.
  - Compression, clustering.
A minimal code sketch of the two settings follows.
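A minimal sketch (my addition, not from the slides) contrasting the two settings; scikit-learn, the dataset, and the model choices are illustrative assumptions.

```python
# Illustrative sketch: supervised vs. unsupervised learning.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: learn f_w from pairs (x, y); y is discrete, so classification.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", clf.score(X, y))

# Unsupervised: only inputs x are given; clustering recovers structure in D = {(x)}.
km = KMeans(n_clusters=3, n_init=10).fit(X)
print("cluster assignments:", km.labels_[:10])
```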
  • 3. Machine Learning Methods
- Neural Networks
  - Multilayer Perceptrons (MLPs)
  - Self-Organizing Maps (SOMs)
  - Support Vector Machines (SVMs)
- Probabilistic Models
  - Bayesian Networks (BNs)
  - Helmholtz Machines (HMs)
  - Latent Variable Models (LVMs)
- Other Machine Learning Methods
  - Evolutionary Algorithms (EAs)
  - Reinforcement Learning (RL)
  - Boosting Algorithms
  - Decision Trees (DTs)

ML for Text/Web Data Mining
- Bayesian Networks for Text Classification
- Helmholtz Machines for Text Clustering/Categorization
- Latent Variable Models for Topic Word Extraction
- Boosted Learning for the TREC Filtering Task
- Evolutionary Learning for Web Document Retrieval
- Reinforcement Learning for Web Filtering Agents
- Bayesian Networks for Web Customer Data Mining
  • 4. Preprocessing for Text Learning
A document is converted to a word-count vector (see the sketch after this slide). Example message:

  From: xxx@sciences.sdsu.edu
  Newsgroups: comp.graphics
  Subject: Need specs on Apple QT
  I need the specs, or at least a very verbose interpretation of the specs, for QuickTime. Technical articles from magazines and references to books would be nice, too. I also need the specs in a format usable on a Unix or MS-DOS system. I can't do much with the QuickTime stuff they have on ...

Resulting (partial) count vector: baseball 0, car 0, clinton 0, computer 0, graphics 0, hockey 0, quicktime 2, references 1, space 0, specs 3, unix 1, ...

Text Mining: Data Sets
- Usenet Newsgroup Data
  - 20 categories.
  - 1,000 documents per category.
  - 20,000 documents in total.
- TDT2 Corpus
  - Topic detection and tracking (TDT): NIST.
  - 6,169 documents used in the experiments.
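A minimal bag-of-words sketch (my addition; the slides do not name a tool, so CountVectorizer and the fixed vocabulary are assumptions):

```python
# Turn documents into word-count vectors as on the slide.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I need the specs for QuickTime ... specs ... specs ... unix ... references"]
vocab = ["baseball", "car", "clinton", "computer", "graphics",
         "hockey", "quicktime", "references", "space", "specs", "unix"]

vec = CountVectorizer(vocabulary=vocab, lowercase=True)
counts = vec.transform(docs).toarray()[0]
for word, n in zip(vocab, counts):
    print(f"{n} {word}")
```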
  • 5. Text Mining: Helmholtz Machine Architecture [Chang and Zhang, 2000]
A two-layer network of latent nodes $h_1, \ldots, h_m$ over input nodes $d_1, \ldots, d_n$, with recognition weights (input to latent) and generative weights (latent to input):

$$P(h_i = 1) = \frac{1}{1 + \exp\left(-b_i - \sum_{j=1}^{n} w_{ij} d_j\right)}, \qquad P(d_i = 1) = \frac{1}{1 + \exp\left(-b_i - \sum_{j=1}^{m} w_{ij} h_j\right)}$$

- Latent nodes
  - Binary values.
  - Extract the underlying causal structure in the document set.
  - Capture correlations of the words in documents.
- Input nodes
  - Binary values.
  - Represent the existence or absence of words in documents.

Text Mining: Learning Helmholtz Machines
- Introduce a recognition network for estimation of the generative network. The log-likelihood is lower-bounded by

$$\log P(D \mid \theta) = \sum_{t=1}^{T} \log \sum_{\alpha^{(t)}} P(d^{(t)}, \alpha^{(t)} \mid \theta) = \sum_{t=1}^{T} \log \sum_{\alpha^{(t)}} Q(\alpha^{(t)}) \frac{P(d^{(t)}, \alpha^{(t)} \mid \theta)}{Q(\alpha^{(t)})} \geq \sum_{t=1}^{T} \sum_{\alpha^{(t)}} Q(\alpha^{(t)}) \log \frac{P(d^{(t)}, \alpha^{(t)} \mid \theta)}{Q(\alpha^{(t)})}$$

- Wake-Sleep Algorithm (a sketch follows this slide)
  - Train the recognition and generative models alternately.
  - Update the weights iteratively by a simple local delta rule:

$$w_{ij}^{\text{new}} = w_{ij}^{\text{old}} + \Delta w_{ij}, \qquad \Delta w_{ij} = \gamma\, s_i \,(s_j - p(s_j = 1))$$
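A minimal wake-sleep sketch (my addition) for a one-layer binary Helmholtz machine; the sampling scheme, shapes, and hyperparameters are my assumptions, not the exact model of [Chang and Zhang, 2000]:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, m, gamma = 20, 5, 0.05                 # visible units, latent units, learning rate
R = np.zeros((m, n)); b_r = np.zeros(m)   # recognition weights / biases
G = np.zeros((n, m)); b_g = np.zeros(n)   # generative weights / biases
b_h = np.zeros(m)                         # generative bias over latent units

def wake_sleep_step(d):
    """One wake-sleep update on a binary document vector d (shape (n,))."""
    # Wake phase: recognize latent causes of d, then improve the generative
    # model's reconstruction of d via the local delta rule.
    h = (rng.random(m) < sigmoid(R @ d + b_r)).astype(float)
    p_d = sigmoid(G @ h + b_g)
    G[:] += gamma * np.outer(d - p_d, h)      # Δw = γ s_i (s_j − p(s_j=1))
    b_g[:] += gamma * (d - p_d)
    # Sleep phase: dream a fantasy (h_f, d_f) from the generative model,
    # then improve the recognition model's recovery of h_f from d_f.
    h_f = (rng.random(m) < sigmoid(b_h)).astype(float)
    d_f = (rng.random(n) < sigmoid(G @ h_f + b_g)).astype(float)
    p_h = sigmoid(R @ d_f + b_r)
    R[:] += gamma * np.outer(h_f - p_h, d_f)
    b_r[:] += gamma * (h_f - p_h)

docs = (rng.random((100, n)) < 0.2).astype(float)  # toy binary "documents"
for epoch in range(10):
    for d in docs:
        wake_sleep_step(d)
```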
  • 6. Text Mining: Methods
- Text Categorization
  - Train one Helmholtz machine per category: N machines in total for N categories.
  - Once the N machines have been estimated, a test document is classified by estimating its likelihood under each machine (see the sketch after this slide):

$$\hat{c} = \arg\max_{c \in C} \left[\log P(d \mid c)\right]$$

- Topic Words Extraction
  - Train a single Helmholtz machine on the entire document set.
  - After training, examine the weights of the connections from a latent node to the input nodes, that is, to the words.

Text Mining: Categorization Results
- Usenet Newsgroup Data: 20 categories, 1,000 documents per category, 20,000 documents in total.
[Chart: categorization accuracy on this data; no numeric values recoverable from the transcript.]
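A short sketch (my addition) of likelihood-based categorization; `log_likelihood` is a hypothetical stand-in for a trained per-category machine's document likelihood, here modeled as per-word Bernoulli probabilities:

```python
import math

def log_likelihood(model, doc):
    """Hypothetical stand-in for log P(d | c) under one category's model."""
    return sum(math.log(p if word else 1 - p) for word, p in zip(doc, model))

def classify(models, doc):
    # Pick the category whose machine assigns the document the highest likelihood.
    return max(models, key=lambda c: log_likelihood(models[c], doc))

# Toy usage: each "model" is just per-word presence probabilities here.
models = {"comp.graphics": [0.9, 0.2, 0.1], "rec.sport.hockey": [0.1, 0.3, 0.8]}
print(classify(models, [0, 0, 1]))  # -> "rec.sport.hockey"
```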
  • 7. Text Mining: Topic Words Extraction Results
- TDT2 Corpus, 6,169 documents. Top words per latent node:
  1. tabacco, smoking, gingrich, newt, trent, republicans, congressional, republicans, attorney, smokers, lawsuit, senate, cigarette, morris, nicotine
  2. warplane, airline, saudi, gulf, wright, soldiers, yitzhak, tanks, stealth, sabah, stations, kurds, mordechai, separatist, governor
  3. olympics, nagano, olympic, winter, medal, hockey, atheletes, cup, games, slalom, medals, bronze, skating, lillehammer, downhill
  4. netanyahu, palestinian, arafat, israeli, yasser, kofi, annan, benjamin, palestinians, mideast, gaza, jerusalem, eu, paris, israel
  5. india, pakistan, pakistani, delhi, hindu, vajpayee, nuclear, tests, atal, kashmir, indian, janata, bharatiya, islamabad, bihari
  6. suharto, habibie, demonstrators, riots, indonesians, demonstrations, soeharto, resignation, jakarta, rioting, electoral, rallies, wiranto, unrest, megawati
  7. imf, monetary, currencies, currency, rupiah, singapore, bailout, traders, markets, thailand, inflation, investors, fund, banks, baht
  8. pope, cuba, cuban, embargo, castro, lifting, cubans, havana, alan, invasion, reserve, paul, output, vatican, freedom

Web Mining: Customer Analysis
- KDD-2000 Web Mining Competition
  - Data: 465 features over 1,700 customers.
    - Features include friend promotion rate, date visited, weight of items, price of house, discount rate, ...
    - Data was collected during Jan. 30 - March 30, 2000.
    - Friend promotion was started from Feb. 29 with a TV advertisement.
  - Aim: description of heavy/low spenders.
  • 8. [Slides 15-16: charts for the customer analysis; no recoverable text.]
  • 9. Web Mining: Feature Selection
Features selected in various ways [Yang & Zhang, 2000]:

Decision Tree + Factor Analysis:
- V368 (WeightAverage)
- V243 (OrderLineQuantitySum)
- V245 (OrderLineQuantityMaximum)
- F1 = 0.94*V324 + 0.868*V374 + 0.898*V412
- F2 = 0.829*V234 + 0.857*V240
- F3 = -0.795*V237 + 0.778*V304

Decision Tree:
- V13 (SendEmail)
- V234 (OrderItemQuantitySum%HavingDiscountRange(5.10))
- V237 (OrderItemQuantitySum%HavingDiscountRange(10.))
- V240 (Friend)
- V243 (OrderLineQuantitySum)
- V245 (OrderLineQuantityMaximum)
- V304 (OrderShippingAmtMin)
- V324 (NumLegwearProductViews)
- V368 (WeightAverage)
- V374 (NumMainTemplateViews)
- V412 (NumReplenishableStockViews)

Discriminant Model:
- V240 (Friend)
- V229 (Order-Average)
- V304 (OrderShippingAmtMin)
- V368 (WeightAverage)
- V43 (HomeMarketValue)
- V377 (NumAccountTemplateViews)
- V11 (WhichDoYouWearMostFrequent)
- V13 (SendEmail)
- V17 (USState)
- V45 (VehicleLifeStyle)
- V68 (RetailActivity)
- V19 (Date)

Web Mining: Bayesian Nets
- A Bayesian network is a DAG (directed acyclic graph) that
  - expresses dependence relations between variables, and
  - can use prior knowledge on the data (parameters).
- Example over nodes A, B, C, D, E (a sketch follows this slide):

$$P(A, B, C, D, E) = P(A)\,P(B \mid A)\,P(C \mid B)\,P(D \mid A, B)\,P(E \mid B, C, D)$$

- Examples of conjugate priors: Dirichlet for multinomial data, Normal-Wishart for normal data.
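A small sketch (my addition) of the factorization on this slide; all probability-table numbers below are invented purely for illustration:

```python
# Compute P(A,B,C,D,E) = P(A) P(B|A) P(C|B) P(D|A,B) P(E|B,C,D)
# for binary variables, using made-up conditional probability tables.

def bernoulli(p, x):
    return p if x else 1.0 - p

P_A = 0.3
P_B_given_A = {0: 0.2, 1: 0.7}
P_C_given_B = {0: 0.5, 1: 0.9}
P_D_given_AB = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.8}
P_E_given_BCD = {(b, c, d): 0.5 + 0.1 * (b + c + d)
                 for b in (0, 1) for c in (0, 1) for d in (0, 1)}

def joint(a, b, c, d, e):
    return (bernoulli(P_A, a)
            * bernoulli(P_B_given_A[a], b)
            * bernoulli(P_C_given_B[b], c)
            * bernoulli(P_D_given_AB[(a, b)], d)
            * bernoulli(P_E_given_BCD[(b, c, d)], e))

# Sanity check: the joint sums to 1 over all 2^5 configurations.
total = sum(joint(a, b, c, d, e)
            for a in (0, 1) for b in (0, 1) for c in (0, 1)
            for d in (0, 1) for e in (0, 1))
print(total)  # -> 1.0 (up to floating-point error)
```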
  • 10. Web Mining: Results
A Bayesian net learned for the KDD web data:
- V229 (Order-Average) and V240 (Friend) directly influence V312 (Target).
- V19 (Date) was influenced by V240 (Friend), reflecting the TV advertisement.

Summary
- We study machine learning methods such as
  - probabilistic neural networks,
  - evolutionary algorithms, and
  - reinforcement learning.
- Application areas include
  - text mining,
  - web mining, and
  - bioinformatics (not addressed in this talk).
- Recent work focuses on probabilistic graphical models for web/text/bio data mining, including
  - Bayesian networks,
  - Helmholtz machines, and
  - latent variable models.
  • 11. Bayesian Networks: Architecture
Example over nodes B, L, G, M, where the second equality uses the conditional independences encoded by the graph:

$$P(L, B, G, M) = P(L)\,P(B \mid L)\,P(G \mid L, B)\,P(M \mid L, B, G) = P(L)\,P(B)\,P(G \mid B)\,P(M \mid B, L)$$

A Bayesian network represents the probabilistic relationships between the variables:

$$P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid \mathbf{pa}_i)$$

where $\mathbf{pa}_i$ is the set of parent nodes of $X_i$.
  • 12. Bayesian Networks: Applications in IR - A Simple BN for Text Classification [Hwang & Zhang, 2000]
- A class node C with term nodes $t_1, t_2, \ldots, t_{8754}$ ($t_i$: the $i$-th term).
- The network structure represents the naive Bayes assumption.
- All nodes are binary.

Bayesian Networks: Experimental Results
- Dataset
  - The acq dataset from Reuters-21578.
  - 8,754 terms were selected by TFIDF.
  - Training data: 8,762 documents.
  - Test data: 3,009 documents.
- Parametric Learning (a sketch follows this slide)
  - Dirichlet prior assumptions for the network parameter distributions:

$$p(\theta_{ij} \mid S^h) = \mathrm{Dir}(\theta_{ij} \mid \alpha_{ij1}, \ldots, \alpha_{ij r_i})$$

  - Parameter distributions are updated with training data:

$$p(\theta_{ij} \mid D, S^h) = \mathrm{Dir}(\theta_{ij} \mid \alpha_{ij1} + N_{ij1}, \ldots, \alpha_{ij r_i} + N_{ij r_i})$$
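A minimal sketch (my addition) of the Dirichlet count update for one parameter: posterior pseudocounts are prior pseudocounts plus observed counts. The uniform prior and the toy counts are assumptions:

```python
import numpy as np

# Dirichlet-multinomial update for a binary term node t_i given class c:
# posterior Dir(alpha + N) from prior Dir(alpha) and observed counts N.
alpha = np.ones(2)                        # assumed uniform prior pseudocounts
N = np.array([40, 10])                    # toy counts: t_i absent / present, given c

posterior = alpha + N                     # Dir(α_1 + N_1, ..., α_r + N_r)
theta_hat = posterior / posterior.sum()   # posterior-mean estimate of P(t_i | c)
print(theta_hat)                          # -> [0.788..., 0.211...]
```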
  • 13. Bayesian Networks: Experimental Results
For training data (accuracy: 94.28%):

                       Recall (%)   Precision (%)
   Positive examples   96.83        75.98
   Negative examples   93.76        99.32

For test data (accuracy: 96.51%):

                       Recall (%)   Precision (%)
   Positive examples   95.16        89.17
   Negative examples   96.88        98.67

Latent Variable Models: Architecture [Shin & Zhang, 2000]
A latent variable model for topic words extraction and document clustering. Maximize the log-likelihood

$$L = \sum_{n=1}^{N} \sum_{m=1}^{M} n(d_n, w_m) \log P(d_n, w_m) = \sum_{n=1}^{N} \sum_{m=1}^{M} n(d_n, w_m) \log \sum_{k=1}^{K} P(z_k)\, P(w_m \mid z_k)\, P(d_n \mid z_k)$$

- Document clustering and topic-words extraction.
- Update $P(z_k)$, $P(w_m \mid z_k)$, $P(d_n \mid z_k)$ with the EM algorithm.
  • 14. Latent Variable Models: Learning
EM (Expectation-Maximization) Algorithm
- An algorithm to maximize the pre-defined log-likelihood by iterating an E-step and an M-step (a code sketch follows this slide).
- E-step:

$$P(z_k \mid d_n, w_m) = \frac{P(z_k)\, P(d_n \mid z_k)\, P(w_m \mid z_k)}{\sum_{k'=1}^{K} P(z_{k'})\, P(d_n \mid z_{k'})\, P(w_m \mid z_{k'})}$$

- M-step:

$$P(w_m \mid z_k) = \frac{\sum_{n=1}^{N} n(d_n, w_m)\, P(z_k \mid d_n, w_m)}{\sum_{m'=1}^{M} \sum_{n=1}^{N} n(d_n, w_{m'})\, P(z_k \mid d_n, w_{m'})}, \qquad P(d_n \mid z_k) = \frac{\sum_{m=1}^{M} n(d_n, w_m)\, P(z_k \mid d_n, w_m)}{\sum_{m=1}^{M} \sum_{n'=1}^{N} n(d_{n'}, w_m)\, P(z_k \mid d_{n'}, w_m)}$$

$$P(z_k) = \frac{1}{R} \sum_{m=1}^{M} \sum_{n=1}^{N} n(d_n, w_m)\, P(z_k \mid d_n, w_m), \qquad R \equiv \sum_{m=1}^{M} \sum_{n=1}^{N} n(d_n, w_m)$$

Latent Variable Models: Applications in IR - Experimental Results
Topic words extraction and document clustering with a subset of the TREC-8 data.
- TREC-8 ad hoc task data
  - Documents: DTDS, FR94, FT, FBIS, LATIMES
  - Topics: 401-450 (401, 434, 439, 450 used)
  - 401: Foreign Minorities, Germany
  - 434: Estonia, Economy
  - 439: Inventions, Scientific Discovery
  - 450: King Hussein, Peace
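A compact EM sketch (my addition) for this aspect model in numpy; dimensions, iteration count, and initialization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def plsa_em(counts, K, iters=50):
    """counts: (N, M) term-count matrix n(d_n, w_m); K latent topics z_k."""
    N, M = counts.shape
    P_z = np.full(K, 1.0 / K)
    P_d_z = rng.random((K, N)); P_d_z /= P_d_z.sum(axis=1, keepdims=True)
    P_w_z = rng.random((K, M)); P_w_z /= P_w_z.sum(axis=1, keepdims=True)
    R = counts.sum()
    for _ in range(iters):
        # E-step: P(z_k | d_n, w_m), shape (K, N, M)
        post = P_z[:, None, None] * P_d_z[:, :, None] * P_w_z[:, None, :]
        post /= post.sum(axis=0, keepdims=True) + 1e-12
        # M-step: re-estimate P(w|z), P(d|z), P(z) from expected counts
        weighted = counts[None, :, :] * post           # n(d,w) P(z|d,w)
        P_w_z = weighted.sum(axis=1)
        P_w_z /= P_w_z.sum(axis=1, keepdims=True)
        P_d_z = weighted.sum(axis=2)
        P_d_z /= P_d_z.sum(axis=1, keepdims=True)
        P_z = weighted.sum(axis=(1, 2)) / R
    return P_z, P_d_z, P_w_z

# Toy usage: 6 documents over a 5-word vocabulary, 2 topics.
counts = rng.integers(0, 5, size=(6, 5)).astype(float)
P_z, P_d_z, P_w_z = plsa_em(counts, K=2)
```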
  • 15. Latent Variable Models: Applications in IR - Experimental Results
Cluster labels are assigned to the $z_k$ with maximum $P(d_i \mid z_k)$:

   Topic (#Docs)   z2    z4    z3    z1    Precision   Recall
   401 (300)       279   1     0     20    0.902       0.930
   434 (347)       20    238   10    79    0.996       0.686
   439 (219)       7     0     203   9     0.953       0.927
   450 (293)       3     0     0     290   0.729       0.990

Topic words extracted (top 35 words with highest $P(w_j \mid z_k)$; excerpts):
- Cluster 2 (z2): german, germani, mr, parti, year, foreign, people, countri, govern, asylum, polit, nation, law, minist, europ, state, immigr, democrat, wing, social, turkish, west, east, member, attack, ...
- Cluster 4 (z4): percent, estonia, bank, state, privat, russian, year, enterprise, trade, million, trade, estonian, econom, countri, govern, compani, foreign, baltic, polish, loan, invest, fund, product, ...
- Cluster 3 (z3): research, technology, develop, mar, materi, system, nuclear, environment, electr, process, product, power, energi, countrol, japan, pollution, structur, chemic, plant, ...
- Cluster 1 (z1): jordan, peac, isreal, palestinian, king, isra, arab, meet, talk, husayn, agreem, presid, majesti, negoti, minist, visit, region, arafat, secur, peopl, east, washington, econom, sign, relat, jerusalem, rabin, syria, iraq, ...

Boosting: Algorithms
- A general method for converting rough rules into a highly accurate prediction rule.
- Learning procedure (a sketch follows this slide):
  - Examine the training set.
  - Derive a rough rule (weak learner).
  - Re-weight the examples in the training set, concentrating on the hard cases for the previous rules.
  - Repeat T times.
- Importance weights on the training documents guide each successive learner $h_1, h_2, h_3, h_4, \ldots$; the final prediction combines the weak hypotheses, e.g. $f(h_1, h_2, h_3, h_4)$.
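A minimal AdaBoost-style sketch (my addition; the slides describe boosting generically, and one-feature decision stumps here are an illustrative choice of weak learner):

```python
import numpy as np

def adaboost(X, y, T=10):
    """AdaBoost with one-feature threshold stumps. Labels y in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)             # importance weights over examples
    ensemble = []                       # list of (alpha, feature, thresh, sign)
    for _ in range(T):
        best = None
        # Derive a rough rule: the stump with the lowest weighted error.
        for j in range(X.shape[1]):
            for thresh in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] > thresh, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thresh, sign, pred)
        err, j, thresh, sign, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        # Re-weight: concentrate on the hard cases for previous rules.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        ensemble.append((alpha, j, thresh, sign))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, j, t, s in ensemble)
    return np.sign(score)
```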
  • 16. Boosting: Applied to Text Filtering
Naive Bayes
- The traditional algorithm for text filtering; assume independence among terms:

$$\hat{c}_{NB} = \arg\max_{c_j \in \{\text{relevant},\, \text{irrelevant}\}} P(c_j)\, P(d_i \mid c_j) = \arg\max_{c_j} P(c_j) \prod_{k=1}^{n} P(w_{ik} \mid c_j)$$

  e.g. $\arg\max_{c_j} P(c_j)\, P(w_{i1} = \text{"our"} \mid c_j)\, P(w_{i2} = \text{"approach"} \mid c_j) \cdots P(w_{in} = \text{"trouble"} \mid c_j)$.

Boosting naive Bayes
- Uses naive Bayes classifiers as weak learners [Kim & Zhang, SIGIR-2000].

Boosting: Applied to Text Filtering - Experimental Results
TREC (Text Retrieval Conference) filtering datasets, sponsored by NIST; 50 topics each.
- TREC-7 filtering dataset
  - Training documents: AP articles (1988), 237 MB, 79,919 documents.
  - Test documents: AP articles (1989-1990), 471 MB, 162,999 documents.
- TREC-8 filtering dataset
  - Training documents: Financial Times (1991-1992), 167 MB, 64,139 documents.
  - Test documents: Financial Times (1993-1994), 382 MB, 140,651 documents.
[Figure: example of a document.]
  • 17. Boosting: Applied to Text Filtering - Experimental Results
Compared with state-of-the-art text filtering systems:

   TREC-7:
     Averaged Scaled F1:   Boosting 0.474 | ATT 0.461 | NTT 0.452 | PIRC 0.500
     Averaged Scaled F3:   Boosting 0.467 | ATT 0.460 | NTT 0.505 | PIRC 0.509

   TREC-8:
     Averaged Scaled LF1:  Boosting 0.717 | PLT1 0.712 | PLT2 0.713 | PIRC 0.714
     Averaged Scaled LF2:  Boosting 0.722 | CL 0.721 | PIRC 0.734 | Mer 0.720

Evolutionary Learning: Applications in IR - Web-Document Retrieval [Kim & Zhang, 2000]
[Diagram: chromosomes encode a weight $w_i$ for each HTML tag (<TITLE>, <H>, <B>, ..., <A>); link information and HTML tags feed retrieval, and retrieval performance supplies the fitness of each chromosome $(w_1, w_2, w_3, \ldots, w_n)$.]
  • 18. Evolutionary Learning: Applications in IR - Tag Weighting
- Crossover: from chromosomes $X = (x_1, \ldots, x_n)$ and $Y = (y_1, \ldots, y_n)$, produce offspring Z with $z_i = (x_i + y_i)/2$ with probability $P_c$.
- Mutation: each value of chromosome X is changed with probability $P_m$.
- Truncation selection (a code sketch follows this slide).

Evolutionary Learning: Applications in IR - Experimental Results
- Datasets
  - TREC-8 Web Track data (WT2g): 2 GB, 247,491 web documents.
  - No. of training topics: 10; no. of test topics: 10.
- Results
[Chart: retrieval performance; no numeric values recoverable from the transcript.]
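A short sketch (my addition) of the operators described on this slide; the population size, rates, and the fitness stub are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
P_c, P_m = 0.7, 0.01     # assumed crossover / mutation probabilities

def crossover(x, y):
    # Arithmetic crossover: z_i = (x_i + y_i) / 2 with probability P_c.
    mask = rng.random(len(x)) < P_c
    return np.where(mask, (x + y) / 2.0, x)

def mutate(x):
    # Change each tag weight with probability P_m.
    mask = rng.random(len(x)) < P_m
    return np.where(mask, rng.random(len(x)), x)

def truncation_select(pop, fitness, frac=0.5):
    # Truncation selection: keep only the top `frac` chromosomes as parents.
    k = max(1, int(len(pop) * frac))
    order = np.argsort(fitness)[::-1]
    return pop[order[:k]]

# Toy generation step: chromosomes are per-tag weight vectors.
pop = rng.random((20, 8))                      # 20 chromosomes, 8 HTML tags
fitness = pop.sum(axis=1)                      # stand-in for retrieval fitness
parents = truncation_select(pop, fitness)
children = np.array([mutate(crossover(*rng.choice(parents, 2)))
                     for _ in range(len(pop))])
```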
  • 19. Reinforcement Learning: Basic Concept
The agent-environment loop: (1) the agent observes state $s_t$ and reward $r_t$, (2) takes action $a_t$, (3) receives reward $r_{t+1}$ from the environment, and (4) observes the next state $s_{t+1}$.

Reinforcement Learning: Applications in IR - Information Filtering [Seo & Zhang, 2000]
WAIR retrieves documents and calculates similarity (a sketch follows this slide):
1. State_i: the user profile.
2. Action_i: modify the profile.
3. Reward_{i+1}: relevance feedback from the user on the filtered documents.
4. State_{i+1}: the updated profile; document filtering then repeats.
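A rough sketch (my addition) of reinforcement-style profile adaptation; the update rule, learning rate, and threshold are my assumptions, not WAIR's exact formulation:

```python
import numpy as np

def update_profile(profile, doc_vec, reward, beta=0.1):
    """Assumed update: move profile term weights toward documents the user
    rewarded (positive feedback) and away from rejected ones (negative)."""
    return profile + beta * reward * (doc_vec - profile)

def similarity(profile, doc_vec):
    # Cosine similarity between profile and document term vectors.
    denom = np.linalg.norm(profile) * np.linalg.norm(doc_vec) + 1e-12
    return float(profile @ doc_vec) / denom

# Toy filtering loop: filter documents by similarity, then learn from feedback.
rng = np.random.default_rng(0)
profile = rng.random(16)
for _ in range(5):
    doc = rng.random(16)
    if similarity(profile, doc) > 0.5:                  # filtering step
        reward = 1.0 if rng.random() < 0.5 else -1.0    # simulated user feedback
        profile = update_profile(profile, doc, reward)
```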
  • 20. Reinforcement Learning: Experimental Results
[Charts: filtering performance (%) with explicit feedback and with implicit feedback; no numeric values recoverable from the transcript.]