Machine Learning Methods
for
Text / Web Data Mining

Byoung-Tak Zhang
School of Computer Science and Engineering
Seoul National University
E-mail: btzhang@cse.snu.ac.kr

This material is available at
http://scai.snu.ac.kr./~btzhang/
Overview

Introduction
- Web Information Retrieval
- Machine Learning (ML)
- ML Methods for Text/Web Data Mining
Text/Web Data Analysis
- Text Mining Using Helmholtz Machines
- Web Mining Using Bayesian Networks
Summary
- Current and Future Work
Web Information Retrieval

[Diagram: a text-data pipeline. Text data passes through preprocessing and
indexing into three systems: a classification system (text classification),
an information filtering system (driven by a user profile, returning filtered
data and accepting relevance feedback), and an information extraction system
(template filling into DB records with fields such as Location and Date). A
question-answering path returns answers from the DB.]
Machine Learning

Supervised Learning
- Estimate an unknown mapping from known input-output pairs
- Learn f_w from a training set D = {(x, y)} such that f_w(x) = y = f(x)
- Classification: y is discrete
- Regression: y is continuous
Unsupervised Learning
- Only input values are provided
- Learn f_w from D = {(x)} such that f_w(x) = x
- Density estimation
- Compression, clustering
Machine Learning Methods

Neural Networks
- Multilayer Perceptrons (MLPs)
- Self-Organizing Maps (SOMs)
- Support Vector Machines (SVMs)
Probabilistic Models
- Bayesian Networks (BNs)
- Helmholtz Machines (HMs)
- Latent Variable Models (LVMs)
Other Machine Learning Methods
- Evolutionary Algorithms (EAs)
- Reinforcement Learning (RL)
- Boosting Algorithms
- Decision Trees (DTs)
ML for Text/Web Data Mining
Bayesian Networks for Text Classification
Helmholtz Machines for Text Clustering/Categorization
Latent Variable Models for Topic Word Extraction
Boosted Learning for TREC Filtering Task
Evolutionary Learning for Web Document Retrieval
Reinforcement Learning for Web Filtering Agents
Bayesian Networks for Web Customer Data Mining
Preprocessing for Text Learning

[Example: a Usenet posting mapped to a bag-of-words count vector.]

From: xxx@sciences.sdsu.edu
Newsgroups: comp.graphics
Subject: Need specs on Apple QT

I need the specs, or at least a very verbose interpretation of the specs,
for QuickTime. Technical articles from magazines and references to books
would be nice, too.

I also need the specs in a format usable on a Unix or MS-DOS system. I
can't do much with the QuickTime stuff they have on...

Word counts: baseball 0, car 0, clinton 0, computer 0, graphics 0,
hockey 0, quicktime 2, ..., references 1, space 0, specs 3, unix 1
Text Mining: Data Sets

Usenet Newsgroup Data
- 20 categories
- 1,000 documents per category
- 20,000 documents in total
TDT2 Corpus
- Topic detection and tracking (TDT): NIST
- Used 6,169 documents in experiments
Text Mining:
Helmholtz Machine Architecture

[Diagram: latent nodes h_1, h_2, ..., h_m connected to input nodes
d_1, d_2, ..., d_n by recognition weights (upward) and generative
weights (downward).]

P(h_i = 1) = 1 / (1 + exp(-b_i - Σ_{j=1}^{n} w_ij d_j))

P(d_i = 1) = 1 / (1 + exp(-b_i - Σ_{j=1}^{m} w_ij h_j))

[Chang and Zhang, 2000]
- Input nodes: binary values; represent the presence or absence of words
  in documents.
- Latent nodes: binary values; extract the underlying causal structure in
  the document set and capture correlations of the words in documents.
Text Mining:
Learning Helmholtz Machines

- Introduce a recognition network for estimation of the generative
  network. The log-likelihood is lower-bounded via a distribution Q over
  the latent variables α:

  log P(D | θ) = Σ_{t=1}^{T} log Σ_{α^(t)} P(d^(t), α^(t) | θ)
               = Σ_{t=1}^{T} log Σ_{α^(t)} Q(α^(t)) [P(d^(t), α^(t) | θ) / Q(α^(t))]
               ≥ Σ_{t=1}^{T} Σ_{α^(t)} Q(α^(t)) log [P(d^(t), α^(t) | θ) / Q(α^(t))]

- Wake-Sleep Algorithm
  • Train the recognition and generative models alternately.
  • Update the weights iteratively by a simple local delta rule:

    w_ij^new = w_ij^old + Δw_ij,  Δw_ij = γ s_i (s_j − p(s_j = 1))
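The local delta rule above can be sketched in Python as follows. The layer sizes, the bias update, and the learning rate are illustrative assumptions, not the exact setup of [Chang and Zhang, 2000]; the same update is applied to recognition weights in the wake phase and generative weights in the sleep phase.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def delta_rule_update(w, b, s_pre, s_post, lr=0.1):
    """One local delta-rule step, as in the wake-sleep algorithm:
    delta_w_ij = lr * s_i * (s_j - p(s_j = 1)).
    w: weight matrix (post-layer x pre-layer), b: biases of the post layer,
    s_pre / s_post: binary unit activities sampled in the current phase.
    Updating the bias alongside the weights is an assumed detail."""
    for j in range(len(s_post)):
        # probability that unit j turns on given the presynaptic activities
        p_j = sigmoid(b[j] + sum(w[j][i] * s_pre[i] for i in range(len(s_pre))))
        err = s_post[j] - p_j
        b[j] += lr * err
        for i in range(len(s_pre)):
            w[j][i] += lr * s_pre[i] * err
```

Because the rule only uses the pre- and postsynaptic activities and the unit's own activation probability, it stays local: no gradients flow through other layers.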
Text Mining: Methods

Text Categorization
- Train a Helmholtz machine for each category: N machines in total for N
  categories.
- Once the N machines have been estimated, a test document is classified
  by estimating its likelihood under each machine:

  ĉ = argmax_{c ∈ C} [log P(d | c)]

Topic Words Extraction
- Train a single Helmholtz machine on the entire document set.
- After training, examine the weights of the connections from a latent
  node to the input nodes, that is, the words.
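The categorization rule above can be sketched generically; the per-category scorers below are hypothetical stand-ins for the N trained Helmholtz machines.

```python
def classify(document, models):
    """Pick the category whose model assigns the document the highest
    log-likelihood: c_hat = argmax_{c in C} log P(d | c).
    `models` maps a category name to a log-likelihood function
    (stand-ins for the per-category generative models)."""
    return max(models, key=lambda c: models[c](document))
```

Any generative model with a tractable (or approximated) log-likelihood can be plugged in without changing the decision rule.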
Text Mining: Categorization Results

- Usenet Newsgroup Data: 20 categories, 1,000 documents per category,
  20,000 documents in total.
Text Mining: Topic Words Extraction Results

- TDT2 Corpus: 6,169 documents

1. tabacco, smoking, gingrich, newt, trent, republicans, congressional, republicans, attorney,
   smokers, lawsuit, senate, cigarette, morris, nicotine
2. warplane, airline, saudi, gulf, wright, soldiers, yitzhak, tanks, stealth, sabah, stations, kurds,
   mordechai, separatist, governor
3. olympics, nagano, olympic, winter, medal, hockey, atheletes, cup, games, slalom, medals, bronze,
   skating, lillehammer, downhill
4. netanyahu, palestinian, arafat, israeli, yasser, kofi, annan, benjamin, palestinians, mideast, gaza,
   jerusalem, eu, paris, israel
5. india, pakistan, pakistani, delhi, hindu, vajpayee, nuclear, tests, atal, kashmir, indian, janata,
   bharatiya, islamabad, bihari
6. suharto, habibie, demonstrators, riots, indonesians, demonstrations, soeharto, resignation,
   jakarta, rioting, electoral, rallies, wiranto, unrest, megawati
7. imf, monetary, currencies, currency, rupiah, singapore, bailout, traders, markets, thailand,
   inflation, investors, fund, banks, baht
8. pope, cuba, cuban, embargo, castro, lifting, cubans, havana, alan, invasion, reserve, paul, output,
   vatican, freedom
Web Mining: Customer Analysis

KDD-2000 Web Mining Competition
- Data: 465 features over 1,700 customers
  • Features include friend promotion rate, date visited, weight of items,
    price of house, discount rate, ...
  • Data was collected during Jan. 30 – March 30, 2000.
  • Friend promotion started on Feb. 29 with a TV advertisement.
- Aim: description of heavy/low spenders
Web Mining: Feature Selection

Features selected in various ways [Yang & Zhang, 2000]:

Decision Tree + Factor Analysis:
  V368 (Weight Average), V243 (OrderLineQuantitySum),
  V245 (OrderLineQuantity Maximum),
  F1 = 0.94*V324 + 0.868*V374 + 0.898*V412,
  F2 = 0.829*V234 + 0.857*V240,
  F3 = -0.795*V237 + 0.778*V304

Decision Tree:
  V13 (SendEmail),
  V234 (OrderItemQuantitySum% HavingDiscountRange(5 . 10)),
  V237 (OrderItemQuantitySum% HavingDiscountRange(10 .)),
  V240 (Friend), V243 (OrderLineQuantitySum),
  V245 (OrderLineQuantity Maximum), V304 (OrderShippingAmtMin),
  V324 (NumLegwearProductViews), V368 (Weight Average),
  V374 (NumMainTemplateViews), V412 (NumReplenishableStockViews)

Discriminant Model:
  V240 (Friend), V229 (Order-Average), V304 (OrderShippingAmtMin),
  V368 (Weight Average), V43 (Home Market Value),
  V377 (NumAcountTemplateViews), V11 (WhichDoYouWearMostFrequent),
  V13 (SendEmail), V17 (USState), V45 (VehicleLifeStyle),
  V68 (RetailActivity), V19 (Date)
Web Mining: Bayesian Nets

Bayesian network
- A DAG (directed acyclic graph)
- Expresses dependence relations between variables
- Can use prior knowledge on the data (parameters)

[Diagram: nodes A, B, C, D, E, with the dependencies read off the
factorization below.]

P(A, B, C, D, E) = P(A) P(B|A) P(C|B) P(D|A,B) P(E|B,C,D)

- Examples of conjugate priors: Dirichlet for multinomial data,
  Normal-Wishart for normal data
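The factorization on this slide can be evaluated mechanically: multiply each variable's conditional probability given its parents. A minimal sketch, using a tiny two-node network rather than the five-node example or the KDD model:

```python
def joint_probability(assignment, cpts, parents):
    """Chain-rule factorization of a Bayesian network:
    P(X) = prod_i P(X_i | pa_i).
    `parents` maps each variable to its parent list; `cpts` maps each
    variable to a function (value, parent_values) -> conditional
    probability. Structure and CPTs here are illustrative."""
    p = 1.0
    for var in parents:
        pa_vals = tuple(assignment[u] for u in parents[var])
        p *= cpts[var](assignment[var], pa_vals)
    return p
```

For the slide's network the same call would multiply five factors, one per node, each conditioned only on that node's parents.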
Web Mining: Results

A Bayesian net for the KDD web data:
- V229 (Order-Average) and V240 (Friend) directly influence V312 (Target).
- V19 (Date) was influenced by V240 (Friend), reflecting the TV
  advertisement.
Summary

We study machine learning methods, such as
- probabilistic neural networks
- evolutionary algorithms
- reinforcement learning
Application areas include
- text mining
- web mining
- bioinformatics (not addressed in this talk)
Recent work focuses on probabilistic graphical models for web/text/bio
data mining, including
- Bayesian networks
- Helmholtz machines
- latent variable models
Bayesian Networks:
Architecture

[Diagram: nodes L, B, G, M with edges B→G, B→M, L→M.]

P(L, B, G, M) = P(L) P(B | L) P(G | L, B) P(M | L, B, G)
             = P(L) P(B) P(G | B) P(M | B, L)

A Bayesian network represents the probabilistic relationships between
the variables:

P(X) = Π_{i=1}^{n} P(X_i | pa_i),  where pa_i is the set of parent nodes of X_i.
Bayesian Networks:
Applications in IR – A Simple BN for Text Classification

[Diagram: class node C with arrows to term nodes t_1, t_2, ..., t_8754.]

- C: document class
- t_i: i-th term
- The network structure represents the naïve Bayes assumption.
- All nodes are binary.

[Hwang & Zhang, 2000]
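The naïve Bayes structure above can be sketched as a small classifier: a class prior P(C) times independent term probabilities P(t_i | C). This is a generic sketch with Laplace smoothing (an assumed choice), not the exact parameterization of the experiments that follow.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Minimal naive Bayes text classifier matching the BN on this slide:
    a class node C with conditionally independent term children t_i."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = Counter(labels)          # class frequencies
        self.counts = defaultdict(Counter)    # class -> term counts
        self.totals = Counter()               # class -> total term count
        self.vocab = set()
        for doc, c in zip(docs, labels):
            for term in doc:
                self.counts[c][term] += 1
                self.totals[c] += 1
                self.vocab.add(term)
        return self

    def predict(self, doc):
        def log_posterior(c):
            lp = math.log(self.prior[c] / sum(self.prior.values()))
            for term in doc:
                # Laplace-smoothed estimate of P(t | c)
                lp += math.log((self.counts[c][term] + 1) /
                               (self.totals[c] + len(self.vocab)))
            return lp
        return max(self.classes, key=log_posterior)
```

Working in log space keeps the product of thousands of term probabilities (8,754 terms on this slide) from underflowing.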
Bayesian Networks:
Experimental Results

Dataset
- The acq dataset from Reuters-21578
- 8,754 terms were selected by TFIDF.
- Training data: 8,762 documents
- Test data: 3,009 documents
Parametric Learning
- Dirichlet prior assumptions for the network parameter distributions:

  p(θ_ij | S^h) = Dir(θ_ij | α_ij1, ..., α_ijr_i)

- Parameter distributions are updated with the training data:

  p(θ_ij | D, S^h) = Dir(θ_ij | α_ij1 + N_ij1, ..., α_ijr_i + N_ijr_i)
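The conjugate update above is just elementwise addition of observed counts to the prior pseudo-counts; a minimal sketch:

```python
def dirichlet_posterior(alphas, counts):
    """Conjugate update from this slide: a Dirichlet(alpha_1..alpha_r)
    prior over a multinomial parameter vector, combined with observed
    counts N_1..N_r, yields Dirichlet(alpha_k + N_k)."""
    return [a + n for a, n in zip(alphas, counts)]

def posterior_mean(alphas):
    """Posterior mean of a Dirichlet: alpha_k / sum(alpha)."""
    total = sum(alphas)
    return [a / total for a in alphas]
```

The prior pseudo-counts act as smoothing: with a uniform prior, unseen outcomes still get nonzero posterior-mean probability.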
Bayesian Networks:
Experimental Results

For training data (accuracy: 94.28%):

                    Recall (%)   Precision (%)
Positive examples   96.83        75.98
Negative examples   93.76        99.32

For test data (accuracy: 96.51%):

                    Recall (%)   Precision (%)
Positive examples   95.16        89.17
Negative examples   96.88        98.67
Latent Variable Models:
Architecture

[Shin & Zhang, 2000]

Maximize the log-likelihood

L = Σ_{n=1}^{N} Σ_{m=1}^{M} n(d_n, w_m) log P(d_n, w_m)
  = Σ_{n=1}^{N} Σ_{m=1}^{M} n(d_n, w_m) log Σ_{k=1}^{K} P(z_k) P(w_m | z_k) P(d_n | z_k)

- Update P(z_k), P(w_m | z_k), and P(d_n | z_k) with the EM algorithm.
- The latent variable model supports both topic-word extraction and
  document clustering.
Latent Variable Models:
Learning

EM (Expectation-Maximization) Algorithm
- An algorithm to maximize the pre-defined log-likelihood by iterating
  an E-step and an M-step.

E-step:

P(z_k | d_n, w_m) = P(z_k) P(d_n | z_k) P(w_m | z_k)
                    / Σ_{k=1}^{K} P(z_k) P(d_n | z_k) P(w_m | z_k)

M-step:

P(w_m | z_k) = Σ_{n=1}^{N} n(d_n, w_m) P(z_k | d_n, w_m)
               / Σ_{m=1}^{M} Σ_{n=1}^{N} n(d_n, w_m) P(z_k | d_n, w_m)

P(d_n | z_k) = Σ_{m=1}^{M} n(d_n, w_m) P(z_k | d_n, w_m)
               / Σ_{m=1}^{M} Σ_{n=1}^{N} n(d_n, w_m) P(z_k | d_n, w_m)

P(z_k) = (1/R) Σ_{m=1}^{M} Σ_{n=1}^{N} n(d_n, w_m) P(z_k | d_n, w_m),
where R ≡ Σ_{m=1}^{M} Σ_{n=1}^{N} n(d_n, w_m)
Latent Variable Models:
Applications in IR – Experimental Results

Topic words extraction and document clustering with a subset of the
TREC-8 ad hoc task data:
- Documents: DTDS, FR94, FT, FBIS, LATIMES
- Topics: 401-450 (401, 434, 439, 450 used)
  • 401: Foreign Minorities, Germany
  • 434: Estonia, Economy
  • 439: Inventions, Scientific Discovery
  • 450: King Hussein, Peace
Latent Variable Models:
Applications in IR – Experimental Results

Labels are assigned to the cluster z_k with maximum P(d_i | z_k):

Topic (#Docs)   z2    z4    z3    z1    Precision   Recall
401 (300)       279   1     0     20    0.902       0.930
434 (347)       20    238   10    79    0.996       0.686
439 (219)       7     0     203   9     0.953       0.927
450 (293)       3     0     0     290   0.729       0.990

Extracted topic words (top 35 words with highest P(w_j | z_k)):

Cluster 2 (z2): german, germani, mr, parti, year, foreign, people, countri, govern, asylum,
  polit, nation, law, minist, europ, state, immigr, democrat, wing, social, turkish, west,
  east, member, attack, ...
Cluster 4 (z4): percent, estonia, bank, state, privat, russian, year, enterprise, trade,
  million, trade, estonian, econom, countri, govern, compani, foreign, baltic, polish, loan,
  invest, fund, product, ...
Cluster 3 (z3): research, technology, develop, mar, materi, system, nuclear, environment,
  electr, process, product, power, energi, countrol, japan, pollution, structur, chemic,
  plant, ...
Cluster 1 (z1): jordan, peac, isreal, palestinian, king, isra, arab, meet, talk, husayn,
  agreem, presid, majesti, negoti, minist, visit, region, arafat, secur, peopl, east,
  washington, econom, sign, relat, jerusalem, rabin, syria, iraq, ...
Boosting:
Algorithms

A general method for converting rough rules into a highly accurate
prediction rule.

Learning procedure:
- Examine the training set.
- Derive a rough rule (weak learner).
- Re-weight the examples in the training set, concentrating on the hard
  cases for the previous rules.
- Repeat T times.

[Diagram: importance weights of the training documents feed a sequence of
learners producing hypotheses h_1, h_2, h_3, h_4, which are combined into
a final rule f(h_1, h_2, h_3, h_4).]
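The re-weighting procedure above can be sketched as AdaBoost. Decision stumps are used here as the weak learner purely for compactness; the slides that follow use naïve Bayes classifiers instead, and the exact weighting scheme of that system is not shown here.

```python
import math

def adaboost(X, y, T=10):
    """AdaBoost as outlined above: fit a weak rule, then re-weight
    examples so later rounds concentrate on the hard cases.
    X: list of feature vectors, y: labels in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []  # list of (alpha, feature, threshold, polarity)
    for _ in range(T):
        best = None
        # weak learner: exhaustive search over 1-D threshold stumps
        for f in range(len(X[0])):
            for thr in sorted({x[f] for x in X}):
                for pol in (1, -1):
                    pred = [pol if x[f] >= thr else -pol for x in X]
                    err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol, pred)
        err, f, thr, pol, pred = best
        err = max(err, 1e-10)  # avoid log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, thr, pol))
        # re-weight: misclassified examples gain weight
        w = [wi * math.exp(-alpha * yi * p) for wi, p, yi in zip(w, pred, y)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (p if x[f] >= t else -p) for a, f, t, p in ensemble)
    return 1 if score >= 0 else -1
```

The final rule f(h_1, ..., h_T) is a weighted vote of the weak hypotheses, with weights alpha determined by each round's training error.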
Boosting:
Applied to Text Filtering

Naïve Bayes
- A traditional algorithm for text filtering; assumes independence among
  terms:

  c_NB = argmax_{c_j ∈ {relevant, irrelevant}} P(c_j) P(d_i | c_j)
       = argmax_{c_j} P(c_j) Π_{k=1}^{n} P(w_ik | c_j)
       = argmax_{c_j} P(c_j) P(w_i1 = "our" | c_j) P(w_i2 = "approach" | c_j) ...
         P(w_in = "trouble" | c_j)

Boosting naïve Bayes
- Uses naïve Bayes classifiers as the weak learners.
- [Kim & Zhang, SIGIR-2000]
Boosting:
Applied to Text Filtering – Experimental Results

TREC (Text Retrieval Conference)
- Sponsored by NIST

TREC-7 filtering dataset
- Training documents: AP articles (1988), 237 MB, 79,919 documents
- Test documents: AP articles (1989–1990), 471 MB, 162,999 documents
- No. of topics: 50

TREC-8 filtering dataset
- Training documents: Financial Times (1991–1992), 167 MB, 64,139 documents
- Test documents: Financial Times (1993–1994), 382 MB, 140,651 documents
- No. of topics: 50

[Figure: example of a document.]
Boosting:
Applied to Text Filtering – Experimental Results

Compared with state-of-the-art text filtering systems:

TREC-7
  Averaged Scaled F1:  Boosting 0.474, ATT 0.461, NTT 0.452, PIRC 0.500
  Averaged Scaled F3:  Boosting 0.467, ATT 0.460, NTT 0.505, PIRC 0.509

TREC-8
  Averaged Scaled LF1: Boosting 0.717, PLT1 0.712, PLT2 0.713, PIRC 0.714
  Averaged Scaled LF2: Boosting 0.722, CL 0.721, PIRC 0.734, Mer 0.720
Evolutionary Learning:
Applications in IR – Web-Document Retrieval

[Kim & Zhang, 2000]

[Diagram: link information and HTML tag information (<TITLE>, <H>, <B>,
..., <A>) are encoded as chromosomes of tag weights (w_1, w_2, w_3, ...,
w_n); each chromosome in the population is evaluated by its retrieval
fitness.]
Evolutionary Learning:
Applications in IR – Tag Weighting

Crossover
- From chromosomes X = (x_1, ..., x_n) and Y = (y_1, ..., y_n), an
  offspring Z is produced with z_i = (x_i + y_i) / 2 with probability P_c.

Mutation
- Each value x_i of chromosome X is changed with probability P_m,
  yielding chromosome X'.

Truncation selection is used.
Evolutionary Learning:
Applications in IR – Experimental Results

Datasets
- TREC-8 Web Track data (WT2g): 2 GB, 247,491 web documents
- No. of training topics: 10; no. of test topics: 10

[Figure: retrieval results.]
Reinforcement Learning:
Basic Concept

[Diagram: the agent-environment loop. (1) The agent observes state s_t
and reward r_t, (2) takes action a_t, (3) receives reward r_{t+1}, and
(4) observes the next state s_{t+1}.]
Reinforcement Learning:
Applications in IR – Information Filtering

[Seo & Zhang, 2000]

[Diagram: the WAIR filtering agent. (1) State_i is the user profile;
(2) the action modifies the profile, after which WAIR retrieves
documents, calculates similarity, and performs document filtering;
(3) Reward_{i+1} comes from the user's relevance feedback on the
filtered documents; (4) the loop continues with State_{i+1}.]