Supervised learning for text
Organizing knowledge
 Systematic knowledge structures
 Ontologies
• Dewey decimal system, the Library of
Congress catalog, the AMS Mathematics
Subject
• Classification, and the US Patent subject
classification
 Web catalogs
• Yahoo & Dmoz
 Problem: Manual maintenance
Topic Tagging
 Finding similar documents
 Guiding queries
 Naïve Approach:
• Syntactic similarity between documents
 Better approach
• Topic tagging
Topic Tagging
 Advantages
• Increase vocabulary of classes
• Hierarchical visualization and browsing aids
 Applications
• Email/Bookmark organization
• News Tracking
• Tracking authors of anonymous texts
 E.g.: The Flesch-Kincaid index
• classify the purpose of hyperlinks.
Supervised learning
 Learning to assign objects to classes
given examples
 Learner (classifier)
A typical supervised text learning scenario.
Differences with text
 ML classification techniques were developed for structured data
 Text: lots of features and lots of noise
 No fixed number of columns
 No categorical attribute values
 Data scarcity
 Larger number of class labels
 Hierarchical relationships between classes are less systematic than in structured data
Techniques
 Nearest Neighbor Classifier
• Lazy learner: remember all training instances
• Decision on test document: distribution of labels on
the training documents most similar to it
• Assigns large weights to rare terms
 Feature selection
• removes terms in the training documents which are
statistically uncorrelated with the class labels,
 Bayesian classifier
• Fit a generative term distribution Pr(d|c) to each class
c of documents {d}.
• Testing: The distribution most likely to have generated
a test document is used to label it.
Other Classifiers
 Maximum entropy classifier:
• Estimate the conditional distribution Pr(c|d) directly, from the term space
to the probability of the various classes.
 Support vector machines:
• Represent classes by numbers
• Construct a direct function from term space to the
class variable.
 Rule induction:
• Induce rules for classification over diverse features
• E.g.: information from ordinary terms, the structure of
the HTML tag tree in which terms are embedded, link
neighbors, citations
Other Issues
 Tokenization
• E.g.: replacing monetary amounts by a
special token
 Evaluating text classifier
• Accuracy
• Training speed and scalability
• Simplicity, speed, and scalability for document
modifications
• Ease of diagnosis, interpretation of results, and ease of
adding human judgment and feedback (subjective criteria)
Benchmarks for accuracy
 Reuters
• 10700 labeled documents
• 10% documents with multiple class labels
 OHSUMED
• 348566 abstracts from medical journals
 20NG
• 18800 labeled USENET postings
• 20 leaf classes, 5 root level classes
 WebKB
• 8300 documents in 7 academic categories.
 Industry
• 10000 home pages of companies from 105 industry
sectors
• Shallow hierarchies of sector names
Measures of accuracy
 Assumptions
• Each document is associated with exactly one
class.
OR
• Each document is associated with a subset of
classes.
 Confusion matrix (M)
• For more than 2 classes
• M[i,j]: number of test documents belonging
to class i which were assigned to class j
• Perfect classifier: only the diagonal elements M[i,i]
would be nonzero
Evaluating classifier accuracy
 Two-way ensemble
• To avoid searching over the power-set of class labels in the
subset scenario
• Create a positive and a negative class for each label
(e.g., "Sports" and "Not sports", the latter comprising all remaining
documents)
 Recall and precision
• 2 × 2 contingency matrix M_{d,c} per (d,c) pair
(C_d is the set of classes associated with d):
  M_{d,c}[0,0] = 1 if c ∈ C_d and the classifier outputs c, else 0
  M_{d,c}[0,1] = 1 if c ∈ C_d and the classifier does not output c, else 0
  M_{d,c}[1,0] = 1 if c ∉ C_d and the classifier outputs c, else 0
  M_{d,c}[1,1] = 1 if c ∉ C_d and the classifier does not output c, else 0
Evaluating classifier accuracy
(contd.)
• Micro-averaged contingency matrix: sum the per-pair matrices over all (d,c),
  M = Σ_{d,c} M_{d,c}
• Micro-averaged precision and recall
 Equal importance for each document:
  precision(M) = M[0,0] / (M[0,0] + M[1,0])
  recall(M) = M[0,0] / (M[0,0] + M[0,1])
• Macro-averaged precision and recall
 Equal importance for each class: form a per-class matrix M_c = Σ_d M_{d,c},
compute precision(M_c) and recall(M_c) as above, and average over the |C| classes, e.g.
  precision = (1/|C|) Σ_c precision(M_c)
Evaluating classifier accuracy
(contd.)
• Precision–recall tradeoff
 Plot of precision vs. recall: the better classifier's curve dominates (lies higher)
 Harmonic mean: discard classifiers that sacrifice one measure for the other (see the sketch below):
  F1 = 2 · precision · recall / (precision + recall)
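These measures are straightforward to compute. A minimal Python sketch (names such as prf and evaluate are illustrative, not from the book) that derives micro- and macro-averaged precision, recall, and F1 from per-(document, class) outcomes:

from collections import Counter

def prf(tp, fp, fn):
    # precision = M[0,0]/(M[0,0]+M[1,0]); recall = M[0,0]/(M[0,0]+M[0,1])
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def evaluate(true_labels, predicted_labels, classes):
    per_class = {c: Counter() for c in classes}
    for Cd, pred in zip(true_labels, predicted_labels):
        for c in classes:
            if c in Cd and c in pred:
                per_class[c]["tp"] += 1
            elif c in Cd and c not in pred:
                per_class[c]["fn"] += 1
            elif c not in Cd and c in pred:
                per_class[c]["fp"] += 1
    # micro: sum the per-class contingency counts, then compute P/R/F1 once
    tot = Counter()
    for m in per_class.values():
        tot.update(m)
    micro = prf(tot["tp"], tot["fp"], tot["fn"])
    # macro: compute P/R/F1 per class, then average with equal class weight
    per = [prf(m["tp"], m["fp"], m["fn"]) for m in per_class.values()]
    macro = tuple(sum(x) / len(per) for x in zip(*per))
    return micro, macro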
Nearest Neighbor classifiers
 Intuition
• similar documents are expected to be
assigned the same class label.
• Vector space model + cosine similarity
• Training:
 Index each document and remember class label
• Testing:
 Fetch the k most similar training documents to the given
document
– Majority class wins
– Alternative: weighted counts – counts of classes
weighted by the corresponding similarity measure
– Alternative: per-class offset b_c, tuned by testing
the classifier on a portion of training data held out for this
purpose (a sketch follows below)
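A minimal sketch of such a k-NN classifier over sparse term-weight vectors (cosine similarity, similarity-weighted votes, optional per-class offsets); the function names are illustrative:

import math
from collections import defaultdict

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(train, test_doc, k=5, offsets=None):
    # train: list of (term -> weight dict, label); similarity-weighted vote
    sims = sorted(((cosine(test_doc, d), c) for d, c in train), reverse=True)
    score = defaultdict(float)
    for s, c in sims[:k]:
        score[c] += s                     # weighted count instead of 0/1 vote
    if offsets:                           # optional per-class offset b_c
        for c, b in offsets.items():
            score[c] += b
    return max(score, key=score.get)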
Nearest neighbor classification
Pros
 Easy availability and reuse of the inverted
index
 Collection updates trivial
 Accuracy comparable to best known
classifiers
Cons
 Iceberg category questions: classifying a test document dq
• involves as many inverted index lookups as
there are distinct terms in dq,
• scoring the (possibly large number of)
candidate documents which overlap with dq in
at least one word,
• sorting by overall similarity, and
• picking the best k documents
 Space overhead and redundancy
• Data stored at level of individual documents
• No distillation
Workarounds
 To reduce space requirements and
speed up classification
• Find clusters in the data
• Store only a few statistical parameters per
cluster.
• Compare with documents in only the most
promising clusters.
 Again….
• Ad-hoc choices for number and size of
clusters and parameters.
• k is corpus sensitive
TF-IDF
 TF-IDF done for whole corpus
 Interclass correlations and term
frequencies unaccounted for
 Terms which occur relatively frequently in
some classes compared to others should
have higher importance
 Overall rarity in the corpus is not as
important.
Feature selection
 Data sparsity:
• Term distributions could be estimated well if the training
set were much larger than the test set
• Not the case in practice…
• A joint distribution over a vocabulary W has on the order of
2^{|W|} possible term combinations — vastly more than the
number of training documents
• For Reuters, only about 10,300 documents are
available
 Over-fitting problem
• A joint distribution may fit the training instances…
• but may not fit unforeseen test data that well
Marginals rather than joint
 Marginal distribution of each term in each
class
 Empirical distributions may not still reflect
actual distributions if data is sparse
 Therefore feature selection
• Purposes:
 Improve accuracy by avoiding over fitting
 maintain accuracy while discarding as many
features as possible to save a great deal of space
for storing statistics
• Heuristic, guided by linguistic and domain
knowledge, or statistical.
Feature selection
 Perfect feature selection
• goal-directed
• pick all possible subsets of features,
• for each subset, train and test a classifier,
• retain the subset which results in the highest accuracy
• COMPUTATIONALLY INFEASIBLE
 Simple heuristics
• Discard stop words like "a", "an", "the", etc.
• Use empirically chosen thresholds (task and corpus sensitive) to discard
"too frequent" or "too rare" terms
 Larger and complex data sets
• Confusion with stop words
• Especially for topic hierarchies
 Greedy inclusion (bottom up) vs. top-down
Greedy inclusion algorithm
 Most commonly used in text
 Algorithm:
1. Compute, for each term, a measure of
discrimination amongst classes.
2. Arrange the terms in decreasing order of this
measure.
3. Retain a number of the best terms or features for
use by the classifier.
• Greedy because
• the measure of discrimination of a term is computed
independently of other terms
• Over-inclusion: mild effects on accuracy
Measure of discrimination
• Dependent on
• model of documents
• desired speed of training
• ease of updates to documents and class
assignments.
• Observations
• sets included for acceptable accuracy tend to
have large overlap.
The χ² test
• Similar to the likelihood ratio test
• Build a 2 × 2 contingency matrix per class–term pair:
  k_{i,1} = number of documents in class i containing term t
  k_{i,0} = number of documents in class i not containing term t
 Under the independence hypothesis
• χ² aggregates the deviations of observed values
from expected values
• The larger the value of χ², the lower our belief
that the independence assumption is upheld
by the observed data
The χ² test (contd.)
• Feature selection process
• Sort terms in decreasing order of their χ²
values
• Train several classifiers with varying numbers of
features
• Stop at the point of maximum accuracy
• With n training documents and cell counts k_{l,m} (class indicator l, term indicator m):
  χ² = Σ_{l,m} (k_{l,m} − n Pr(C = l) Pr(I_t = m))² / (n Pr(C = l) Pr(I_t = m))
which for the 2 × 2 case simplifies to
  χ² = n (k_{11} k_{00} − k_{10} k_{01})² / ((k_{11}+k_{10})(k_{01}+k_{00})(k_{11}+k_{01})(k_{10}+k_{00}))
(see the sketch below)
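A minimal sketch of scoring and ranking terms by χ² from the four cell counts (variable names are illustrative):

def chi2(k11, k10, k01, k00):
    # k11: docs in class containing t; k10: in class, missing t;
    # k01: not in class, containing t; k00: not in class, missing t
    n = k11 + k10 + k01 + k00
    num = n * (k11 * k00 - k10 * k01) ** 2
    den = (k11 + k10) * (k01 + k00) * (k11 + k01) * (k10 + k00)
    return num / den if den else 0.0

def rank_terms_by_chi2(counts):
    # counts: term -> (k11, k10, k01, k00); highest chi-square first
    return sorted(counts, key=lambda t: chi2(*counts[t]), reverse=True)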
Mutual information
• Useful when the multinomial document model is
used
• X and Y are discrete random variables taking
values x,y
• Mutual information (MI) between them is defined as
• Measure of extent of dependence between
random variables,
• Extent to which the joint deviates from the product of
the marginals
• Weighted with the distribution mass at (x; y)

x y yx
yx
yxYXMI
)Pr()Pr(
),Pr(
log),Pr(),(
Mutual Information
 Advantages
• To the extent MI(X,Y) is large, X and Y are
dependent.
• Deviations from independence at rare values of (x,y)
are played down
• Interpretations
• Reduction in the entropy of Y given X.
• MI(X; Y ) = H(X) – H(X|Y) = H(Y) – H(Y|X)
• KL distance between no-independence hypothesis
and independence hypothesis
• KL distance gives the average number of bits wasted by
encoding events from the 'correct' distribution using a code
based on a not-quite-right distribution
Feature selection with MI
• Fix a term t and let I_t be the indicator event associated
with that term.
• E.g.: for the binary model, I_t = 0/1
• Pr(I_t) = the empirical fraction of documents in
the training set in which the event I_t occurred.
• Pr(I_t, c) = the empirical fraction of training
documents which are in class c and in which the event occurred
• Pr(c) = fraction of training documents belonging
to class c
• Formula, with cell counts k_{l,m} (class indicator l, term indicator m),
k_{l,·} = Σ_m k_{l,m}, k_{·,m} = Σ_l k_{l,m}, and n training documents:
  MI(I_t, C) = Σ_{l,m} (k_{l,m}/n) log [ (k_{l,m}/n) / ((k_{l,·}/n)(k_{·,m}/n)) ]
• Problem: document lengths are not normalized (the computation is sketched below)
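A minimal sketch of the MI computation from the contingency counts k[l][m] (the helper name term_mi is illustrative):

import math

def term_mi(k):
    # k[l][m]: number of docs with class indicator l and term indicator m
    n = sum(sum(row) for row in k)
    row = [sum(r) for r in k]                       # marginals k_{l,.}
    col = [sum(k[l][m] for l in range(len(k)))      # marginals k_{.,m}
           for m in range(len(k[0]))]
    mi = 0.0
    for l in range(len(k)):
        for m in range(len(k[0])):
            if k[l][m]:
                mi += (k[l][m] / n) * math.log(
                    (k[l][m] / n) / ((row[l] / n) * (col[m] / n)))
    return mi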
Fisher's discrimination index
• Useful when documents are scaled to
constant length
• Term occurrences are regarded as
fractional real numbers.
• E.g.: the two-class case
• Let X and Y be the sets of length-normalized
document vectors corresponding to the two classes.
• Let the centroids for each class be
  μ_X = (Σ_{x∈X} x) / |X| and μ_Y = (Σ_{y∈Y} y) / |Y|
• and the covariance matrices be
  S_X = (1/|X|) Σ_{x∈X} (x − μ_X)(x − μ_X)ᵀ and S_Y = (1/|Y|) Σ_{y∈Y} (y − μ_Y)(y − μ_Y)ᵀ
Fisher's discrimination index
(contd.)
• Goal : find a projection of the data sets X and Y
on to a line such that
• the two projected centroids are far apart compared to
the spread of the point sets projected on to the same
line.
• Find a column vector α such that
 the ratio of
– the square of the difference in projected means, (αᵀ(μ_X − μ_Y))²,
– to the average projected variance, αᵀ((S_X + S_Y)/2)α,
 is maximized.
• This gives
  α* = argmax_α (αᵀ(μ_X − μ_Y))² / (αᵀ((S_X + S_Y)/2)α)
Fisher's discrimination index
• Suppose X and Y, for both the training and test data,
are generated from multivariate Gaussian
distributions
• Suppose further that S_X = S_Y = S
• Then this value of α* induces the optimal
(minimum error) classifier by suitable
thresholding on α*ᵀq for a test point q.
• Problems
• Inverting S would be unacceptably slow for
tens of thousands of dimensions.
• Linear transformations would destroy the already
existing sparsity.
Solution
• Recall:
• Goal was to eliminate terms from
consideration.
• Not to arrive at linear projections involving
multiple terms
• Regard each term t as providing a
candidate direction α_t which is parallel to
the corresponding axis in the vector space
model.
• Compute the Fisher index of each term t
FI : Solution (contd.)
• Formula
• For two class case
• Can be generalized to a set {c} of more than
two classes
• Feature selection
• Terms are sorted in decreasing order of FI(t)
• Best ones chosen as features.
• Two-class case, with α_t the direction along the axis of term t:
  FI(t) = (μ_{X,t} − μ_{Y,t})² / [ (1/|X|) Σ_{x∈X} (x_t − μ_{X,t})² + (1/|Y|) Σ_{y∈Y} (y_t − μ_{Y,t})² ]
• Generalized to a set of classes {c} with training documents D_c:
  FI(t) = Σ_{c1,c2} (μ_{c1,t} − μ_{c2,t})² / [ Σ_c (1/|D_c|) Σ_{d∈D_c} (x_{d,t} − μ_{c,t})² ]
(a sketch of the computation follows below)
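A minimal sketch of the per-term Fisher index, assuming documents are given as length-normalized term-to-weight dictionaries grouped by class:

def fisher_index(docs_by_class, term):
    # docs_by_class: class -> list of length-normalized term->weight dicts
    mu = {c: sum(d.get(term, 0.0) for d in ds) / len(ds)
          for c, ds in docs_by_class.items()}
    # between-class spread: sum of squared centroid differences
    classes = list(mu)
    between = sum((mu[c1] - mu[c2]) ** 2
                  for i, c1 in enumerate(classes) for c2 in classes[i + 1:])
    # within-class variance, averaged per class
    within = sum(sum((d.get(term, 0.0) - mu[c]) ** 2 for d in ds) / len(ds)
                 for c, ds in docs_by_class.items())
    return between / within if within else 0.0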
Validation
• How to decide a cut-off rank ?
• Validation approach
• A portion of the training documents are held out
• The rest is used to do term ranking
• The held-out set used as a test set.
• Various cut-off ranks can be tested using the
same held-out set.
• Leave-one-out cross-validation/partitioning data
into two
• An aggregate accuracy is computed over all
trials.
• Wrapper to search for the number of features
• In decreasing order of discriminative power
Validation (contd.)
• Simple search heuristic
• Keep adding one feature at every step until
the classifier's accuracy ceases to improve (sketched below).
A general illustration of wrapping for feature selection.
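A minimal sketch of this wrapper, assuming hypothetical callables train_fn (builds a classifier from a feature subset) and eval_fn (returns held-out accuracy):

def forward_select(ranked_terms, train_fn, eval_fn, step=100):
    # ranked_terms: terms sorted by decreasing discrimination measure.
    # train_fn(features) -> classifier; eval_fn(clf) -> held-out accuracy.
    best_acc, best_k = 0.0, 0
    for k in range(step, len(ranked_terms) + 1, step):
        acc = eval_fn(train_fn(ranked_terms[:k]))
        if acc <= best_acc:         # stop when accuracy ceases to improve
            break
        best_acc, best_k = acc, k
    return ranked_terms[:best_k], best_acc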
Validation (contd.)
• For naive Bayes-like classifiers,
• evaluation on many choices of feature sets
can be done at once.
• For maximum entropy or support vector
machines,
• this essentially involves training a classifier from
scratch for each choice of the cut-off rank.
• Therefore inefficient
Validation : observations
• The Bayesian classifier cannot overfit much
Effect of feature selection on Bayesian classifiers
Truncation algorithms
• Start from the complete set of terms T
1. Keep selecting terms to drop,
2. until you end up with a feature subset F
3. Question: when should you stop truncating?
• Two objectives
• minimize the size of the selected feature set F
• keep the distorted distribution Pr(C|F) as
similar as possible to the original Pr(C|T)
Truncation Algorithms: Example
• Kullback-Leibler (KL)
• Measures similarity or distance between two
distributions
• Markov blanket
• Let X be a feature in T, and let M ⊆ T − {X}
• If the presence of M renders the presence of X
unnecessary as a feature => M is a Markov blanket
for X
• Technically
• M is called a Markov blanket for X ∈ T if X is
conditionally independent of (T ∪ C) − M − {X}
given M
• Eliminating a variable because it has a Markov blanket
contained in the other existing features does not increase the KL
distance between Pr(C|T) and Pr(C|F).
Finding Markov Blankets
• Exact Markov blankets are rarely found in practice
• Finding approximate Markov blankets
• Purpose: to cut down computational
complexity
• restrict the search for Markov blankets M to those with at
most k features
• for a given feature X, restrict the members of M
to those features which are most strongly
correlated with X (using tests similar to the χ² or MI
tests)
• Example: for the Reuters dataset, over two-
thirds of T could be discarded while
increasing classification accuracy
Feature Truncation algorithm
1. while the truncated Pr(C|F) is reasonably close to the original Pr(C|T)
do
2. for each remaining feature X do
3. Identify a candidate Markov blanket M:
4. for some tuned constant k, find the set M of k
variables in F − {X} that are most strongly correlated with X
5. Estimate how good a blanket M is:
6. estimate
  Σ_{x_M, x} Pr(X_M = x_M, X = x) · KL( Pr(C | X_M = x_M, X = x) ‖ Pr(C | X_M = x_M) )
7. end for
8. Eliminate the feature having the best surviving Markov
blanket
9. end while
General observations on feature selection
• The issue of document length should be addressed
properly.
• The choice of association measure does not make a
dramatic difference
• Greedy inclusion algorithms scale nearly linearly with the
number of features
• The Markov blanket technique takes time proportional to at
least |T|^k
• Advantages of the Markov blanket algorithm over greedy
inclusion
• A greedy algorithm may include features with high individual correlations even
though one subsumes the other
• Features that are individually uncorrelated could be jointly more correlated with
the class
• This rarely happens
• The binary feature selection view may not be the only view to
subscribe to
Bayesian Learner
• Very practical text classifier
• Assumption
1. A document can belong to exactly one of a set of
classes or topics.
2. Each class c has an associated prior probability
Pr(c).
3. There is a class-conditional document distribution
Pr(d|c) for each class.
• Posterior probability
• Obtained using Bayes rule:
  Pr(c|d) = Pr(c) Pr(d|c) / Σ_{c'} Pr(c') Pr(d|c')
• The parameter set Θ consists of the parameters of all the
distributions Pr(d|c)
Parameter Estimation for Bayesian
Learner
• The estimate of Θ is based on two sources of
information:
1. Prior knowledge of the parameter set before seeing any training
documents
2. Terms in the training documents D
• Bayes optimal classifier
• Take the expectation of each parameter over
Pr(Θ|D):
  Pr(c|d, D) = ∫ Pr(c|d, Θ) Pr(Θ|D) dΘ
• Computationally infeasible
• Maximum likelihood estimate
• Replace the integral above with the value of the integrand Pr(c|d, Θ) at
Θ = argmax Pr(D|Θ)
• Works poorly
Naïve Bayes Classifier
• Naïve
• assumption of independence between terms,
• joint term distribution is the product of the
marginals.
• Widely used owing to
• simplicity and speed of training, applying, and
updating
• Two kinds of widely used marginals for
text
• Binary model
• Multinomial model
Naïve Bayes Models
• Binary model
• Each parameter θ_{c,t} indicates the probability that a
document in class c will mention term t at least once:
  Pr(d|c) = Π_{t∈d} θ_{c,t} · Π_{t∈W, t∉d} (1 − θ_{c,t})
(the second product accounts for the terms of W that do not appear in d)
• Multinomial model
• each class has an associated die with |W| faces.
• each parameter θ_{c,t} denotes the probability of face t
turning up on tossing the die.
• term t occurs n(d,t) times in document d; the
document length l_d = Σ_t n(d,t) is a random variable denoted L:
  Pr(d|c) = Pr(L = l_d | c) Pr(d | l_d, c) = Pr(L = l_d | c) · (l_d choose {n(d,t)}) · Π_{t∈d} θ_{c,t}^{n(d,t)}
Analysis of Naïve Bayes Models
1. Multiplying together a large number of small
probabilities
• Result: extremely tiny probabilities as
answers.
• Solution: store all numbers as logarithms (see the sketch below)
2. The class which comes out at the top wins by
a huge margin
• Sanitize scores using the likelihood ratio
  LR(d) = Pr(C = 1 | d) / Pr(C = −1 | d)
• squashed through the logistic (also called logit) function:
  logit(d) = 1 / (1 + e^{−LR(d)})
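A minimal sketch of log-space scoring; one common variant (assumed here, not prescribed by the slides) squashes the log likelihood ratio through the logistic function:

import math

def log_posterior(doc, log_prior, log_theta):
    # sum logs instead of multiplying many small probabilities
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            n * log_theta[c].get(t, float("-inf")) for t, n in doc.items())
    return scores

def sanitized_score(doc, log_prior, log_theta, pos=1, neg=-1):
    s = log_posterior(doc, log_prior, log_theta)
    log_lr = s[pos] - s[neg]                    # log likelihood ratio
    log_lr = max(-700.0, min(700.0, log_lr))    # avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-log_lr))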
Parameter smoothing
• What if a test document d_q contains a term t that
never occurred in any training document in class
c?
• Ans: Pr(c|d_q) will be zero
• even if many other terms clearly hint at a high
likelihood of class c generating the document.
• Bayesian estimation
• Estimating a probability from insufficient data.
 If you toss a coin n times and it always comes up heads,
what is the probability that the (n + 1)-th toss will also come up
heads?
• posit a prior distribution π(θ) on the parameter θ
 E.g.: the uniform distribution
• Resulting posterior distribution, having observed k heads in n tosses:
  π(θ | k, n) = π(θ) Pr(k | θ, n) / ∫₀¹ π(p) Pr(k | p, n) dp
Laplace Smoothing
• Based on Bayesian estimation
• Laplace's law of succession
• Pick a loss function L(θ̃, θ): the penalty for picking a
smoothed value θ̃ as against the 'true' value θ
• E.g.: the square error (θ̃ − θ)²
• For this choice of loss, the best choice of the
smoothed parameter is simply the expectation
of the posterior distribution on having
observed the data:
  θ̃ = E(θ | k, n) = (k + 1) / (n + 2)
Laplace Smoothing (contd.)
• Heuristic alternatives
• Lidstone's law of succession:
  θ̃ = (k + λ) / (n + 2λ)
• Derivation for the multinomial model
• there are |W| possible events, where W is the
vocabulary:
  θ̃_{c,t} = (1 + Σ_{d∈D_c} n(d,t)) / (|W| + Σ_{d∈D_c} Σ_τ n(d,τ))
(see the sketch below)
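A minimal sketch of training a multinomial naive Bayes model with this Laplace-style smoothing (it pairs with the log-space scorer sketched earlier):

from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels):
    # docs: list of term -> count dicts
    # smoothing: (1 + term count in class) / (|W| + total terms in class)
    vocab = {t for d in docs for t in d}
    term_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for d, c in zip(docs, labels):
        term_counts[c].update(d)
    theta, prior = {}, {}
    n = len(docs)
    for c in class_counts:
        total = sum(term_counts[c].values())
        theta[c] = {t: (1 + term_counts[c][t]) / (len(vocab) + total)
                    for t in vocab}
        prior[c] = class_counts[c] / n
    return prior, theta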
Performance analysis
• Multinomial naive Bayes classifier
generally outperforms the binary variant
• K-NN may outperform naïve Bayes
• Naïve Bayes is faster and more compact
• decision boundaries:
• regions of potential confusion
NB: Decision boundaries
• Bayesian classier partitions the
multidimensional term space into regions
• Within each region, the probability of one
class is higher than others
• On the boundaries, the probability of two
or more classes are exactly equal
• NB is a linear classifier
• it makes a decision between c = 1 and c = −1
• by thresholding the value of α_NB · d + b
for a suitable weight vector α_NB (b reflects the class priors)
Pitfalls
• Strong bias
• NB fixes the policy that α_NB(t) (the t-th component of
the linear discriminant) depends only on the
statistics of term t in the corpus.
• Therefore it cannot pick from the entire set of
possible linear discriminants.
Bayesian Networks
• Attempt to capture statistical dependencies
between terms themselves
• Approximations to the joint distribution over
terms
• Probability of a term occurring depends on
observation about other terms as well as the class
variable.
• A directed acyclic graph
• All random variables (classes and terms) are nodes
• Dependency edges are drawn from c to t for each
t.(parent-child edges)
• To represent additional dependencies between terms
dependency edges (parent child) are drawn
Bayesian networks. For the naive Bayes assumption, the only edges are from the class
variable to individual terms. Towards better approximations to the joint distribution over terms:
the probability of a term occurring may now depend on observations about other terms as well as
the class variable.
Bayesian Belief Network (BBN)
• DAG
• Parents Pa(X)
• nodes that are connected by directed edges to a node
X
• Fixing the values of the parent variables
completely determines the conditional
distribution of X
• Conditional Probability tables
• For discrete variables, the distribution data for X can
be stored in the obvious way as a table with each row
showing a set of values of the parents, the value of X,
and a conditional probability.
• Unlike Naïve Bayes
• P(d|c) is not a simple product over all terms.
  Pr(x_1, …, x_n) = Π_i Pr(x_i | pa(X_i))
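A minimal sketch of evaluating this product over a toy network, assuming conditional probability tables keyed by parent values (all names are illustrative):

def joint_probability(assignment, parents, cpt):
    # assignment: var -> value; parents: var -> tuple of parent variables;
    # cpt[var][parent values + (own value,)] = Pr(var = value | parents)
    p = 1.0
    for var, val in assignment.items():
        key = tuple(assignment[q] for q in parents[var]) + (val,)
        p *= cpt[var][key]
    return p

# e.g. a class variable C, edge C -> t1, and an extra term dependency t1 -> t2:
# parents = {"C": (), "t1": ("C",), "t2": ("C", "t1")}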
BBN: difficulty
• Getting a good network structure.
• At least quadratic time
• Enumeration of all pairs of features
• Exploited only for binary model
• Multinomial model
• Prohibitive CPT sizes
Exploiting hierarchy among topics
• Ordering between the class labels
• For Data warehousing
• E.g. : high, medium, or low cancer risk patients.
• Text Class labels:
• Taxonomy:
• large and complex class hierarchy that relates the class
labels
• Tree structure
• Simplest form of taxonomy
• widely used in directory browsing,
• often the output of clustering algorithms.
• inheritance:
• If class c0 is the parent of class c1, any training document
which belongs to c1 also belongs to c0.
Topic Hierarchies : Feature
selection
• Discriminating ability of a term sensitive to
the node (or class) in the hierarchy
• Measure of discrimination of a term
• Can be evaluated with respect to only internal
nodes of the hierarchy.
• 'can' may be a noisy word at the root node of
Yahoo!,
• but it can help classify documents under the sub-
tree of /Science/Environment/Recycling.
Topic Hierarchies: Enhanced
parameter estimation
• Uniform priors not good
• Idea
• If a parameter estimate is shaky at a node
with few training documents, perhaps we can
impose a strong prior from a well-trained
parent to repair the estimates.
• Shrinkage
• Seeks to improve estimates of descendants
using data from ancestors,
Shrinkage
• Assume multinomial model
• introduce a dummy class c0 as the parent of
the root c1, in which all terms are equally likely.
• For a specific path c0, c1, …, cn, the 'shrunk' estimate θ̃_{c_n,t}
is determined by a convex linear
interpolation of the MLE parameters at the ancestor
nodes up through c0
• Estimation of the mixing weights
• A simple form of the EM algorithm
• Determined empirically, by iteratively maximizing the
probability of a held-out portion H_n of the training set
for node c_n (a sketch follows below)
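A minimal sketch of the shrunk estimate as a convex combination along a root-to-node path (the mixing weights are assumed to have been tuned on held-out data, e.g. by EM):

def shrunk_estimate(path_mle, weights):
    # path_mle: MLE estimates theta_{c_i,t} along c0..cn (dummy root first);
    # weights: mixing weights lambda_i, summing to 1
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(lam * th for lam, th in zip(weights, path_mle))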
Shrinkage: Observation
• Improves accuracy beyond hierarchical
naïve Bayes,
• Improvement is high when data is sparse
• Capable of utilizing many more features
than Naïve Bayes
Topic search in Hierarchy
• By definition
• All documents are relevant to the root ‘topic’
• Pr(root|d) = 1.
• Given a test document d:
• Find one or more of the most likely leaf nodes
in the hierarchy.
• A document cannot belong to more than one
path, so for an internal node c0 with children {c_i}:
  Pr(c0 | d) = Σ_i Pr(c_i | d)
Topic search in Hierarchy: Greedy
Search strategy
• Search starts at the root
• Decisions are made greedily
• At each internal node pick the highest
probability class
• Continue
• Drawback
• Early errors cause compounding effect
Topic search in Hierarchy: Best-first
search strategy
• For finding m most probable leaf classes
• Find the weighted shortest path from the
root to a leaf.
• Edge (c0, ci) is assigned the (non-negative)
edge weight −log Pr(ci | c0, d), since
  log Pr(ci | d) = log Pr(c0 | d) + log Pr(ci | c0, d)
• To make best-first search different from
greedy search,
• rescale/smoothen the probabilities (a sketch follows below)
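A minimal Dijkstra-style sketch of this best-first search, assuming a cond_prob callable that returns Pr(child | node, d) for the test document:

import heapq, math

def best_first_leaves(root, children, cond_prob, m=1):
    # children: node -> list of child nodes (empty list for leaves)
    # cost of a node is -log Pr(node | d); edge weights are non-negative,
    # so the first m leaves popped are the m most probable ones
    frontier = [(0.0, root)]
    leaves = []
    while frontier and len(leaves) < m:
        cost, node = heapq.heappop(frontier)
        if not children[node]:
            leaves.append((math.exp(-cost), node))
            continue
        for ch in children[node]:
            p = cond_prob(ch, node)
            if p > 0:
                heapq.heappush(frontier, (cost - math.log(p), ch))
    return leaves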
Using best-first search on a hierarchy can improve both accuracy and speed. Because
the hierarchy has four internal nodes, the second column shows the number of features
for each. These were tuned so that the total number of features for both flat and best-first
classification are roughly the same (so that the model complexity is comparable). Because each
document belonged to exactly one leaf node, recall equals precision in this case and is
called 'accuracy'.
The semantics of hierarchical
classification
• Asymmetry
• training document can be associated with any
node,
• test document must be routed to a leaf,
• Routing test documents to internal nodes
• none of the children matches the document
• many children match the document
• the chances of making a mistake while
pushing down the test document one more
level may be too high.
• Research issue
Maximum entropy learners:
Motivation
• Bayesian learner
• first model Pr(d|c) at training time
• Apply Bayes rule at test time
• Two problems with Bayesian learners
• d is represented in a high-dimensional term
space
• => Pr(d|c) cannot be estimated accurately from a
training set of limited size.
• No systematic way of adding synthetic
features
• Such an addition may result in
• highly correlated features
Maximum entropy learners
• Assume that each document has only one
class label
• Indicator functions fj(c,d)
• Flag ‘j’th condition relating class c to document
d
• The expectation of indicator f_j is
  E f_j = Σ_{d,c} Pr(d,c) f_j(d,c) = Σ_d Pr(d) Σ_c Pr(c|d) f_j(d,c)
• Approximating Pr(d,c) and Pr(d) with their
empirical estimates P̃ over the training documents d_i gives the constraints
  Σ_i P̃(d_i, c_i) f_j(d_i, c_i) = Σ_i P̃(d_i) Σ_c Pr(c|d_i) f_j(d_i, c)
Principle of Maximum Entropy
• Constraints don’t determine Pr(c|d) uniquely
• Principle of Maximum Entropy:
• prefer the simplest model to explain observed data.
• Choose Pr(c|d) that maximizes the Entropy of Pr(c|d)
• In the event of empty training set we should consider
all classes to be equally likely,
• Constrained Optimization
• Maximize the entropy of the model distribution Pr(c|d)
 While obeying the constraints for all j
• Optimize by the method of Lagrange multipliers
Maximum Entropy solution
• Fitting the distribution to the data
involves two steps:
1. Identify a set of indicator functions derived
from the data.
2. Iteratively arrive at values for the parameters
that satisfy the constraints while maximizing
the entropy of the distribution being
modeled.
• An equivalent optimization problem: maximize Σ_{d∈D} log Pr(c_d | d), via the Lagrangian
  G(Pr(c|d), Λ) = Σ_{d,c} Pr(d) Pr(c|d) log Pr(c|d) + Σ_j λ_j ( Σ_{i,c} Pr(c|d_i) f_j(d_i, c) − Σ_i f_j(d_i, c_i) )
(the resulting model form is sketched below)
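A minimal sketch of the resulting parametric form Pr(c|d) proportional to exp(Σ_j λ_j f_j(d,c)), with one naive gradient step toward satisfying the constraints (a simplified stand-in for proper iterative scaling):

import math

def maxent_prob(d, classes, features, lam):
    # Pr(c|d) is proportional to exp(sum_j lam_j * f_j(d, c))
    scores = {c: sum(l * f(d, c) for l, f in zip(lam, features))
              for c in classes}
    mx = max(scores.values())
    z = sum(math.exp(s - mx) for s in scores.values())
    return {c: math.exp(scores[c] - mx) / z for c in classes}

def gradient_step(data, classes, features, lam, lr=0.1):
    # nudge each lambda_j so the expected f_j under the model matches
    # its empirical count over the training pairs (d, c)
    for j, f in enumerate(features):
        emp = sum(f(d, c) for d, c in data)
        exp = sum(p * f(d, c2)
                  for d, _ in data
                  for c2, p in maxent_prob(d, classes, features, lam).items())
        lam[j] += lr * (emp - exp) / len(data)
    return lam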
Text Classification using Maximum
Entropy Model
• Example
• Pick an indicator for each (class, term) combination.
• For the binary document model:
  f_{c',t}(c, d) = 1 if c = c' and t ∈ d, 0 otherwise
• For the multinomial document model:
  f_{c',t}(c, d) = n(d,t) / Σ_τ n(d,τ) if c = c', 0 otherwise
• What we gain with maximum entropy over naive
Bayes
• it does not suffer from the independence assumptions
• E.g.:
• if the terms t1 = machine and t2 = learning are often found
together in class c,
• the parameters λ_{c,t1} and λ_{c,t2} would be suitably discounted.
Performance of Maximum Entropy
Classifier
• Outperforms naive Bayes in accuracy, but
not consistently.
Discriminative classification
• Naïve Bayes and Maximum Entropy
Classifiers
• “induce” linear decision boundaries between
classes in the feature space.
• Discriminative classifiers
• Directly map the feature space to class labels
• Class labels are encoded as numbers
• e.g: +1 and –1 for two class problem
• Two examples
• Linear least-square regression
• Support Vector Machines
Linear least-square regression
• No inherent reason for going through the modeling step as in
Bayesian or maximum entropy classifier to get a linear discriminant.
• Linear regression problem
• Look for some arbitrary vector α (and offset b) such that α·d_i + b directly predicts
the label c_i of document d_i.
• Minimize the square error between the observed and predicted
class variable:
  Σ_i (α·d_i + b − c_i)²
• Widrow-Hoff (WH) update rule (sketched below)
• Scale α to norm 1
• Two equivalent interpretations
• The classifier is a hyperplane
• Documents are projected onto a direction
• Performance
• Comparable to naive Bayes and maximum entropy
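A minimal sketch of a Widrow-Hoff style update on sparse document vectors (learning rate and epoch count are illustrative choices):

def widrow_hoff(docs, labels, eta=0.05, epochs=10):
    # docs: term -> weight dicts; labels in {+1, -1}; learns alpha and b
    alpha, b = {}, 0.0
    for _ in range(epochs):
        for d, c in zip(docs, labels):
            pred = sum(alpha.get(t, 0.0) * w for t, w in d.items()) + b
            err = pred - c
            for t, w in d.items():          # alpha <- alpha - eta * err * d
                alpha[t] = alpha.get(t, 0.0) - eta * err * w
            b -= eta * err
    # optionally rescale alpha to unit norm before use
    return alpha, b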
Support vector machines
• Assumption : training and test population are drawn from
the same distribution
• Hypothesis
• Hyperplane that is close to many training data points has a
greater chance of misclassifying test instances
• A hyperplane which passes through a “no-man's land”, has lower
chances of misclassifications
 Make a decision by thresholding α_SVM · d + b
 Seek an α_SVM which maximizes the distance of any
training point from the hyperplane:
  minimize (1/2) ||α||²
  subject to c_i (α · d_i + b) ≥ 1, i = 1, …, n
Support vector machines
• Optimal separator
• Orthogonal to the shortest line connecting the
convex hull of the two classes
• Intersects this shortest line halfway
• Margin:
• distance of any training point from the
optimized hyperplane
• It is at least 1/||α||
Illustration of the SVM optimization problem.
SVMs: non separable classes
• Classes in the training data not always
separable.
• Introduce fudge (slack) variables ξ_i:
  minimize (1/2) α · α + C Σ_i ξ_i
  subject to c_i (α · d_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0, i = 1, …, n
• Equivalent dual:
  maximize Σ_i λ_i − (1/2) Σ_{i,j} λ_i λ_j c_i c_j (d_i · d_j)
  subject to Σ_i c_i λ_i = 0 and 0 ≤ λ_i ≤ C, i = 1, …, n
(a training sketch follows below)
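Rather than solving the dual QP, a minimal sketch can minimize the primal soft-margin objective by stochastic subgradient steps on the hinge loss (a simplified stand-in for a real SVM package):

def svm_sgd(docs, labels, C=1.0, eta=0.01, epochs=20):
    # primal objective 1/2 ||alpha||^2 + C * sum of slacks, minimized by
    # subgradient steps; docs are term -> weight dicts, labels in {+1, -1}
    alpha, b = {}, 0.0
    for _ in range(epochs):
        for d, c in zip(docs, labels):
            margin = c * (sum(alpha.get(t, 0.0) * w for t, w in d.items()) + b)
            for t in list(alpha):           # regularizer: shrink the weights
                alpha[t] *= (1 - eta)
            if margin < 1:                  # margin violated: xi_i > 0
                for t, w in d.items():
                    alpha[t] = alpha.get(t, 0.0) + eta * C * c * w
                b += eta * C * c
    return alpha, b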
SVMs: Complexity
• Quadratic optimization problem.
• Working set: refine a few λ's at a time, holding
the others fixed.
• On-demand computation of inner products
• With n documents, training time scales empirically as n^a, with 1.7 ≤ a ≤ 2.1
• Recent SVM packages
• achieve linear time by clever selection of working
sets.
Performance
• Comparison with other classifiers
• Amongst most accurate classifier for text
• Better accuracy than naive Bayes and
decision tree classifier,
• interesting revelation
• Linear SVMs suffice
• standard text classification tasks have classes
almost separable using a hyperplane in
feature space
• Research issues
• Non-linear SVMs
SVM training time variation as the training set size is increased, with and without
sufficient memory to hold the training set. In the latter case, the memory is set to about a
quarter of that needed by the training set.
Comparison of LSVM with previous classifiers on the Reuters data set (data taken from
Dumais). (The naive Bayes classifier used binary features, so its accuracy can be
improved.)
Comparison of accuracy across three classifiers: Naive Bayes, Maximum Entropy and Linear
SVM, using three data sets: 20 newsgroups, the Recreation sub-tree of the Open Directory,
and University Web pages from WebKB.
Comparison between several classifiers using the Reuters collection.
Hypertext classification
• Techniques to address hypertextual
features.
• Document Object Model (DOM)
• a well-formed HTML document is a properly
nested hierarchy of regions, represented as a tree
• In the DOM tree,
• internal nodes are elements
• some of the leaf nodes are segments of text
• other nodes are hyperlinks to other Web
pages
Representing hypertext for
supervised learning
• Paying special attention to tags can help
with learning
• keyword-based search
• assign heuristic weights to terms that occur in
specific HTML tags
• Example…….. (next slide)
Prefixing with tags
• Distinguish between the two
occurrences of the word "surfing"
• by prefixing each term with the sequence of tags
on the path from the DOM root to the term
• A repeated term in different sections
should reinforce belief in a class label
• Using a maximum entropy classifier,
• accumulate evidence from different features
• maintain both forms of each term:
• plain text and prefixed text (all path prefixes), as in the sketch below
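A minimal sketch of generating both plain and path-prefixed terms using only the standard-library HTML parser (the dot-separated tag-path encoding is an illustrative choice):

from html.parser import HTMLParser

class PrefixTagger(HTMLParser):
    # Emits each token both plain and prefixed by its DOM tag path,
    # e.g. "surfing" and "html.body.a.surfing".
    def __init__(self):
        super().__init__()
        self.path, self.terms = [], []
    def handle_starttag(self, tag, attrs):
        self.path.append(tag)
    def handle_endtag(self, tag):
        if self.path and self.path[-1] == tag:
            self.path.pop()
    def handle_data(self, data):
        for tok in data.lower().split():
            self.terms.append(tok)                          # plain form
            self.terms.append(".".join(self.path + [tok]))  # prefixed form

p = PrefixTagger()
p.feed("<html><body><a>surfing</a><p>surfing</p></body></html>")
print(p.terms)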
Experiments
• 10,705 patents from the US Patent Office:
• 70% error with a plain-text classifier
• 24% error with path-tagged terms
• 17% error with path prefixes
• 1700 resumes (with a naive Bayes classifier):
• 53% error with flattened HTML
• 40% error with prefix-tagged terms
Limitations
• Prefix representations are
• ad hoc
• inflexible
• Generalizability:
• how to incorporate additional features?
• E.g.: adding features derived from hyperlinks
• Relations
• a uniform way to codify hypertextual features
• Example:
Rule Induction for relational
learning
• Inductive classifiers
• discover rules from a collection of relations.
• Example solution for the relational representation above
• Goal: discover a set of predicate rules
• Consider a 2-class setting
• positive examples D+ and negative examples
D−
• Test instance:
• if a learned rule fires, it is a positive instance; else a negative
instance.
Rule induction with First Order
Inductive Logic (FOIL)
• Well-known rule learner
• Start with empty rule set
1. learn new (disjunctive) rule
2. add conjunctive literals to the new rule until
no negative example is covered by the new
rule.
3. pick a literal which increases the ratio of
surviving positive to negative bindings
rapidly.
4. Remove positive examples covered by any
rule generated thus far.
Literals Explored
• Q(X_1, …, X_k), where Q is a relation and the X_i
are variables, at least one of which must
already be bound.
• equality and comparison literals such as X_i = X_j and X_i = c,
where X_i, X_j are variables and c is a constant.
• not(L), where L is a literal of the above
forms.
Analysis
• Can learn class labels for individual pages
• Can learn relationships between labels
• member(homePage, department)
• teaches(homePage, coursePage)
• advises(homePage, homePage)
• writes(homePage, paper)
• Hybrid approaches
• Statistical classifier
• more complex search for literals
• Inductive learning
• comparing the estimated probabilities of various classes.
• Recursively labeling relations
• E.g.: relating page label in terms of labels of neighboring pages
 classified(A, facultyPage) :-
 links-to(A, B), classified(B, studentPage),
 links-to(A, C), classified(C, coursePage),
 links-to(A, D), classified(D, publicationsPage).
 

  • 11. Mining the Web Chakrabarti & Ramakrishnan 11 Measures of accuracy
     Assumptions
    • Each document is associated with exactly one class, OR
    • Each document is associated with a subset of classes.
     Confusion matrix (M)
    • For more than 2 classes
    • M[i, j]: number of test documents belonging to class i which were assigned to class j
    • Perfect classifier: only the diagonal elements M[i, i] would be nonzero.
  • 12. Mining the Web Chakrabarti & Ramakrishnan 12 Evaluating classifier accuracy
     Two-way ensemble
    • To avoid searching over the power set of class labels in the subset scenario
    • Create positive and negative classes for each document d (e.g., "Sports" and "Not sports" (all remaining documents))
     Recall and precision
    • Contingency matrix per (d, c) pair, where C_d is the set of true classes of d:
      M_{d,c}[0,0] = 1 if c ∈ C_d and the classifier outputs c
      M_{d,c}[0,1] = 1 if c ∈ C_d and the classifier does not output c
      M_{d,c}[1,0] = 1 if c ∉ C_d and the classifier outputs c
      M_{d,c}[1,1] = 1 if c ∉ C_d and the classifier does not output c
  • 13. Mining the Web Chakrabarti & Ramakrishnan 13 Evaluating classifier accuracy (contd.)
    • Micro-averaged contingency matrix: M = Σ_{d,c} M_{d,c}
    • Micro-averaged precision and recall (equal importance for each document):
      precision(M) = M[0,0] / (M[0,0] + M[1,0])
      recall(M) = M[0,0] / (M[0,0] + M[0,1])
    • Macro-averaged precision and recall (equal importance for each class): per-class matrix M_c = Σ_d M_{d,c}, then
      precision(M_c) = M_c[0,0] / (M_c[0,0] + M_c[1,0])
      recall(M_c) = M_c[0,0] / (M_c[0,0] + M_c[0,1])
      averaged over the classes with weight 1/|C| each
  • 14. Mining the Web Chakrabarti & Ramakrishnan 14 Evaluating classifier accuracy (contd.)
    • Precision-recall tradeoff
     Plot of precision vs. recall: the better classifier's curve lies closer to the top-right corner
     Harmonic mean: discard classifiers that sacrifice one measure for the other
      F1 = 2 · precision · recall / (precision + recall)
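The averaging conventions above are easy to get wrong in code. The following minimal Python sketch (all names are illustrative, not from the source) pools the per-(document, class) contingency cells and reports micro- and macro-averaged precision, recall, and F1:

    def evaluate(gold, pred, classes):
        # gold and pred map each document id to a set of class labels.
        tp = {c: 0 for c in classes}  # M_c[0,0]: c is true and was output
        fn = {c: 0 for c in classes}  # M_c[0,1]: c is true but not output
        fp = {c: 0 for c in classes}  # M_c[1,0]: c is false but was output
        for d in gold:
            for c in classes:
                truth, out = c in gold[d], c in pred.get(d, set())
                if truth and out:
                    tp[c] += 1
                elif truth:
                    fn[c] += 1
                elif out:
                    fp[c] += 1
        f1 = lambda p, r: 2 * p * r / (p + r) if p + r else 0.0
        # Micro-averaging pools the counts: equal importance per document.
        TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
        micro_p = TP / (TP + FP) if TP + FP else 0.0
        micro_r = TP / (TP + FN) if TP + FN else 0.0
        # Macro-averaging averages per-class scores: equal importance per class.
        macro_p = sum(tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
                      for c in classes) / len(classes)
        macro_r = sum(tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
                      for c in classes) / len(classes)
        return micro_p, micro_r, f1(micro_p, micro_r), macro_p, macro_r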
  • 15. Mining the Web Chakrabarti & Ramakrishnan 15 Nearest Neighbor classifiers
     Intuition
    • Similar documents are expected to be assigned the same class label.
    • Vector space model + cosine similarity
    • Training: index each document and remember its class label
    • Testing: fetch the k most similar documents to the given document
      – Majority class wins
      – Alternative: weighted counts – counts of classes weighted by the corresponding similarity measure
      – Alternative: a per-class offset b_c, tuned by testing the classifier on a portion of the training data held out for this purpose (a sketch of the basic classifier follows)
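A minimal sketch of the classifier just described, assuming raw TF vectors, cosine similarity, and the similarity-weighted vote variant; all function and variable names are illustrative:

    import math
    from collections import Counter, defaultdict

    def tf_vector(tokens):
        # Length-normalized term-frequency vector, so a dot product is cosine.
        v = Counter(tokens)
        norm = math.sqrt(sum(w * w for w in v.values()))
        return {t: w / norm for t, w in v.items()}

    def knn_classify(train, test_tokens, k=3):
        # train: list of (token_list, class_label) pairs.
        q = tf_vector(test_tokens)
        sims = []
        for tokens, label in train:
            v = tf_vector(tokens)
            sims.append((sum(w * v[t] for t, w in q.items() if t in v), label))
        sims.sort(reverse=True)
        votes = defaultdict(float)
        for sim, label in sims[:k]:
            votes[label] += sim  # similarity-weighted variant of majority vote
        return max(votes, key=votes.get)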
  • 16. Mining the Web Chakrabarti & Ramakrishnan 16 Nearest neighbor classification
  • 17. Mining the Web Chakrabarti & Ramakrishnan 17 Pros
     Easy availability and reuse of an inverted index
     Collection updates are trivial
     Accuracy comparable to the best known classifiers
  • 18. Mining the Web Chakrabarti & Ramakrishnan 18 Cons
     Iceberg category questions: classifying a test document d_q
    • involves as many inverted index lookups as there are distinct terms in d_q,
    • scoring the (possibly large number of) candidate documents which overlap with d_q in at least one word,
    • sorting by overall similarity,
    • picking the best k documents
     Space overhead and redundancy
    • Data stored at the level of individual documents
    • No distillation
  • 19. Mining the Web Chakrabarti & Ramakrishnan 19 Workarounds
     To reduce space requirements and speed up classification
    • Find clusters in the data
    • Store only a few statistical parameters per cluster
    • Compare the test document with documents in only the most promising clusters
     But again…
    • Ad-hoc choices for the number and size of clusters and parameters
    • k is corpus-sensitive
  • 20. Mining the Web Chakrabarti & Ramakrishnan 20 TF-IDF
     TF-IDF is computed for the whole corpus
     Interclass correlations and term frequencies are unaccounted for
     Terms which occur relatively frequently in some classes compared to others should have higher importance
     Overall rarity in the corpus is not as important.
  • 21. Mining the Web Chakrabarti & Ramakrishnan 21 Feature selection
     Data sparsity:
    • The term distribution could be estimated well only if the training set were much larger than the test set
    • Not the case, however: a vocabulary W induces on the order of 2^|W| possible documents
    • For Reuters, only about 10,300 documents are available
     Over-fitting problem
    • The joint distribution may fit the training instances…
    • but may not fit unforeseen test data that well
  • 22. Mining the Web Chakrabarti & Ramakrishnan 22 Marginals rather than joint
     Estimate the marginal distribution of each term in each class
     Empirical distributions may still not reflect actual distributions if data is sparse
     Therefore feature selection
    • Purposes:
      Improve accuracy by avoiding over-fitting
      Maintain accuracy while discarding as many features as possible, saving a great deal of space for storing statistics
    • Approaches: heuristic, guided by linguistic and domain knowledge, or statistical.
  • 23. Mining the Web Chakrabarti & Ramakrishnan 23 Feature selection
     Perfect feature selection
    • goal-directed: pick all possible subsets of features
    • for each subset, train and test a classifier
    • retain the subset which results in the highest accuracy
    • COMPUTATIONALLY INFEASIBLE
     Simple heuristics
    • Stop words like "a", "an", "the", etc.
    • Empirically chosen thresholds (task- and corpus-sensitive) for discarding "too frequent" or "too rare" terms
     Larger and more complex data sets
    • Confusion with stop words, especially for topic hierarchies
     Greedy inclusion (bottom-up) vs. top-down truncation
  • 24. Mining the Web Chakrabarti & Ramakrishnan 24 Greedy inclusion algorithm
     Most commonly used in text
     Algorithm:
    1. Compute, for each term, a measure of discrimination amongst classes.
    2. Arrange the terms in decreasing order of this measure.
    3. Retain a number of the best terms or features for use by the classifier.
     Greedy because the measure of discrimination of a term is computed independently of other terms
     Over-inclusion has only mild effects on accuracy
  • 25. Mining the Web Chakrabarti & Ramakrishnan 25 Measure of discrimination
    • The choice depends on
      the model of documents
      the desired speed of training
      the ease of updates to documents and class assignments
    • Observation: the feature sets included for acceptable accuracy tend to have large overlap across measures.
  • 26. Mining the Web Chakrabarti & Ramakrishnan 26 The χ² test
    • Similar to the likelihood ratio test
    • Build a 2 × 2 contingency matrix per class-term pair:
      k_{i,1} = number of documents in class i containing term t
      k_{i,0} = number of documents in class i not containing term t
    • Under the independence hypothesis, χ² aggregates the deviations of observed values from expected values
    • The larger the value of χ², the lower is our belief that the independence assumption is upheld by the observed data.
  • 27. Mining the Web Chakrabarti & Ramakrishnan 27 The χ² test (contd.)
    • For term event I_t and class label C over n documents, with k_{l,m} ≈ n Pr(C = l, I_t = m):
      χ² = n (k_{11} k_{00} − k_{10} k_{01})² / ( (k_{11} + k_{10}) (k_{01} + k_{00}) (k_{11} + k_{01}) (k_{10} + k_{00}) )
    • Feature selection process (a sketch of the score follows):
      Sort terms in decreasing order of their χ² values
      Train several classifiers with a varying number of features
      Stop at the point of maximum accuracy.
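A sketch of the per-(class, term) score, directly from the 2 × 2 formula above; names are illustrative:

    def chi_square(k11, k10, k01, k00):
        # k11: docs in the class containing t; k10: docs in the class without t;
        # k01, k00: the same counts over documents outside the class.
        n = k11 + k10 + k01 + k00
        den = (k11 + k10) * (k01 + k00) * (k11 + k01) * (k10 + k00)
        return n * (k11 * k00 - k10 * k01) ** 2 / den if den else 0.0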
  • 28. Mining the Web Chakrabarti & Ramakrishnan 28 Mutual information
    • Useful when the multinomial document model is used
    • X and Y are discrete random variables taking values x, y
    • Mutual information (MI) between them is defined as
      MI(X, Y) = Σ_x Σ_y Pr(x, y) log( Pr(x, y) / (Pr(x) Pr(y)) )
    • A measure of the extent of dependence between random variables:
      the extent to which the joint deviates from the product of the marginals,
      weighted by the distribution mass at (x, y)
  • 29. Mining the Web Chakrabarti & Ramakrishnan 29 Mutual Information
     Advantages
    • To the extent MI(X, Y) is large, X and Y are dependent.
    • Deviations from independence at rare values of (x, y) are played down
     Interpretations
    • Reduction in the entropy of Y given X: MI(X, Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
    • KL distance between the no-independence hypothesis and the independence hypothesis
    • The KL distance gives the average number of bits wasted by encoding events from the 'correct' distribution using a code based on a not-quite-right distribution
  • 30. Mining the Web Chakrabarti & Ramakrishnan 30 Feature selection with MI
    • Fix a term t and let I_t be an event associated with that term
    • E.g., for the binary model, I_t = 0/1
    • Pr(I_t) = the empirical fraction of documents in the training set in which the event occurred
    • Pr(I_t, c) = the empirical fraction of training documents in which the event occurred and which are in class c
    • Pr(c) = fraction of training documents belonging to class c
    • Formula (a sketch follows):
      MI(I_t, C) = Σ_{l,m} (k_{l,m} / n) log( (k_{l,m} / n) / ( ((k_{l,0} + k_{l,1}) / n) ((k_{0,m} + k_{1,m}) / n) ) )
    • Problem: document lengths are not normalized.
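A sketch of the MI score computed from the counts k_{l,m} above, with l indexing the term event and m the class; names and the list-of-lists layout are illustrative:

    import math

    def mutual_information(k):
        # k[l][m]: number of training documents with term event l and class m.
        n = sum(sum(row) for row in k)
        mi = 0.0
        for l, row in enumerate(k):
            for m, klm in enumerate(row):
                if klm:
                    p_lm = klm / n
                    p_l = sum(k[l]) / n                 # marginal over classes
                    p_m = sum(r[m] for r in k) / n      # marginal over events
                    mi += p_lm * math.log(p_lm / (p_l * p_m))
        return mi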
  • 31. Mining the Web Chakrabarti & Ramakrishnan 31 Fisher's discrimination index
    • Useful when documents are scaled to constant length
    • Term occurrences are regarded as fractional real numbers
    • E.g., the two-class case:
      Let X and Y be the sets of length-normalized document vectors corresponding to the two classes
      Centroids: μ_X = (Σ_{x∈X} x) / |X| and μ_Y = (Σ_{y∈Y} y) / |Y|
      Covariance matrices: S_X = (1/|X|) Σ_{x∈X} (x − μ_X)(x − μ_X)^T and S_Y = (1/|Y|) Σ_{y∈Y} (y − μ_Y)(y − μ_Y)^T
  • 32. Mining the Web Chakrabarti & Ramakrishnan 32 Fisher's discrimination index (contd.)
    • Goal: find a projection of the data sets X and Y onto a line such that the two projected centroids are far apart compared to the spread of the point sets projected onto the same line
    • Find a column vector α such that the ratio of
      – the square of the difference in projected means, (α^T (μ_X − μ_Y))²
      – to the average projected variance, α^T ((S_X + S_Y)/2) α
      is maximized:
      α* = argmax_α (α^T (μ_X − μ_Y))² / (α^T (S_X + S_Y) α)
  • 33. Mining the Web Chakrabarti & Ramakrishnan 33 Fisher's discrimination index (contd.)
    • If X and Y, for both the training and test data, are generated from multivariate Gaussian distributions with S_X = S_Y = S, then α* = S⁻¹(μ_X − μ_Y) induces the optimal (minimum-error) classifier, by suitable thresholding on α*^T q for a test point q
    • Problems
      Inverting S would be unacceptably slow for tens of thousands of dimensions
      Linear transformations would destroy already existing sparsity
  • 34. Mining the Web Chakrabarti & Ramakrishnan 34 Solution
    • Recall: the goal was to eliminate terms from consideration, not to arrive at linear projections involving multiple terms
    • Regard each term t as providing a candidate direction α_t which is parallel to the corresponding axis in the vector space model
    • Compute the Fisher index of t along that direction
  • 35. Mining the Web Chakrabarti & Ramakrishnan 35 FI : Solution (contd.)
    • Formula for the two-class case:
      FI(t) = (μ_{X,t} − μ_{Y,t})² / ( (1/|X|) Σ_{x∈X} (x_t − μ_{X,t})² + (1/|Y|) Σ_{y∈Y} (y_t − μ_{Y,t})² )
    • Can be generalized to a set {c} of more than two classes:
      FI(t) = Σ_{c1,c2} (μ_{c1,t} − μ_{c2,t})² / Σ_c (1/|D_c|) Σ_{d∈D_c} (x_{d,t} − μ_{c,t})²
    • Feature selection: terms are sorted in decreasing order of FI(t) and the best ones are chosen as features (a sketch follows)
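A sketch of per-term Fisher index ranking for the two-class case, assuming X and Y are lists of length-normalized sparse vectors represented as dicts; all names are illustrative:

    def fisher_ranking(X, Y, terms):
        scores = {}
        for t in terms:
            mu_x = sum(d.get(t, 0.0) for d in X) / len(X)
            mu_y = sum(d.get(t, 0.0) for d in Y) / len(Y)
            var_x = sum((d.get(t, 0.0) - mu_x) ** 2 for d in X) / len(X)
            var_y = sum((d.get(t, 0.0) - mu_y) ** 2 for d in Y) / len(Y)
            # Squared separation of per-term centroids over the pooled spread.
            scores[t] = ((mu_x - mu_y) ** 2 / (var_x + var_y)
                         if var_x + var_y else 0.0)
        return sorted(terms, key=scores.get, reverse=True)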
  • 36. Mining the Web Chakrabarti & Ramakrishnan 36 Validation
    • How to decide a cut-off rank?
    • Validation approach
      A portion of the training documents is held out
      The rest is used to do term ranking
      The held-out set is used as a test set
      Various cut-off ranks can be tested using the same held-out set
    • Alternatives: leave-one-out cross-validation, or partitioning the data into two; an aggregate accuracy is computed over all trials
    • Wrapper to search for the number of features, in decreasing order of discriminative power
  • 37. Mining the Web Chakrabarti & Ramakrishnan 37 Validation (contd.)
    • Simple search heuristic: keep adding one feature at every step until the classifier's accuracy ceases to improve
    • Figure: a general illustration of wrapping for feature selection.
  • 38. Mining the Web Chakrabarti & Ramakrishnan 38 Validation (contd.)
    • For a naive Bayes-like classifier, evaluation on many choices of feature sets can be done at once
    • For Maximum Entropy or Support Vector Machines, validation essentially involves training a classifier from scratch for each choice of the cut-off rank, and is therefore inefficient
  • 39. Mining the Web Chakrabarti & Ramakrishnan 39 Validation : observations
    • A Bayesian classifier cannot over-fit much
    • Figure: effect of feature selection on Bayesian classifiers
  • 40. Mining the Web Chakrabarti & Ramakrishnan 40 Truncation algorithms
    • Start from the complete set of terms T
    1. Keep selecting terms to drop
    2. until you end up with a feature subset F
    3. Question: when should you stop truncating?
    • Two objectives
      Minimize the size of the selected feature set F
      Keep the distorted distribution Pr(C|F) as similar as possible to the original Pr(C|T)
  • 41. Mining the Web Chakrabarti & Ramakrishnan 41 Truncation Algorithms: Example
    • Kullback-Leibler (KL) divergence: measures the similarity or distance between two distributions
    • Markov blanket
      Let X ∈ T be a feature and M ⊆ T − {X}
      If the presence of M renders the presence of X unnecessary as a feature, then M is a Markov blanket for X
    • Technically, M is a Markov blanket for X if X is conditionally independent of (C ∪ T) − (M ∪ {X}) given M
    • Eliminating a variable because it has a Markov blanket contained in other existing features does not increase the KL distance between Pr(C|T) and Pr(C|F).
  • 42. Mining the Web Chakrabarti & Ramakrishnan 42 Finding Markov Blankets
    • Exact Markov blankets are usually absent in practice
    • Finding approximate Markov blankets
      Purpose: to cut down computational complexity
      Restrict the search for Markov blankets M to those with at most k features
      Given feature X, restrict the search for members of M to those features which are most strongly correlated with X (using tests similar to the χ² or MI tests)
    • Example: for the Reuters dataset, over two-thirds of T could be discarded while increasing classification accuracy
  • 43. Mining the Web Chakrabarti & Ramakrishnan 43 Feature Truncation algorithm
    1. while the truncated Pr(C|F) is reasonably close to the original Pr(C|T) do
    2.   for each remaining feature X do
    3.     Identify a candidate Markov blanket M:
    4.     for some tuned constant k, find the set M of k variables in F − {X} that are most strongly correlated with X
    5.     Estimate how good a blanket M is:
    6.     estimate Σ_{x_M, x_X} Pr(M = x_M, X = x_X) · KL( Pr(C | M = x_M, X = x_X) ‖ Pr(C | M = x_M) ) (see the sketch below)
    7.   end for
    8.   Eliminate the feature having the best surviving Markov blanket
    9. end while
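A minimal sketch of step 6 only, assuming the needed empirical distributions have already been tabulated; the data layouts and names are assumptions, not from the source:

    import math

    def kl(p, q):
        # KL divergence between two distributions over class labels (dicts).
        return sum(pc * math.log(pc / q[c]) for c, pc in p.items() if pc > 0)

    def blanket_score(joint, cond_with_x, cond_without_x):
        # joint[(xm, xx)] = Pr(M=xm, X=xx)
        # cond_with_x[(xm, xx)] = Pr(C | M=xm, X=xx)
        # cond_without_x[xm] = Pr(C | M=xm)
        # A small score means M is a good approximate blanket: dropping X
        # barely distorts the conditional class distribution.
        return sum(w * kl(cond_with_x[(xm, xx)], cond_without_x[xm])
                   for (xm, xx), w in joint.items())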
  • 44. Mining the Web Chakrabarti & Ramakrishnan 44 General observations on feature selection
    • The issue of document length should be addressed properly
    • The choice of association measures does not make a dramatic difference
    • Greedy inclusion algorithms scale nearly linearly with the number of features
    • The Markov blanket technique takes time proportional to at least |T|^k
    • Advantage of the Markov blanket algorithm over greedy inclusion
      A greedy algorithm may include features with high individual correlations even though one subsumes the other
      Features individually uncorrelated with the class could be jointly more correlated with it (though this rarely happens)
    • The binary feature selection view may not be the only view to subscribe to
  • 45. Mining the Web Chakrabarti & Ramakrishnan 45 Bayesian Learner
    • A very practical text classifier
    • Assumptions
    1. A document can belong to exactly one of a set of classes or topics
    2. Each class c has an associated prior probability Pr(c)
    3. There is a class-conditional document distribution Pr(d|c) for each class
    • Posterior probability, obtained using Bayes rule:
      Pr(c|d) = Pr(c) Pr(d|c) / Σ_{c'} Pr(c') Pr(d|c')
    • The parameter set Θ consists of all the Pr(d|c)
  • 46. Mining the Web Chakrabarti & Ramakrishnan 46 Parameter Estimation for Bayesian Learner
    • The estimate of Θ is based on two sources of information:
    1. Prior knowledge on the parameter set before seeing any training documents
    2. Terms in the training documents D
    • Bayes optimal classifier
      Takes the expectation of each parameter over Pr(Θ|D):
      Pr(c|d) = Σ_Θ Pr(c|d, Θ) Pr(Θ|D)
      Computationally infeasible
    • Maximum likelihood estimate
      Replace the sum above with the value of the summand Pr(c|d, Θ) for Θ = argmax_Θ Pr(D|Θ)
      Works poorly
  • 47. Mining the Web Chakrabarti & Ramakrishnan 47 Naïve Bayes Classifier
    • "Naïve" because of the assumption of independence between terms: the joint term distribution is the product of the marginals
    • Widely used owing to the simplicity and speed of training, applying, and updating
    • Two kinds of widely used marginals for text
      Binary model
      Multinomial model
  • 48. Mining the Web Chakrabarti & Ramakrishnan 48 Naïve Bayes Models
    • Binary model
      Each parameter φ_{c,t} indicates the probability that a document in class c will mention term t at least once:
      Pr(d|c) = Π_{t∈d} φ_{c,t} · Π_{t∈W, t∉d} (1 − φ_{c,t}), the second product accounting for terms of W absent from d
    • Multinomial model
      Each class has an associated die with |W| faces; each parameter θ_{c,t} denotes the probability of face t turning up on tossing the die
      Term t occurs n(d, t) times in document d; document length l_d is a random variable denoted L:
      Pr(d|c) = Pr(L = l_d | c) Pr(d | l_d, c), where Pr(d | l_d, c) = ( l_d choose {n(d, t)} ) Π_t θ_{c,t}^{n(d,t)}
  • 49. Mining the Web Chakrabarti & Ramakrishnan 49 Analysis of Naïve Bayes Models
    1. We multiply together a large number of small probabilities
      Result: extremely tiny probabilities as answers
      Solution: store all numbers as logarithms
    2. The class which comes out at the top wins by a huge margin
      Sanitize scores using the likelihood ratio LR(d) = Pr(C = 1 | d) / Pr(C = −1 | d)
      and the logit function: logit(d) = 1 / (1 + e^{−log LR(d)})
  • 50. Mining the Web Chakrabarti & Ramakrishnan 50 Parameter smoothing
    • What if a test document q contains a term t that never occurred in any training document in class c?
    • Answer: Pr(c|q) will be zero, even if many other terms clearly hint at a high likelihood of class c generating the document
    • Bayesian estimation: estimating a probability from insufficient data
      If you toss a coin n times and it always comes up heads, what is the probability that the (n + 1)th toss will also come up heads?
      Posit a prior distribution π(θ) on the parameter θ, e.g. the uniform distribution
      Resultant posterior distribution:
      π(θ | k, n) = π(θ) Pr(k, n | θ) / ∫₀¹ π(p) Pr(k, n | p) dp
  • 51. Mining the Web Chakrabarti & Ramakrishnan 51 Laplace Smoothing
    • Based on Bayesian estimation, via a loss function (penalty) L(θ̃, θ) for picking a smoothed value θ̃ as against the 'true' value θ
    • E.g., with the square error loss, the best choice of the smoothed parameter is simply the expectation of the posterior distribution having observed the data:
      θ̃ = E( π(θ | k, n) ) = (k + 1) / (n + 2)
    • This is Laplace's law of succession
  • 52. Mining the Web Chakrabarti & Ramakrishnan 52 Laplace Smoothing (contd.)
    • Heuristic alternative: Lidstone's law of succession, θ̃ = (k + λ) / (n + 2λ)
    • Derivation for the multinomial model, where there are |W| possible events (W is the vocabulary):
      θ_{c,t} = (1 + Σ_{d∈D_c} n(d, t)) / (|W| + Σ_{d∈D_c} Σ_τ n(d, τ))
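Pulling slides 48 to 52 together, here is a minimal multinomial naive Bayes sketch with Laplace smoothing, working in log space to avoid underflow (see slide 49); the class and method names are illustrative:

    import math
    from collections import Counter

    class MultinomialNB:
        def fit(self, docs, labels):
            # docs: list of token lists; labels: parallel list of class labels.
            self.classes = sorted(set(labels))
            self.V = len({t for d in docs for t in d})
            self.tf = {c: Counter() for c in self.classes}
            for d, c in zip(docs, labels):
                self.tf[c].update(d)
            self.total = {c: sum(self.tf[c].values()) for c in self.classes}
            n_docs = Counter(labels)
            self.log_prior = {c: math.log(n_docs[c] / len(docs))
                              for c in self.classes}
            return self

        def _log_theta(self, c, t):
            # Laplace-smoothed estimate: nonzero even for unseen terms.
            return math.log((1 + self.tf[c][t]) / (self.V + self.total[c]))

        def predict(self, doc):
            # Sum of logs instead of a product of tiny probabilities.
            return max(self.classes, key=lambda c: self.log_prior[c]
                       + sum(self._log_theta(c, t) for t in doc))

For example, MultinomialNB().fit([["ball", "goal"], ["bank", "rate"]], ["sports", "finance"]).predict(["goal"]) returns "sports".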
  • 53. Mining the Web Chakrabarti & Ramakrishnan 53 Performance analysis
    • The multinomial naive Bayes classifier generally outperforms the binary variant
    • k-NN may outperform naïve Bayes
    • Naïve Bayes is faster and more compact
    • Decision boundaries mark the regions of potential confusion
  • 54. Mining the Web Chakrabarti & Ramakrishnan 54 NB: Decision boundaries
    • A Bayesian classifier partitions the multidimensional term space into regions
      Within each region, the probability of one class is higher than the others
      On the boundaries, the probabilities of two or more classes are exactly equal
    • NB is a linear classifier
      It makes a decision between c = 1 and c = −1 by thresholding the value of α_NB · d + b for a suitable vector α_NB (b derives from the priors)
  • 55. Mining the Web Chakrabarti & Ramakrishnan 55 Pitfalls
    • Strong bias
      NB fixes the policy that α_NB(t), the t-th component of the linear discriminant, depends only on the statistics of term t in the corpus
      Therefore it cannot pick from the entire set of possible linear discriminants
  • 56. Mining the Web Chakrabarti & Ramakrishnan 56 Bayesian Networks
    • An attempt to capture statistical dependencies between the terms themselves
    • Approximations to the joint distribution over terms
    • The probability of a term occurring may depend on observations about other terms as well as the class variable
    • A directed acyclic graph
      All random variables (classes and terms) are nodes
      Dependency edges are drawn from c to t for each t (parent-child edges)
      To represent additional dependencies between terms, further parent-child edges are drawn
  • 57. Mining the Web Chakrabarti & Ramakrishnan 57 Bayesian networks. For the naive Bayes assumption, the only edges are from the class variable to individual terms. Towards better approximations to the joint distribution over terms: the probability of a term occurring may now depend on observations about other terms as well as the class variable.
  • 58. Mining the Web Chakrabarti & Ramakrishnan 58 Bayesian Belief Network (BBN)
    • A DAG in which the joint distribution factors as Pr(x) = Π_X Pr(x_X | pa(X))
    • Parents Pa(X): the nodes connected by directed edges into node X
      Fixing the values of the parent variables completely determines the conditional distribution of X
    • Conditional probability tables (CPTs)
      For discrete variables, the distribution data for X can be stored in the obvious way as a table: each row shows a set of values of the parents, a value of X, and a conditional probability
    • Unlike naïve Bayes, Pr(d|c) is not a simple product over all terms.
  • 59. Mining the Web Chakrabarti & Ramakrishnan 59 BBN: difficulty
    • Getting a good network structure
      Takes at least quadratic time: enumeration of all pairs of features
    • Exploited only for the binary model
      For the multinomial model, the CPT sizes are prohibitive
  • 60. Mining the Web Chakrabarti & Ramakrishnan 60 Exploiting hierarchy among topics
    • Ordering between the class labels
      In data warehousing, e.g., high, medium, or low cancer-risk patients
    • Text class labels: taxonomy
      A large and complex class hierarchy that relates the class labels
    • Tree structure
      The simplest form of taxonomy
      Widely used in directory browsing; often the output of clustering algorithms
    • Inheritance: if class c0 is the parent of class c1, any training document which belongs to c1 also belongs to c0
  • 61. Mining the Web Chakrabarti & Ramakrishnan 61 Topic Hierarchies : Feature selection
    • The discriminating ability of a term is sensitive to the node (or class) in the hierarchy
    • The measure of discrimination of a term can be evaluated with respect to only the internal nodes of the hierarchy
    • E.g., 'can' may be a noisy word at the root node of Yahoo!, yet help in classifying documents under the subtree of /Science/Environment/Recycling.
  • 62. Mining the Web Chakrabarti & Ramakrishnan 62 Topic Hierarchies: Enhanced parameter estimation
    • Uniform priors are not good
    • Idea: if a parameter estimate is shaky at a node with few training documents, perhaps we can impose a strong prior from a well-trained parent to repair the estimates
    • Shrinkage: seeks to improve estimates of descendants using data from ancestors
  • 63. Mining the Web Chakrabarti & Ramakrishnan 63 Shrinkage
    • Assume the multinomial model
    • Introduce a dummy class c0 as the parent of the root c1, where all terms are equally likely
    • For a specific path c0, c1, …, cn, the 'shrunk' estimate θ̃_{cn,t} is determined by a convex linear interpolation of the MLE parameters at the ancestor nodes up through c0 (a sketch follows)
    • Estimation of the mixing weights
      A simple form of the EM algorithm
      Determined empirically, by iteratively maximizing the probability of a held-out portion Hn of the training set for node cn
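A minimal sketch of the shrunk estimate, assuming the mixing weights have already been fit by EM; the names and data layout are illustrative:

    def shrunk_estimate(lam, mle_path, t, vocab_size):
        # lam: mixing weights summing to 1; lam[0] weighs the uniform dummy
        # class c0, lam[i] the MLE at the i-th node on the root-to-leaf path.
        theta = lam[0] / vocab_size
        for weight, mle in zip(lam[1:], mle_path):
            theta += weight * mle.get(t, 0.0)  # mle: term -> MLE at that node
        return theta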
  • 64. Mining the Web Chakrabarti & Ramakrishnan 64 Shrinkage: Observations
    • Improves accuracy beyond hierarchical naïve Bayes
    • The improvement is high when data is sparse
    • Capable of utilizing many more features than naïve Bayes
  • 65. Mining the Web Chakrabarti & Ramakrishnan 65 Topic search in Hierarchy
    • By definition, all documents are relevant to the root 'topic': Pr(root|d) = 1
    • Given a test document d, find one or more of the most likely leaf nodes in the hierarchy
    • A document cannot belong to more than one path, so
      Pr(c0|d) = Σ_i Pr(ci|d), summing over the children ci of c0
  • 66. Mining the Web Chakrabarti & Ramakrishnan 66 Topic search in Hierarchy: Greedy Search strategy
    • Search starts at the root and decisions are made greedily
    • At each internal node, pick the highest-probability child class and continue
    • Drawback: early errors have a compounding effect
  • 67. Mining the Web Chakrabarti & Ramakrishnan 67 Topic search in Hierarchy: Best-first search strategy
    • For finding the m most probable leaf classes, find the weighted shortest path from the root to a leaf
    • Edge (c0, ci) is assigned a (non-negative) edge weight of −log Pr(ci|c0, d), since
      log Pr(ci|d) = log Pr(c0|d) + log Pr(ci|c0, d)
    • To make best-first search behave differently from greedy search, rescale/smoothen the probabilities (a sketch follows)
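A sketch of best-first search for the single most probable leaf, using a priority queue over accumulated −log probabilities; children() and p_child() stand in for the trained hierarchy and are assumptions:

    import heapq, math

    def best_first_leaf(root, d, children, p_child):
        # p_child(c0, ci, d) = Pr(ci | c0, d) > 0; children(c) = child classes.
        # Class labels are assumed comparable, for tie-breaking in the heap.
        frontier = [(0.0, root)]  # -log Pr(root|d) = 0
        while frontier:
            neg_logp, c = heapq.heappop(frontier)
            kids = children(c)
            if not kids:
                # With non-negative edge weights, the first leaf popped is the
                # most probable one (uniform-cost search); popping further
                # leaves would yield the m best in order.
                return c, math.exp(-neg_logp)
            for ci in kids:
                heapq.heappush(frontier,
                               (neg_logp - math.log(p_child(c, ci, d)), ci))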
  • 68. Mining the Web Chakrabarti & Ramakrishnan 68 Using best-first search on a hierarchy can improve both accuracy and speed. Because the hierarchy has four internal nodes, the second column shows the number of features for each. These were tuned so that the total number of features for both flat and best-first classification are roughly the same (so that the model complexity is comparable). Because each document belonged to exactly one leaf node, recall equals precision in this case and is called 'accuracy'.
  • 69. Mining the Web Chakrabarti & Ramakrishnan 69 The semantics of hierarchical classification
    • Asymmetry: a training document can be associated with any node, but a test document must be routed to a leaf
    • Routing test documents to internal nodes may be preferable when
      none of the children matches the document,
      many children match the document, or
      the chances of making a mistake while pushing the test document down one more level are too high
    • A research issue
  • 70. Mining the Web Chakrabarti & Ramakrishnan 70 Maximum entropy learners: Motivation
    • A Bayesian learner first models Pr(d|c) at training time, then applies Bayes rule at test time
    • Two problems with Bayesian learners
      d is represented in a high-dimensional term space, so Pr(d|c) cannot be estimated accurately from a training set of limited size
      There is no systematic way of adding synthetic features; such an addition may result in highly correlated features
  • 71. Mining the Web Chakrabarti & Ramakrishnan 71 Maximum entropy learners
    • Assume that each document has only one class label
    • Indicator functions f_j(c, d) flag the j-th condition relating class c to document d
    • The expectation of indicator f_j is
      E(f_j) = Σ_{d,c} Pr(d, c) f_j(c, d) = Σ_d Pr(d) Σ_c Pr(c|d) f_j(c, d)
    • Approximating Pr(d, c) and Pr(d) by their empirical estimates over the training pairs (d_i, c_i) gives, for each j, the constraint
      Σ_i f_j(c_i, d_i) = Σ_i Σ_c Pr(c|d_i) f_j(c, d_i)
  • 72. Mining the Web Chakrabarti & Ramakrishnan 72 Principle of Maximum Entropy
    • The constraints don't determine Pr(c|d) uniquely
    • Principle of maximum entropy: prefer the simplest model that explains the observed data
      Choose the Pr(c|d) that maximizes the entropy of Pr(c|d)
      In the event of an empty training set, we should consider all classes to be equally likely
    • Constrained optimization
      Maximize the entropy of the model distribution Pr(c|d) while obeying the constraints for all j
      Optimize by the method of Lagrange multipliers
  • 73. Mining the Web Chakrabarti & Ramakrishnan 73 Maximum Entropy solution
    • Fitting the distribution to the data involves two steps:
    1. Identify a set of indicator functions derived from the data
    2. Iteratively arrive at values for the parameters λ_j that satisfy the constraints while maximizing the entropy of the distribution being modeled
    • An equivalent optimization problem: maximize Σ_{d∈D} log Pr(c_d|d), i.e. the Lagrangian
      G(Pr(c|d), λ) = Σ_{d,c} Pr(d) Pr(c|d) log Pr(c|d) + Σ_j λ_j ( Σ_i f_j(c_i, d_i) − Σ_{d,c} Pr(d) Pr(c|d) f_j(c, d) )
  • 74. Mining the Web Chakrabarti & Ramakrishnan 74 Text Classification using Maximum Entropy Model
    • Example: pick an indicator for each (class, term) combination
      For the binary document model: f_{c',t}(c, d) = 1 if c = c' and t ∈ d, 0 otherwise
      For the multinomial document model: f_{c',t}(c, d) = n(d, t) / n(d) if c = c', 0 otherwise
    • What we gain with maximum entropy over naïve Bayes: it does not suffer from the independence assumptions
    • E.g., if the terms t1 = machine and t2 = learning are often found together in class c, then λ_{c,t1} and λ_{c,t2} would be suitably discounted (see the sketch below)
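A sketch of the indicator construction for the multinomial model as defined above; training the resulting exponential model Pr(c|d) ∝ exp(Σ_j λ_j f_j(c, d)) amounts to multinomial logistic regression. Names are illustrative:

    def multinomial_indicator(c_prime, t):
        # f_{c',t}(c, d) = n(d, t) / |d| when c == c', else 0; d is a token list.
        def f(c, d):
            return d.count(t) / len(d) if c == c_prime and d else 0.0
        return f

    # Example: a feature that fires only for class "sports" and term "goal".
    f_sports_goal = multinomial_indicator("sports", "goal")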
  • 75. Mining the Web Chakrabarti & Ramakrishnan 75 Performance of Maximum Entropy Classifier
    • Outperforms naive Bayes in accuracy, but not consistently
    • (Table of figures)
  • 76. Mining the Web Chakrabarti & Ramakrishnan 76 Discriminative classification
    • Naïve Bayes and maximum entropy classifiers "induce" linear decision boundaries between classes in the feature space
    • Discriminative classifiers directly map the feature space to class labels
      Class labels are encoded as numbers, e.g. +1 and −1 for a two-class problem
    • Two examples
      Linear least-squares regression
      Support vector machines
  • 77. Mining the Web Chakrabarti & Ramakrishnan 77 Linear least-square regression
    • There is no inherent reason for going through the modeling step, as in a Bayesian or maximum entropy classifier, to get a linear discriminant
    • Linear regression problem
      Look for some arbitrary β such that β · d_i directly predicts the label c_i of document d_i
      Minimize the square error between the observed and predicted class variable: Σ_i (β · d_i − c_i)²
      One approach: the Widrow-Hoff (WH) update rule (a sketch follows)
      Scale β to norm 1
    • Two equivalent interpretations
      The classifier is a hyperplane
      Documents are projected onto a direction
    • Performance comparable to naïve Bayes and maximum entropy
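A sketch of the Widrow-Hoff (least-mean-squares) update for the objective above, β ← β + η (c_i − β · d_i) d_i; the learning rate and epoch count are assumptions:

    def widrow_hoff(train, eta=0.1, epochs=5):
        # train: list of (sparse_vector_dict, label in {+1, -1}) pairs.
        beta = {}
        for _ in range(epochs):
            for d, c in train:
                # Prediction error on this document drives the update.
                err = c - sum(beta.get(t, 0.0) * w for t, w in d.items())
                for t, w in d.items():
                    beta[t] = beta.get(t, 0.0) + eta * err * w
        return beta  # classify a new document by the sign of beta . d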
  • 78. Mining the Web Chakrabarti & Ramakrishnan 78 Support vector machines
    • Assumption: the training and test populations are drawn from the same distribution
    • Hypothesis
      A hyperplane that is close to many training data points has a greater chance of misclassifying test instances
      A hyperplane which passes through a "no-man's land" has lower chances of misclassification
     Make a decision by thresholding α_SVM · d_i + b
     Seek an α_SVM which maximizes the distance of any training point from the hyperplane:
      minimize (1/2) α · α = (1/2) ||α||², subject to c_i (α · d_i + b) ≥ 1 for i = 1, …, n
  • 79. Mining the Web Chakrabarti & Ramakrishnan 79 Support vector machines
    • Optimal separator
      Orthogonal to the shortest line connecting the convex hulls of the two classes
      Intersects this shortest line halfway
    • Margin: the distance of any training point from the optimized hyperplane, which is at least 1/||α||
  • 80. Mining the Web Chakrabarti & Ramakrishnan 80 Illustration of the SVM optimization problem.
  • 81. Mining the Web Chakrabarti & Ramakrishnan 81 SVMs: non separable classes
    • Classes in the training data are not always separable
    • Introduce fudge (slack) variables ξ_i:
      minimize (1/2) α · α + C Σ_i ξ_i, subject to c_i (α · d_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for i = 1, …, n
    • Equivalent dual:
      maximize Σ_i λ_i − (1/2) Σ_{i,j} c_i c_j λ_i λ_j (d_i · d_j), subject to Σ_i c_i λ_i = 0 and 0 ≤ λ_i ≤ C for i = 1, …, n
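In practice one rarely solves this quadratic program by hand. A hedged sketch using scikit-learn's LinearSVC, which optimizes this soft-margin linear SVM (hinge loss with penalty C on the slack terms); the toy corpus and parameter values are illustrative only:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    docs = ["the team won the match", "stock prices fell sharply",
            "a great goal in the final", "markets rallied on earnings"]
    labels = ["sports", "finance", "sports", "finance"]

    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)            # documents as TF-IDF vectors
    clf = LinearSVC(C=1.0).fit(X, labels)  # C trades margin width vs. slack
    print(clf.predict(vec.transform(["a match winning goal"])))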
  • 82. Mining the Web Chakrabarti & Ramakrishnan 82 SVMs: Complexity
    • A quadratic optimization problem
    • Working set: refine a few λ's at a time, holding the others fixed
    • On-demand computation of inner products
    • For n documents, training time is empirically proportional to n^a with 1.7 ≤ a ≤ 2.1
    • Recent SVM packages achieve linear time by clever selection of working sets.
  • 83. Mining the Web Chakrabarti & Ramakrishnan 83 Performance
    • Comparison with other classifiers
      Amongst the most accurate classifiers for text
      Better accuracy than naive Bayes and decision-tree classifiers
    • An interesting revelation: linear SVMs suffice
      Standard text classification tasks have classes that are almost separable using a hyperplane in feature space
    • Research issues: non-linear SVMs
  • 84. Mining the Web Chakrabarti & Ramakrishnan 84 SVM training time variation as the training set size is increased, with and without sufficient memory to hold the training set. In the latter case, the memory is set to about a quarter of that needed by the training set.
  • 85. Mining the Web Chakrabarti & Ramakrishnan 85 Comparison of LSVM with previous classifiers on the Reuters data set (data taken from Dumais). (The naive Bayes classifier used binary features, so its accuracy can be improved)
  • 86. Mining the Web Chakrabarti & Ramakrishnan 86 Comparison of accuracy across three classifiers: Naive Bayes, Maximum Entropy and Linear SVM, using three data sets: 20 newsgroups, the Recreation sub-tree of the Open Directory, and University Web pages from WebKB.
  • 87. Mining the Web Chakrabarti & Ramakrishnan 87 Comparison between several classifiers using the Reuters collection.
  • 88. Mining the Web Chakrabarti & Ramakrishnan 88 Hypertext classification
    • Techniques to address hypertextual features
    • Document Object Model (DOM)
      A well-formed HTML document is a properly nested hierarchy of regions, represented by a tree-structured DOM tree
      Internal nodes are elements
      Some of the leaf nodes are segments of text
      Other nodes are hyperlinks to other Web pages
  • 89. Mining the Web Chakrabarti & Ramakrishnan 89 Representing hypertext for supervised learning
    • Paying special attention to tags can help with learning
    • Keyword-based search engines already assign heuristic weights to terms that occur in specific HTML tags
    • Example on the next slide
  • 90. Mining the Web Chakrabarti & Ramakrishnan 90 Prefixing with tags
    • Distinguish between two occurrences of the same word (e.g. "surfing") by prefixing each term with the sequence of tags that we need to follow from the DOM root to reach the term
    • A term repeated in different sections should reinforce belief in a class label
    • Using a maximum entropy classifier, accumulate evidence from different features
    • Maintain both forms of a term: plain text and prefixed text (all path prefixes); a sketch follows
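A sketch of path-prefixing with Python's standard html.parser: each token is emitted both in plain form and prefixed by the tag path from the DOM root, so the two occurrences of "surfing" become distinct features. The class name is illustrative:

    from html.parser import HTMLParser

    class PathPrefixer(HTMLParser):
        def __init__(self):
            super().__init__()
            self.stack, self.features = [], []

        def handle_starttag(self, tag, attrs):
            self.stack.append(tag)

        def handle_endtag(self, tag):
            if self.stack:
                self.stack.pop()

        def handle_data(self, data):
            prefix = ".".join(self.stack)
            for tok in data.split():
                self.features.append(tok.lower())                 # plain form
                self.features.append(prefix + "." + tok.lower())  # path-prefixed

    p = PathPrefixer()
    p.feed("<html><body><h1>Surfing</h1><p>surfing the Web</p></body></html>")
    # yields e.g. 'html.body.h1.surfing' and 'html.body.p.surfing' plus plain tokens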
  • 91. Mining the Web Chakrabarti & Ramakrishnan 91 Experiments
    • 10705 patents from the US Patent Office
      70% error with a plain-text classifier
      24% error with path-tagged terms
      17% error with path prefixes
    • 1700 resumes (with a naive Bayes classifier)
      53% error with flattened HTML
      40% error with prefix-tagged terms
  • 92. Mining the Web Chakrabarti & Ramakrishnan 92 Limitations
    • Prefix representations are ad hoc and inflexible
    • Generalizability: how to incorporate additional features, e.g. features derived from hyperlinks?
    • Relations provide a uniform way to codify hypertextual features (example on the following slides)
  • 93. Mining the Web Chakrabarti & Ramakrishnan 93 Rule Induction for relational learning
    • Inductive classifiers discover rules from a collection of relations
    • Goal: discover a set of predicate rules
    • Consider a two-class setting, with positive examples D+ and negative examples D−
    • Test instance: if the rules evaluate to true, it is a positive instance; else a negative instance
  • 94. Mining the Web Chakrabarti & Ramakrishnan 94 Rule induction with First Order Inductive Logic (FOIL)
    • A well-known rule learner; start with an empty rule set
    1. Learn a new (disjunctive) rule
    2. Add conjunctive literals to the new rule until no negative example is covered by the new rule
    3. Pick each literal so as to increase the ratio of surviving positive to negative bindings rapidly
    4. Remove positive examples covered by any rule generated thus far
  • 95. Mining the Web Chakrabarti & Ramakrishnan 95 Literals Explored
    • Q(X1, …, Xk), where Q is a relation and the Xi are variables, at least one of which must already be bound
    • Equality and comparison literals over variables and constants, such as Xi = Xj and Xi = c, where Xi, Xj are variables and c is a constant
    • not(L), where L is a literal of the above forms.
  • 96. Mining the Web Chakrabarti & Ramakrishnan 96 Analysis
    • Can learn class labels for individual pages
    • Can learn relationships between labels
      member(homePage, department)
      teaches(homePage, coursePage)
      advises(homePage, homePage)
      writes(homePage, paper)
    • Hybrid approaches
      Statistical classifier with a more complex search for literals
      Inductive learning comparing the estimated probabilities of various classes
    • Recursively labeling relations, e.g. relating a page label to the labels of neighboring pages:
      classified(A, facultyPage) :-
        links-to(A, B), classified(B, studentPage),
        links-to(A, C), classified(C, coursePage),
        links-to(A, D), classified(D, publicationsPage).