Web Information Extraction Learning based on Probabilistic Graphical Models - Presentation Transcript
Wai Lam
Joint work with Tak-Lam Wong
The Chinese University of Hong Kong
Introduction
Building advanced Web mining applications requires precise text information extraction from a large number of different Web sites.
Substantial human effort is needed for the information extraction task because of:
diverse layout formats
content variation
Wrapper Adaptation Problem (1)
Wrapper Adaptation Problem (2)
[Figure: wrapper learning produces a learned wrapper]
Product Attribute Extraction and Resolution Problem (1)
The Web contains a huge number of online stores selling millions of different kinds of products.
Product Attribute Extraction and Resolution Problem (2)
Traditional search engines typically treat every term in a Web document in a uniform fashion.
Consider the digital camera domain. Suppose a user supplies a query: “auto white balance” trying to find cameras related to the product attribute “white balance”.
Possible results: pages matching “auto ISO”, which concerns “light sensitivity”, a different product attribute from “white balance”.
Product Attribute Extraction and Resolution Problem (3)
Another related desirable task is to resolve the extracted data according to their semantics.
This can improve indexing of product Web pages and support intelligent tasks such as product search or product matching.
Our Approach
We have investigated learning frameworks for solving each of the Web information extraction tasks just presented.
Probabilistic graphical models provide a principled paradigm harnessing the uncertainty during the learning process.
A graphical model capturing information extraction knowledge for solving wrapper adaptation (ACM TOIT 2007).
A graphical model for unsupervised learning to extract and resolve product attributes (SIGIR 2008).
Motivating Example (Source: http://www.superwarehouse.com ) (Source: http://www.crayeon3.com )
Product Attribute Extraction
To extract product attributes:
In the beginning, only the attribute “resolution” is known.
Effective sensor resolution
Layout format
White balance, shutter speed
Mutual cooperation
Light sensitivity
Product Attribute Resolution
Samples of extracted text fragments from a page:
cloudy, daylight, etc.
What do they refer to?
A text fragment extracted from another page:
white balance auto, daylight, cloudy, tungsten, …
Product attribute resolution:
To cluster text fragments that refer to the same attribute into the same group
Existing Works (Supervised Learning)
The wrapper learned from a Web site cannot be applied to other sites.
Template-independent extraction (Zhu et al., SIGKDD 2007)
They cannot handle previously unseen attributes.
Existing Works (Unsupervised Learning)
Handle Web pages generated from the same template (Crescenzi et al., VLDB 2001).
Data may not be synchronized
“Aug 1993 $16.38” extracted from a page
“Paperback Feb 1985 $6.95” extracted from another page
Synchronized data extraction (Chuang et al., VLDB 2007)
Requires a field model (an HMM) for each field, as well as manually prepared training examples.
Can only be applied to Web pages that contain multiple records.
Our Framework
Unsupervised learning framework for jointly extracting and resolving product attributes from different Web sites (SIGIR 2008).
Our framework consists of a graphical model which considers page-independent content information and page-dependent layout information.
Can extract an unlimited number of product attributes (Dirichlet process prior)
The resolved product attributes can be used for other intelligent tasks such as product search (AAAI 2008).
Problem Definition (1)
A product domain
E.g., the digital camera domain
A set of reference attributes
E.g., “resolution”, “white balance”, etc.
A special element representing “not-an-attribute”
A collection of Web pages from any Web sites, each of which contains a single product
Text fragments, each taken from one of these Web pages
Problem Definition (2)
<TR> <TD> <P> <SPAN> White balance </SPAN> </P> </TD> <TD> <P> <SPAN> Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom </SPAN> </P> </TD> </TR>
[Figure: the <TR> tags are labeled as line separators]
Problem Definition (3)
Content information: White balance Auto, daylight, …
Layout information: boldface, in-table
Target information: 1 (related to attribute)
Attribute information: white balance
Problem Definition (4)
Content information: View larger image
Layout information: boldface, underline
Target information: 0 (irrelevant)
Attribute information: not-an-attribute
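As a rough illustration of how a page could be cut into text fragments with content and layout information, here is a minimal sketch using Python's standard `html.parser`. The set of line-separator tags, the mapping from tags to layout features, and the names `FragmentParser` and `extract_fragments` are assumptions made for this example; they are not taken from the talk.

```python
from html.parser import HTMLParser

# Assumption: these tags act as line separators (the slides only show <TR>).
LINE_SEPARATORS = {"tr", "br", "hr"}
# Assumption: a tiny, hypothetical mapping from enclosing tags to layout features.
LAYOUT_TAGS = {"b": "boldface", "u": "underline", "table": "in-table", "td": "in-table"}


class FragmentParser(HTMLParser):
    """Splits a page into text fragments and records simple layout features."""

    def __init__(self):
        super().__init__()
        self.depth = {}        # how many times each layout tag is currently open
        self.words = []        # text collected for the current fragment
        self.features = set()  # layout features seen in the current fragment
        self.fragments = []    # finished (content, layout features) pairs

    def _flush(self):
        # Close the current fragment, if it holds any text.
        if self.words:
            self.fragments.append((" ".join(self.words), sorted(self.features)))
        self.words, self.features = [], set()

    def handle_starttag(self, tag, attrs):
        if tag in LINE_SEPARATORS:
            self._flush()
        if tag in LAYOUT_TAGS:
            self.depth[tag] = self.depth.get(tag, 0) + 1

    def handle_endtag(self, tag):
        if tag in LINE_SEPARATORS:
            self._flush()
        if self.depth.get(tag, 0) > 0:
            self.depth[tag] -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text:
            self.words.append(text)
            self.features.update(
                LAYOUT_TAGS[t] for t, d in self.depth.items() if d > 0
            )


def extract_fragments(html):
    """Returns a list of (content, layout features) pairs for one page."""
    parser = FragmentParser()
    parser.feed(html)
    parser.close()
    parser._flush()
    return parser.fragments
```

Each returned pair holds only the observable part of a fragment (content and layout); the target and attribute information are what the model has to infer.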
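Problem Definition (5)
Attribute extraction: infer the target information of a text fragment from its content and layout information.
Attribute resolution: infer the attribute information of a text fragment from its content and layout information.
Joint attribute extraction and resolution: infer the target information and the attribute information together.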
Graphical Models (1)
A graphical model is a family of probability distributions defined in terms of a directed or undirected graph.
Nodes: random variables
Joint distribution: a product of functions defined on connected subsets of nodes
It provides general algorithms to compute marginal and conditional probabilities of interest.
It provides control over the computational complexity associated with these operations.
Graphical Models (2)
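One kind of graphical model is the directed graphical model.
Let $\mathcal{G} = (V, E)$ be a directed acyclic graph, where
$V$ are the nodes
$E$ are the edges
Denote $\mathrm{pa}(v)$ as the parents of node $v$.
Denote $\{X_v : v \in V\}$ as the collection of random variables indexed by the nodes.
The joint probability distribution is expressed as:
$p(x_V) = \prod_{v \in V} p(x_v \mid x_{\mathrm{pa}(v)})$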
Graphical Models (3)
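E.g.: $p(\theta, Z_1, \ldots, Z_N) = p(\theta) \prod_{n=1}^{N} p(Z_n \mid \theta)$
This model asserts that the variables $Z_1, \ldots, Z_N$ are conditionally independent and identically distributed given $\theta$.
[Figure: directed graph in which $\theta$ is the parent of $Z_1, Z_2, Z_3, \ldots, Z_N$]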
Graphical Models (4)
A plate is used to show the repetition of the variables.
Hence, it shows factorial and nested structures.
[Figure: plate notation, with $\theta$ pointing to $Z_n$ inside a plate labeled $N$]
Graphical Models (5)
A generative approach to clustering:
pick one of $K$ clusters from a distribution $\pi = (\pi_1, \ldots, \pi_K)$
generate a data point from a cluster-specific probability distribution.
This yields a finite mixture model:
$p(x \mid \psi, \pi) = \sum_{k=1}^{K} \pi_k \, p(x \mid \psi_k)$
where $\pi$ and $\psi = (\psi_1, \ldots, \psi_K)$ are the parameters, and where each cluster has the same parameterized family.
Data $\{x_i\}$ are assumed to be generated conditionally IID from this mixture.
Finite Mixture Model
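The two-step generative recipe for a finite mixture can be sketched in a few lines of Python. This is only an illustration: the Gaussian cluster distributions, the parameter values, and the name `sample_finite_mixture` are assumptions chosen for the example, not part of the talk.

```python
import random


def sample_finite_mixture(n, weights, means, stddev=1.0, seed=0):
    """Draws n points: pick a cluster from `weights`, then sample from it.

    Assumption: every cluster is a one-dimensional Gaussian with its own mean
    and a shared standard deviation (the "same parameterized family").
    """
    rng = random.Random(seed)
    clusters = list(range(len(weights)))
    data = []
    for _ in range(n):
        k = rng.choices(clusters, weights=weights)[0]  # pick one of K clusters
        x = rng.gauss(means[k], stddev)                # cluster-specific draw
        data.append((k, x))
    return data


samples = sample_finite_mixture(1000, [0.5, 0.3, 0.2], [-10.0, 0.0, 10.0])
```

Grouping the samples by their first element recovers the clusters, since each data point arises from exactly one mixture component.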
Graphical Models (6)
Mixture models make the assumption that each data point arises from a single mixture component.
The k-th cluster is by definition the set of data points arising from the k-th mixture component.
Finite Mixture Model
Graphical Models (7)
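Another way to express this: define an underlying measure
$G = \sum_{k=1}^{K} \pi_k \, \delta_{\psi_k}$
where $\delta_{\psi_k}$ is an atom at the k-th cluster parameter $\psi_k$ and $\pi_k$ is its mixing proportion.
And define the process of obtaining a sample from a finite mixture model as follows. For $i = 1, \ldots, N$:
$\theta_i \sim G$, $x_i \sim p(\cdot \mid \theta_i)$
Note that each $\theta_i$ is equal to one of the underlying $\psi_k$.
Indeed, the subset of $\{\theta_i\}$ that maps to $\psi_k$ is exactly the k-th cluster.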
Finite Mixture Model
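Graphical Models (8)
[Figure: finite mixture model in plate notation, with $G$ pointing to $\theta_i$ and $\theta_i$ pointing to $x_i$, inside a plate labeled $N$]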
Graphical Models (9)
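Define a countably infinite mixture model by taking $K$ to infinity and hoping that means something:
$G = \sum_{k=1}^{\infty} \pi_k \, \delta_{\psi_k}$
[Figure: Dirichlet process mixture in plate notation, with nodes $\alpha$, $\pi_k$, $G_0$, $\psi_k$, $Z_i$, and $x_i$, and a plate labeled $N$]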
Our Model (1)
Our graphical model can be regarded as an extension of the Dirichlet process mixture model.
Each mixture component
refers to a reference attribute;
consists of two distributions characterizing the content information and target information.
A Dirichlet process prior is employed.
It can handle an unlimited number of reference attributes.
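Our Model (2)
Attribute extraction: infer the target information of a text fragment from its content and layout information.
Attribute resolution: infer the attribute information of a text fragment from its content and layout information.
Joint attribute extraction and resolution: infer the target information and the attribute information together.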
Our Model (3)
[Figure: the model in plate notation, with a Dirichlet process prior (infinite mixture model), a plate over the N text fragments, and a plate over the S different Web sites]
Our Model (4)
[Figure: the plate over the N text fragments holds the target information, layout information, and content information. The Dirichlet process prior (infinite mixture model) supplies the proportion of the k-th component in the mixture, the content information parameter of the k-th component, and the target information parameter of the k-th component]
Our Model (5)
[Figure: the plate over the S different Web sites holds the site-dependent layout format]
Our Model (6)
[Figure: the Dirichlet process prior (infinite mixture model) is governed by the concentration parameter for the DP, the base distribution for content information, and the base distribution for target information]
Generation Process (1)
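A minimal sketch of a generation process of this kind is given here, under several loud assumptions: the Dirichlet process is replaced by a truncated stick-breaking prior with K components, content is a bag of tokens from a toy vocabulary, the target is a single binary value, and layout is a set of binary features whose probabilities depend on the Web site and on the target. The function name, the hyperparameter values, and these dependencies are illustrative choices, not the authors' specification.

```python
import random


def generate_fragments(n_per_site, n_sites, vocab, n_layout_features=2,
                       alpha=1.0, K=10, n_tokens=3, seed=0):
    """Generates toy text fragments from a truncated DP mixture (a sketch)."""
    rng = random.Random(seed)

    # Mixture proportions from truncated stick-breaking (stands in for the DP).
    pi, remaining = [], 1.0
    for k in range(K):
        v = 1.0 if k == K - 1 else rng.betavariate(1.0, alpha)
        pi.append(v * remaining)
        remaining *= 1.0 - v

    # Per-component parameters drawn from assumed base distributions.
    content, target = [], []
    for _ in range(K):
        g = [rng.gammavariate(1.0, 1.0) for _ in vocab]  # Dirichlet via gammas
        total = sum(g)
        content.append([x / total for x in g])           # token distribution
        target.append(rng.betavariate(1.0, 1.0))         # P(target = 1)

    fragments = []
    for s in range(n_sites):
        # Site-dependent layout format: P(feature on | target) for target 0, 1.
        layout = [[rng.random() for _ in range(n_layout_features)]
                  for _ in (0, 1)]
        for _ in range(n_per_site):
            z = rng.choices(range(K), weights=pi)[0]      # attribute component
            tokens = rng.choices(vocab, weights=content[z], k=n_tokens)
            t = int(rng.random() < target[z])             # target information
            feats = [int(rng.random() < p) for p in layout[t]]
            fragments.append({"site": s, "component": z, "tokens": tokens,
                              "target": t, "layout": feats})
    return fragments
```

In this sketch the content and target parameters are shared across sites while the layout parameters are redrawn per site, mirroring the split between page-independent content information and page-dependent layout information.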
Generation Process (2)
The joint probability of generating a particular text fragment is defined given the model parameters.
Inference: compute the posterior distribution of the unobservable variables given the observable variables and the model parameters.
Intractable
Variational Method (1)
The inference problem is transformed into an optimization problem.
The resulting variational optimization problems admit principled approximate solutions.
The solution to variational problems is often given in terms of fixed point equations that capture necessary conditions for optimality.
In contrast to other approximation methods such as MCMC, variational methods are deterministic.
Variational Method (2)
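Finding the exact posterior is intractable
Our goal: Transform the problem into an optimization problem by decomposing the log-likelihood of the observable variables into a lower bound plus D,
where D denotes the KL-divergence between a variational distribution and the true posterior
The KL-divergence must be non-negative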
Variational Method (3)
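The KL-divergence is zero if the variational distribution equals the true posterior probability.
Let the lower bound be the log-likelihood of the observable variables minus the KL-divergence.
Maximizing the lower bound w.r.t. the variational distribution therefore minimizes the KL-divergence.
Therefore, we have a lower bound on the desired log-marginal probability.
The left-hand side of the bound is the log-likelihood of the observable variables.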
Variational Method (4)
The problem becomes maximizing the lower bound.
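The argument on these slides follows the standard variational derivation, which can be written out as follows. The notation is assumed for illustration and is not taken from the slides: $O$ stands for the observable variables, $U$ for the unobservable variables, $\varphi$ for the model parameters, and $Q$ for the variational distribution.

```latex
\begin{align*}
\log P(O \mid \varphi)
  &= \mathcal{L}(Q) + D\bigl(Q(U) \,\|\, P(U \mid O, \varphi)\bigr), \\
\mathcal{L}(Q)
  &= \mathbb{E}_{Q}\bigl[\log P(U, O \mid \varphi)\bigr]
   - \mathbb{E}_{Q}\bigl[\log Q(U)\bigr], \\
D\bigl(Q(U) \,\|\, P(U \mid O, \varphi)\bigr)
  &= \mathbb{E}_{Q}\bigl[\log Q(U) - \log P(U \mid O, \varphi)\bigr] \;\ge\; 0, \\
\text{hence}\quad \log P(O \mid \varphi) &\ge \mathcal{L}(Q),
  \quad\text{with equality iff } Q(U) = P(U \mid O, \varphi).
\end{align*}
```

Because the left-hand side does not depend on $Q$, raising $\mathcal{L}(Q)$ lowers the KL-divergence by the same amount, which is why maximizing the bound is a sensible surrogate for the intractable posterior.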
Variational Method (5)
Truncated stick-breaking process (Ishwaran and James, 2001)
Replace infinity with a truncation level K
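A truncated stick-breaking draw can be sketched as follows. The function name and the example values of the concentration parameter and truncation level are assumptions for illustration.

```python
import random


def truncated_stick_breaking(alpha, K, seed=0):
    """Returns K mixture proportions that sum to one.

    Each break V_k is drawn from Beta(1, alpha); the last break is fixed to 1
    so that the K-th component takes whatever is left of the stick.
    """
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for k in range(K):
        v = 1.0 if k == K - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)  # pi_k = V_k * prod_{j<k} (1 - V_j)
        remaining *= 1.0 - v
    return weights


proportions = truncated_stick_breaking(alpha=2.0, K=20)
```

A smaller concentration parameter tends to put most of the mass on the first few components, while a larger one spreads it over more components.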
Variational Method (6)
Content information: mixture of tokens
Target information: binary
Layout information: a set of binary features
Conjugate priors
Variational Method (7)
Solve by a coordinate ascent algorithm
One important variational parameter:
How likely does a text fragment come from the k-th component?
Attribute resolution!
Variational Method (8)
Another important variational parameter:
How likely should a text fragment be extracted?
Attribute extraction!
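Once the two kinds of variational parameters are available, reading off a decision for each text fragment is straightforward. The sketch below assumes they are given as plain lists; the function name, the argmax rule, and the 0.5 threshold are hypothetical choices for illustration rather than the authors' decision rule.

```python
def resolve_and_extract(responsibilities, extract_probs, threshold=0.5):
    """Turns variational parameters into per-fragment decisions.

    responsibilities[i][k]: how likely fragment i comes from component k.
    extract_probs[i]: how likely fragment i should be extracted.
    Returns a list of (component index, extract flag) pairs.
    """
    results = []
    for resp, p in zip(responsibilities, extract_probs):
        component = max(range(len(resp)), key=resp.__getitem__)  # resolution
        results.append((component, p > threshold))               # extraction
    return results


decisions = resolve_and_extract([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]], [0.9, 0.2])
```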
Variational Method (9)
Other variational parameters:
Initialization
What should be extracted?
Make use of a very small amount of prior information about a domain.
Only a few terms about the product attributes
E.g., resolution , light sensitivity
Can be easily obtained, for example, by just highlighting the attributes of one single Web page
Initialization
EM Algorithm for Layout Parameters
Our framework can consider the page-dependent layout format of text fragments to enhance extraction.
However, the layout information of an unseen Web page is unknown and hence we cannot predefine or estimate the values of the layout parameters in advance.
E-step:
Apply the coordinate ascent algorithm until convergence to achieve the optimal conditions for all variational parameters.
M-step:
Calculate the layout parameters given the current variational parameters.
Experiments
We have conducted experiments on four different domains:
Digital camera: 85 Web pages from 41 different sites
MP3 player: 96 Web pages from 62 different sites
Camcorder: 111 Web pages from 61 different sites
Restaurant: 29 Web pages from LA-Weekly Restaurant Guide
In each domain, we conducted 10 runs of experiments.
In each run, we randomly selected a Web page and picked a few terms inside it for initialization.
The top five weighted terms in the ten largest resolved attributes in the digital camera domain:
Evaluation on Attribute Extraction
Surprisingly, in the restaurant domain, our framework achieves a performance (0.95 F1-measure) which is comparable to that of the supervised method (Muslea et al., 2001).
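For reference, the F1-measure is the harmonic mean of precision and recall. The sketch below assumes extraction is scored by comparing a set of extracted items against a set of correct items; how matches were actually counted in the experiments is not stated in the talk, and the function name is hypothetical.

```python
def f1_measure(extracted, relevant):
    """Computes F1 from a collection of extracted items and the correct ones."""
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)  # correct among extracted
    recall = true_positives / len(relevant)      # extracted among correct
    return 2 * precision * recall / (precision + recall)


score = f1_measure(["a", "b", "c"], ["b", "c", "d", "e"])
```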
Conclusions
We investigate learning frameworks automating and adapting the extraction task based on probabilistic graphical models which provide a principled paradigm harnessing the uncertainty during the learning process.
We have developed a graphical model, which employs a Dirichlet process prior, to model the generation of text fragments in Web pages for solving the tasks of product attribute extraction and resolution from different Web sites.
An unsupervised inference algorithm based on the variational method is designed.
We formally show that content and layout information can collaborate and improve both extraction and resolution performance under our model.
Questions and Answers
Variational Method (1)
Finding the exact posterior is intractable
Our goal: Transform the problem into an optimization problem
Since the KL divergence must be non-negative, the log-likelihood of the observable variables is bounded from below
Variational Method (2)
The problem becomes maximizing the lower bound
Variational Method (5)
After applying the truncated stick-breaking process:
Variational Method (6)
Solve by coordinate ascent.
Differentiate the formula and set to zero:
Unsupervised Approach
We make use of prior knowledge in the form of a list of a few terms related to product attributes.
The terms are not required to be categorized into different attributes.
For the i-th term in the list, we select the i-th component in our model and set a higher value for the component's parameter of a token if the token is equal to that term, and zero otherwise.
In particular, we set this value to 10.
Next, for these components, we set the parameters that govern how often a text fragment is related to attribute values.
This essentially means that 6 out of 10 text fragments in this component will be a text fragment related to attribute values.
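The seeding step can be sketched as below. Which model parameter receives the value 10 is not recoverable from the transcript, so this example assumes it is a per-component weight over vocabulary tokens; the function name and the toy vocabulary are likewise hypothetical.

```python
def init_content_params(seed_terms, vocab, K, seed_value=10.0):
    """Seeds the i-th component with the i-th prior term.

    Returns a K x len(vocab) table in which entry [i][j] is `seed_value` when
    vocab[j] equals the i-th seed term, and zero otherwise. Components beyond
    the number of seed terms stay at zero.
    """
    params = [[0.0] * len(vocab) for _ in range(K)]
    for i, term in enumerate(seed_terms[:K]):
        for j, token in enumerate(vocab):
            if token == term:
                params[i][j] = seed_value
    return params


table = init_content_params(["resolution", "iso"], ["iso", "zoom", "resolution"], K=4)
```

Note that the seed terms are assigned to components by list position only, so they do not need to be grouped by attribute beforehand.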