Information Extraction: A One-Hour Summary
Yunyao Li
EECS/SI 767
03/29/2006
This is the deck that I made when taking EECS/SI 767 at the University of Michigan in 2006. While it is a few years old, it is still a useful deck for people who are new to information extraction.


1. Information Extraction (Yunyao Li, EECS/SI 767, 03/29/2006)
2. The Problem
   Given a seminar announcement, fill these slots:
   • Date
   • Time: start and end
   • Location
   • Speaker (a Person)
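To make the slot-filling task concrete, here is a minimal rule-based sketch (not from the deck); the announcement text and the regular expressions are hypothetical, and in practice learned models like the HMMs and CRFs covered below replace such hand-written patterns.

```python
import re

# A hypothetical seminar announcement (not from the deck).
announcement = """Who: Dr. Jane Smith
When: Tuesday, March 29, 2006, 3:00 PM - 4:30 PM
Where: Room 1690, Beyster Building"""

# Naive patterns, one per slot; real systems learn these instead.
patterns = {
    "speaker":  r"Who:\s*(.+)",
    "date":     r"When:\s*([A-Za-z]+,\s*[A-Za-z]+ \d{1,2}, \d{4})",
    "time":     r"(\d{1,2}:\d{2} [AP]M)\s*-\s*(\d{1,2}:\d{2} [AP]M)",
    "location": r"Where:\s*(.+)",
}

for slot, pat in patterns.items():
    m = re.search(pat, announcement)
    if m:
        print(slot, "->", " - ".join(m.groups()))
```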
3. What is “Information Extraction”
   As a task: filling slots in a database from sub-segments of text.

   October 14, 2002, 4:00 a.m. PT
   For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying...
   (Courtesy of William W. Cohen)

4. What is “Information Extraction”
   Running IE over the text above fills the slots:

   NAME               TITLE     ORGANIZATION
   Bill Gates         CEO       Microsoft
   Bill Veghte        VP        Microsoft
   Richard Stallman   founder   Free Software Foundation

5.-8. What is “Information Extraction”
   Information Extraction = segmentation + classification + association + clustering
   Slides 5-8 walk the same text through these stages:
   • Segmentation (aka "named entity extraction"): find the mentions Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Bill Veghte, VP, Richard Stallman, founder, Free Software Foundation.
   • Classification: label each mention as NAME, TITLE, or ORGANIZATION.
   • Association: group mentions that belong to the same fact (e.g., Bill Veghte + VP + Microsoft).
   • Clustering: merge coreferent mentions (e.g., Bill Gates and Gates) to fill the final table of slide 4.
   (Courtesy of William W. Cohen)
9. Live Example: Seminar
10. Landscape of IE Techniques
    [Figure: five families of IE techniques, each illustrated on the sentence "Abraham Lincoln was born in Kentucky."]
    • Lexicons: test membership in a list of names (Alabama, Alaska, ..., Wisconsin, Wyoming).
    • Sliding windows / classify pre-segmented candidates: a classifier asks "which class?" for each candidate, trying alternate window sizes.
    • Boundary models: classifiers detect BEGIN and END boundaries.
    • Finite state machines: find the most likely state sequence (our focus today!).
    • Context-free grammars: find the most likely parse (NNP, V, P, NP, VP, PP, S).
    (Courtesy of William W. Cohen)
11. Markov Property
    States: S1 = rain, S2 = cloud, S3 = sun.
    The state of a system at time t+1, q_{t+1}, is conditionally independent of {q_{t-1}, q_{t-2}, ..., q_1, q_0} given q_t.
    In other words, the current state determines the probability distribution for the next state.
    [Figure: three-state transition diagram with the probabilities shown on the next slide.]
12. Markov Property
    State-transition probabilities, with A_ij = P(q_{t+1} = Sj | q_t = Si):

        A = | 0     1     0   |    (rain  -> cloud)
            | 0.5   0     0.5 |    (cloud -> rain or sun)
            | 0.67  0.33  0   |    (sun   -> rain or cloud)

    Q: given today is sunny (i.e., q1 = S3), what is the probability of "sun, cloud" under this model?
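As a quick check of the slide's question, a minimal sketch computing sequence probabilities under the transition matrix above (state indices 0 = rain, 1 = cloud, 2 = sun):

```python
# Transition matrix from the slide: rows/cols are (rain, cloud, sun).
A = [
    [0.0,  1.0,  0.0],   # rain  -> cloud with probability 1
    [0.5,  0.0,  0.5],   # cloud -> rain or sun, 1/2 each
    [0.67, 0.33, 0.0],   # sun   -> rain (2/3) or cloud (1/3)
]

def sequence_prob(states, A):
    """P(q2, ..., qn | q1) under the first-order Markov assumption."""
    p = 1.0
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

RAIN, CLOUD, SUN = 0, 1, 2
# The slide's question: given today is sunny, probability of "sun, cloud":
print(sequence_prob([SUN, CLOUD], A))  # 0.33, i.e. 1/3
```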
13. Hidden Markov Model
    [Figure: the same weather chain (S1 = rain, S2 = cloud, S3 = sun) with transition probabilities 1/2, 1/3, 2/3, 1 as before, but with the state sequence now hidden. Each state emits the observations O1...O5, with emission probabilities such as 1/10 and 9/10 for S2, 4/5 and 1/5 for S1, and 3/10 and 7/10 for S3.]
    Only the observations are visible; the state sequence must be inferred.
14. IE with Hidden Markov Model
    Given a sequence of observations:
        SI/EECS 767 is held weekly at SIN2.
    and a trained HMM with states such as "course name", "location name", and "background", find the most likely state sequence (Viterbi):
        arg max_s P(s, o)
    Any words generated by the designated "course name" state are extracted as a course name:
        Course name: SI/EECS 767
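The deck does not give the decoder itself, so here is a minimal Viterbi sketch under assumed toy parameters; the two states, transition table, and emission table below are hypothetical, not the trained model from the slide.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence, arg max_s P(s, o), for a discrete HMM."""
    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], 1e-12))
             for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, lp = max(((s0, best[t - 1][s0] + math.log(trans_p[s0][s]))
                            for s0 in states), key=lambda p: p[1])
            best[t][s] = lp + math.log(emit_p[s].get(obs[t], 1e-12))
            back[t][s] = prev
    # Trace back from the best final state.
    s = max(best[-1], key=best[-1].get)
    path = [s]
    for t in range(len(obs) - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    return list(reversed(path))

# Hypothetical two-state tagger: "course" emits course-name tokens, "bg" the rest.
states = ["course", "bg"]
start  = {"course": 0.3, "bg": 0.7}
trans  = {"course": {"course": 0.7, "bg": 0.3},
          "bg":     {"course": 0.1, "bg": 0.9}}
emit   = {"course": {"SI/EECS": 0.5, "767": 0.5},
          "bg":     {"is": 0.3, "held": 0.3, "weekly": 0.3}}

tags = viterbi(["SI/EECS", "767", "is", "held", "weekly"], states, start, trans, emit)
print(tags)  # words tagged "course" are extracted as the course name
```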
15. Named Entity Extraction [Bikel et al., 1998]
    Hidden states: Person, Org, five other name classes, and Other, plus start-of-sentence and end-of-sentence states.
16. Named Entity Extraction
    Transition probabilities: P(s_t | s_{t-1}, o_{t-1})
    Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1})
    Word generation has three cases:
    (1) generating the first word of a name class
    (2) generating the rest of the words in the name class
    (3) generating "+end+" in a name class
17. Training: Estimating Probabilities
18. Back-Off
    To cope with unknown words and insufficient training data, back off to less specific distributions:
    Transition probabilities: P(s_t | s_{t-1}), then P(s_t)
    Observation probabilities: P(o_t | s_t), then P(o_t)
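A minimal sketch of the back-off idea: fall back to a less specific estimate when the specific one has no support. The counts and the interpolation weight lam below are hypothetical; Bikel et al. compute the mixing weights from training counts rather than fixing them.

```python
from collections import Counter

# Toy counts (hypothetical); in practice these come from the training corpus.
bigram_counts  = Counter({("the", "cat"): 2})
unigram_counts = Counter({"the": 10, "cat": 3})
total = sum(unigram_counts.values())

def p_backoff(word, prev, lam=0.8):
    """P(word | prev), backing off from bigram to unigram to uniform."""
    if bigram_counts[(prev, word)] > 0:
        bigram = bigram_counts[(prev, word)] / unigram_counts[prev]
        return lam * bigram + (1 - lam) * unigram_counts[word] / total
    if unigram_counts[word] > 0:        # unseen bigram: back off to unigram
        return unigram_counts[word] / total
    return 1.0 / 50000                  # unknown word: uniform over a nominal vocab

print(p_backoff("cat", "the"))   # uses the bigram estimate
print(p_backoff("dog", "the"))   # unknown word: uniform fallback
```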
19. HMM: Experimental Results
    Trained on ~500k words of newswire text.
    Results: [table not captured in this transcript]
20. Learning HMMs for IE [Seymore, 1999]
    Considers labeled, unlabeled, and distantly-labeled data.
21. Some Issues with HMMs
    • Need to enumerate all possible observation sequences.
    • Not practical to represent multiple interacting features or long-range dependencies of the observations.
    • Very strict independence assumptions on the observations.
22. Maximum Entropy Markov Models [Lafferty, 2001]
    Rich features of the observation, e.g. for the token "Wisniewski" (part of a noun phrase, ends in "-ski"):
    • identity of word
    • ends in "-ski"
    • is capitalized
    • is part of a noun phrase
    • is in a list of city names
    • is under node X in WordNet
    • is in bold font
    • is indented
    • is in hyperlink anchor
    • ...
    Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations:
        Pr(s_t | x_t) = ...
    (Courtesy of William W. Cohen)
23. MEMM
    With the same features as above, the idea becomes: replace the generative model in the HMM with a maxent model, where the state depends on the observations and on the previous state history:
        Pr(s_t | x_t, s_{t-1}, s_{t-2}, ...) = ...
    (Courtesy of William W. Cohen)
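A minimal sketch of the MEMM's local decision: a maxent (softmax) classifier over next states, given feature functions of the observation and the previous state. The features and weights here are hypothetical, chosen only to mirror the slide's feature list.

```python
import math

def features(obs, prev_state):
    """Binary features of the current word and previous state (illustrative)."""
    return {
        "word=" + obs: 1.0,
        "ends_in_ski": 1.0 if obs.endswith("ski") else 0.0,
        "capitalized": 1.0 if obs[:1].isupper() else 0.0,
        "prev=" + prev_state: 1.0,
    }

def p_next_state(obs, prev_state, weights, states):
    """Pr(s_t | o_t, s_{t-1}) as a softmax over per-state feature scores."""
    f = features(obs, prev_state)
    scores = {s: sum(weights.get((s, k), 0.0) * v for k, v in f.items())
              for s in states}
    z = sum(math.exp(v) for v in scores.values())
    return {s: math.exp(v) / z for s, v in scores.items()}

states = ["PER", "OTH"]
weights = {("PER", "ends_in_ski"): 2.0, ("PER", "capitalized"): 1.0,
           ("OTH", "prev=OTH"): 0.5}   # hypothetical trained weights
print(p_next_state("Wisniewski", "OTH", weights, states))
```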
24. HMM vs. MEMM
    HMM (generative, models the joint distribution):
        Pr(s, o) = ∏_i Pr(s_i | s_{i-1}) Pr(o_i | s_i)
    MEMM (discriminative, models the conditional distribution):
        Pr(s | o) = ∏_i Pr(s_i | s_{i-1}, o_i)
25. Label Bias Problem with MEMMs
    Consider an MEMM in which state 1 has a single outgoing transition, to state 2:
        Pr(12 | ro) = Pr(2 | 1, o) Pr(1 | r)
        Pr(12 | ri) = Pr(2 | 1, i) Pr(1 | r)
    Because state 1 has only one successor, its next-state distribution is independent of the observation:
        Pr(2 | 1, o) = Pr(2 | 1, i) = 1
    Hence Pr(12 | ro) = Pr(12 | ri), but it should be Pr(12 | ro) < Pr(12 | ri)!
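The effect is easy to reproduce numerically: a per-state softmax must sum to 1 over that state's successors, so a state with a single successor assigns it probability 1 no matter what the observation says. A sketch with hypothetical local distributions:

```python
# Local next-state distributions of an MEMM, Pr(next | state, obs).
# State 1 has a single successor (state 2), so its distribution is forced
# to 1 regardless of the observation -- this is the label bias problem.
local = {
    (0, "r"): {1: 0.6, 3: 0.4},   # from start: "r" can enter the 1->2 branch
    (1, "o"): {2: 1.0},           # only one outgoing transition: probability 1
    (1, "i"): {2: 1.0},           # observation "i" changes nothing
}

def path_prob(path, obs):
    p = 1.0
    for state, (o, nxt) in zip(path, zip(obs, path[1:])):
        p *= local[(state, o)][nxt]
    return p

print(path_prob([0, 1, 2], "ro"))  # 0.6
print(path_prob([0, 1, 2], "ri"))  # 0.6 as well: Pr(12|ro) = Pr(12|ri)
```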
26. Solving the Label Bias Problem
    • Change the state-transition structure of the model.
      – Not always practical to change the set of states.
    • Start with a fully-connected model and let the training procedure figure out a good structure.
      – Precludes the use of prior structural knowledge, which is very valuable.
27. Random Field
    [Figure] (Courtesy of Rongkun Shen)
28. Conditional Random Field
    [Figure] (Courtesy of Rongkun Shen)
29. Conditional Distribution
    If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, is by the fundamental theorem of random fields:
        p_θ(y | x) ∝ exp( Σ_{e∈E,k} λ_k f_k(e, y|_e, x) + Σ_{v∈V,k} μ_k g_k(v, y|_v, x) )
    where:
    • x is a data sequence and y a label sequence
    • v is a vertex from the vertex set V (the set of label random variables)
    • e is an edge from the edge set E over V
    • f_k and g_k are given and fixed; g_k is a Boolean vertex feature and f_k is a Boolean edge feature
    • k indexes the features
    • θ = (λ_1, λ_2, ...; μ_1, μ_2, ...), where the λ_k and μ_k are parameters to be estimated
    • y|_e is the set of components of y defined by edge e
    • y|_v is the set of components of y defined by vertex v
30. Conditional Distribution
    CRFs use an observation-dependent normalization Z(x) for the conditional distributions:
        p_θ(y | x) = (1 / Z(x)) exp( Σ_{e∈E,k} λ_k f_k(e, y|_e, x) + Σ_{v∈V,k} μ_k g_k(v, y|_v, x) )
    Z(x) is a normalization over the data sequence x.
31. HMM-like CRF
    Use a single feature for each state-state pair (y', y) and each state-observation pair (y, x) in the data:
        f_{y',y}(e, y|_e, x) = 1 if y_u = y' and y_v = y on edge e = (u, v), else 0
        g_{y,x}(v, y|_v, x) = 1 if y_v = y and x_v = x, else 0
    The weights λ_{y',y} and μ_{y,x} are then equivalent to the logarithms of the HMM transition probability Pr(y'|y) and observation probability Pr(x|y).
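To illustrate the correspondence, a minimal sketch (with hypothetical HMM numbers) that sets each CRF weight to the log of the matching HMM probability; the unnormalized CRF score is then proportional to the HMM joint P(y, x). The initial-state term is omitted for brevity.

```python
import math

# Hypothetical HMM parameters.
trans = {("A", "A"): 0.7, ("A", "B"): 0.3, ("B", "A"): 0.4, ("B", "B"): 0.6}
emit  = {("A", "x"): 0.9, ("A", "y"): 0.1, ("B", "x"): 0.2, ("B", "y"): 0.8}

# One weight per state-state feature and per state-observation feature.
lam = {(yp, y): math.log(p) for (yp, y), p in trans.items()}  # edge weights
mu  = {(y, x): math.log(p) for (y, x), p in emit.items()}     # vertex weights

def unnormalized_score(ys, xs):
    """exp(sum of active feature weights) for a label/observation sequence."""
    s = sum(lam[(yp, y)] for yp, y in zip(ys, ys[1:]))
    s += sum(mu[(y, x)] for y, x in zip(ys, xs))
    return math.exp(s)

# With these weights the CRF score equals the HMM's transition/emission product:
print(unnormalized_score(["A", "A", "B"], ["x", "x", "y"]))
```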
32. HMM-like CRF
    For a chain structure, the conditional probability of a label sequence can be expressed in matrix form. For each position i in the observed sequence x, define the matrix
        M_i(y', y | x) = exp( Σ_k λ_k f_k(e_i, y|_{e_i} = (y', y), x) + Σ_k μ_k g_k(v_i, y|_{v_i} = y, x) )
    where e_i is the edge with labels (y_{i-1}, y_i) and v_i is the vertex with label y_i.
33. HMM-like CRF
    The normalization function is the (start, stop) entry of the product of these matrices:
        Z(x) = [ M_1(x) M_2(x) ... M_{n+1}(x) ]_{start, stop}
    The conditional probability of a label sequence y is:
        p_θ(y | x) = (1 / Z(x)) ∏_{i=1}^{n+1} M_i(y_{i-1}, y_i | x)
    where y_0 = start and y_{n+1} = stop.
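A minimal sketch of this matrix computation under hypothetical weights: build each M_i, take the (start, stop) entry of their product as Z(x), and score label sequences. Missing weight entries stand for disallowed transitions (matrix entry 0).

```python
import numpy as np

labels = ["start", "A", "B", "stop"]
idx = {y: i for i, y in enumerate(labels)}

# Hypothetical weights, including start/stop transitions.
lam = {("start", "A"): 0.0, ("start", "B"): 0.0,
       ("A", "A"): 0.5, ("A", "B"): -0.5, ("B", "A"): -0.2, ("B", "B"): 0.3,
       ("A", "stop"): 0.0, ("B", "stop"): 0.0}
mu = {("A", "x"): 1.0, ("A", "y"): -1.0, ("B", "x"): -1.0, ("B", "y"): 1.0}

def build_M(x_i):
    """M_i(y', y | x) = exp(lambda_{y',y} + mu_{y,x_i}); x_i=None for M_{n+1}."""
    M = np.zeros((len(labels), len(labels)))
    for yp in labels:
        for y in labels:
            w = lam.get((yp, y), -np.inf)
            if x_i is not None:
                w += mu.get((y, x_i), -np.inf)
            M[idx[yp], idx[y]] = np.exp(w)   # exp(-inf) = 0: disallowed
    return M

x = ["x", "y"]
Ms = [build_M(xi) for xi in x] + [build_M(None)]  # M_1, M_2, M_3 (into stop)

Z = np.linalg.multi_dot(Ms)[idx["start"], idx["stop"]]

def p_of(ys):
    """p(y | x) = (1/Z) prod_i M_i(y_{i-1}, y_i | x), with start/stop padding."""
    seq = ["start"] + ys + ["stop"]
    num = 1.0
    for i, (yp, y) in enumerate(zip(seq, seq[1:])):
        num *= Ms[i][idx[yp], idx[y]]
    return num / Z

# The probabilities of all label sequences sum to 1:
print(sum(p_of([a, b]) for a in ["A", "B"] for b in ["A", "B"]))
```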
34. Parameter Estimation
    The problem: determine the parameters θ = (λ_1, λ_2, ...; μ_1, μ_2, ...) from training data with empirical distribution p̃(x, y).
    The goal: maximize the log-likelihood objective function
        O(θ) = Σ_{x,y} p̃(x, y) log p_θ(y | x)
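For intuition about this objective, here is a brute-force sketch over a tiny label space; the "t"/"e" feature keys and the single training pair are hypothetical. The gradient comes out as empirical minus expected feature counts, which is exactly what the iterative scaling updates below chase.

```python
import math
from itertools import product

labels = ["A", "B"]

def score(ys, xs, w):
    """Unnormalized log-score: one weight per (prev, cur) and (label, obs) pair."""
    s = sum(w.get(("t", yp, y), 0.0) for yp, y in zip(ys, ys[1:]))
    s += sum(w.get(("e", y, x), 0.0) for y, x in zip(ys, xs))
    return s

def log_likelihood_and_grad(data, w):
    """Brute-force CRF objective and gradient: empirical minus expected counts."""
    ll, grad = 0.0, {}
    for xs, ys in data:
        all_ys = list(product(labels, repeat=len(xs)))
        logZ = math.log(sum(math.exp(score(c, xs, w)) for c in all_ys))
        ll += score(ys, xs, w) - logZ
        for c in all_ys:                       # subtract expected counts
            p = math.exp(score(c, xs, w) - logZ)
            for yp, y in zip(c, c[1:]):
                grad[("t", yp, y)] = grad.get(("t", yp, y), 0.0) - p
            for y, x in zip(c, xs):
                grad[("e", y, x)] = grad.get(("e", y, x), 0.0) - p
        for yp, y in zip(ys, ys[1:]):          # add empirical counts
            grad[("t", yp, y)] = grad.get(("t", yp, y), 0.0) + 1.0
        for y, x in zip(ys, xs):
            grad[("e", y, x)] = grad.get(("e", y, x), 0.0) + 1.0
    return ll, grad

data = [(["x", "y"], ("A", "B"))]   # one hypothetical training pair
ll, grad = log_likelihood_and_grad(data, {})
print(ll, grad[("e", "A", "x")])
```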
35. Parameter Estimation: Iterative Scaling Algorithms
    Update the weights as λ_k ← λ_k + δλ_k and μ_k ← μ_k + δμ_k, with appropriately chosen δλ_k and δμ_k.
    For an edge feature f_k, δλ_k is the solution of
        Ẽ[f_k] = Σ_x p̃(x) Σ_y p(y | x) Σ_i f_k(e_i, y|_{e_i}, x) exp(δλ_k T(x, y))
    where T(x, y) is the total feature count of (x, y). T(x, y) is a global property of (x, y), and efficiently computing the right-hand side of the above equation is a problem.
36. Algorithm S
    Define a slack feature
        s(x, y) = S − Σ_i Σ_k f_k(e_i, y|_{e_i}, x) − Σ_i Σ_k g_k(v_i, y|_{v_i}, x)
    so that the total feature count becomes the constant S. For each index i = 0, ..., n+1, define forward vectors
        α_0(y | x) = 1 if y = start, else 0;    α_i(x) = α_{i-1}(x) M_i(x)
    and backward vectors
        β_{n+1}(y | x) = 1 if y = stop, else 0;    β_i(x)ᵀ = M_{i+1}(x) β_{i+1}(x)ᵀ
37. Algorithm S
    The feature expectations in the update equations are computed from the forward and backward vectors:
        E[f_k] = Σ_x p̃(x) Σ_{i=1}^{n+1} Σ_{y',y} p(y_{i-1} = y', y_i = y | x) f_k(e_i, y|_{e_i} = (y', y), x)
    where the edge marginals are
        p(y_{i-1} = y', y_i = y | x) = α_{i-1}(y' | x) M_i(y', y | x) β_i(y | x) / Z(x)
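A minimal sketch of this forward-backward computation, with random positive matrices standing in for the M_i and index 0 standing in for both the start and stop labels:

```python
import numpy as np

rng = np.random.default_rng(0)
n_labels, n_pos = 3, 4                       # illustrative sizes
Ms = [rng.uniform(0.1, 1.0, (n_labels, n_labels)) for _ in range(n_pos + 1)]

# Forward vectors: alpha_0 indicates the start label (index 0 here), then
# alpha_i = alpha_{i-1} M_i.  Backward: beta_{n+1} indicates stop (index 0).
alpha = [np.eye(n_labels)[0]]
for M in Ms:
    alpha.append(alpha[-1] @ M)
beta = [np.eye(n_labels)[0]]
for M in reversed(Ms):
    beta.append(M @ beta[-1])
beta.reverse()

Z = alpha[-1] @ np.eye(n_labels)[0]          # = alpha_{n+1}[stop]

# Edge marginal p(y_{i-1}=y', y_i=y | x) = alpha_{i-1}(y') M_i(y',y) beta_i(y) / Z
i = 2
marginal = np.outer(alpha[i - 1], beta[i]) * Ms[i - 1] / Z
print(marginal.sum())                        # sums to 1 over (y', y)
```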
38. Algorithm S
    The rate of convergence is governed by the step size, which is inversely proportional to the constant S; but S is generally quite large, resulting in slow convergence.
39. Algorithm T
    Keeps track of partial T totals: it accumulates feature expectations into counters indexed by T(x).
    Uses forward-backward recurrences to compute the expectations a_{k,t} of feature f_k and b_{k,t} of feature g_k given that T(x) = t.
40. Experiments
    • Modeling the label bias problem
      – 2000 training and 500 test samples generated by an HMM
      – CRF error: 4.6%
      – MEMM error: 42%
      – The CRF solves the label bias problem.
41. Experiments
    • Modeling mixed-order sources
      – CRFs converge in 500 iterations
      – MEMMs converge in 100 iterations
42. MEMM vs. HMM
    The HMM outperforms the MEMM.
43. CRF vs. MEMM
    The CRF usually outperforms the MEMM.
44. CRF vs. HMM
    Each open square represents a data set with α < ½, and a solid square indicates a data set with α ≥ ½. When the data is mostly second order (α ≥ ½), the discriminatively trained CRF usually outperforms the HMM.
45. POS Tagging Experiments
    • First-order HMM, MEMM, and CRF models
    • Data set: Penn Treebank
    • 50%-50% train-test split
    • Uses the MEMM parameter vector as a starting point for training the corresponding CRF, to accelerate convergence.
46. Interactive IE Using CRFs
    An interactive parser updates the IE results according to the user's changes. Color coding is used to alert the user to ambiguity in the IE results.
47. Some IE Tools Available
    • MALLET (UMass)
      – statistical natural language processing
      – document classification
      – clustering
      – information extraction
      – other machine learning applications to text
    • Sample application: GeneTaggerCRF, a gene-entity tagger based on MALLET (MAchine Learning for LanguagE Toolkit). It uses conditional random fields to find genes in a text file.
48. MinorThird
    • http://minorthird.sourceforge.net/
    • "a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text"
    • Stored documents can be annotated in independent files using TextLabels (denoting, say, part-of-speech and semantic information).
49. GATE
    • http://gate.ac.uk/ie/annie.html
    • A leading toolkit for text mining
    • Distributed with an information extraction component set called ANNIE (demo)
    • Used in many research projects
      – A long list can be found on its website
      – Being integrated with IBM UIMA
50. Sunita Sarawagi's CRF Package
    • http://crf.sourceforge.net/
    • A Java implementation of conditional random fields for sequential labeling.
51. UIMA (IBM)
    • Unstructured Information Management Architecture
      – A platform for unstructured information management solutions built from combinations of semantic analysis (IE) and search components.
52. Some Interesting Websites Based on IE
    • ZoomInfo
    • CiteSeer.org (some of us use it every day!)
    • Google Local, Google Scholar
    • and many more...
