Seungwon Hwang: Entity Graph Mining and Matching

This talk introduces the problem of matching web-scale entity graphs, such as multilingual name graphs and social network graphs, to solve difficult problems such as name translation or social ID finding. While existing approaches use either textual (or phonetic) similarity or Web co-occurrences, this approach combines the strengths of the two and significantly outperforms the state of the art. We present evaluation results on real-life entity graphs.

Published in: Business, Technology

Transcript of "Seungwon Hwang: Entity Graph Mining and Matching"

  1. Entity Graph Mining and Matching
     Seung-won Hwang, Associate Professor
     Department of Computer Science and Engineering, POSTECH, Korea
     Information & Database Systems Lab
  2. Mining Human Intelligence from the Web: Click Graph
     • Language-agnostic/data-intensive: e.g., Arabic corpus?
     [Figure: click graph. Are q1 and q2 similar? Are u3 and u4 similar?]
  3. Mining at Finer Granularity: Named Entity (NE) Graph
     • Person name, place name, organization name, product name
     • Newspapers, Web sites, TV programs, …
     [Figure: example NE graph with labels Apple, MS, tenure, co-founder, Jobs, Gates, complicated, Mac]
  4. Case I: Matching names with Twitter accounts [EDBT11]
  5. Case II: Entity Translation [EMNLP10, CIKM11]
     • What are the features?
     • How are the features combined? (using translation as an application scenario)
     [Figure: NE graphs Ge=(Ve, Ee) and Gc=(Vc, Ec) built from the English corpus and the Chinese corpus]
  6. NE Translation
     • Goal
       • Translating an NE in the source language into the corresponding NE in the target language
       • Ex) “Obama” (English) → “奥巴马” (Chinese)
     • Resources: comparable corpora
     [Figure: English NEs and Chinese NEs, each with their features, drawn from Xinhua News Agency (English) and Xinhua News Agency (Chinese); the task is to find the matching NE]
  7. NE Translation Similarity Features
     • Entity Name Similarity (E): S. Wan [1], L. Haizhou [2], K. Knight [3]
       • Pronunciation similarity between named entities
       • Ex) “Obama” and “奥巴马” (pronounced Aobama)
     • Entity Context Similarity (EC): M. Diab [4], H. Ji [5], K. Yu [6]
       • Contextual word similarity between named entities
       • Ex) The president (总统) Obama (奥巴马): “As president, Obama signed economic stimulus legislation …”
     • Relationship Similarity (R): G.-w. You [7]
       • Co-occurrence similarity between pairs of named entities
       • Ex) (“Jackie Chan”, “Bill Gates”) vs. (“成龙”, “比尔·盖茨”)
  8. Motivation
     • Taxonomy table

                                Entity         Relationship
       Using entity names       E [1,2,3]      R, You [7]
       Using textual context    EC [4,5,6]     ?

       Shao [8]
     • Research questions:
       • Why is RC not used?
       • Can all four categories be combined?
  9. In this paper…
     • We propose a new NE translation similarity feature
       • Relationship Context similarity (RC)
       • Contextual word similarity between pairs of named entities
       • Ex) pair (“Barack”, “Michelle”) → Spouse
     • We propose new holistic approaches
       • Combining all of E, EC, R, and RC
     • We validate our proposed approach with extensive experiments
  10. Our Framework
      • We abstract this problem as…
        • Graph matching of two NE relationship graphs extracted from comparable corpora
        • Populate a decision matrix R, a |Ve|-by-|Vc| matrix
      [Figure: NE graphs Ge=(Ve, Ee) and Gc=(Vc, Ec) built from the English corpus and the Chinese corpus]
  11. Our Framework
      • Overview: 3 steps
        • Initialization
          • Construct NE relationship graphs
          • Build an initial pairwise similarity matrix R0
          • Use Entity (E) and Entity Context (EC) similarities
        • Iterative reinforcement
          • Build a final pairwise similarity matrix R∞
          • Use Relationship (R) and Relationship Context (RC) similarities
        • Matching
          • Find 1:1 matching from R∞
          • Build a binary hard decision matrix R*
      [Figure: example similarity matrices with rows Obama and Jackie Chan and columns 奥巴马 and 成龙, with entries such as .99, .1, and .2]
  12. Initialization
      • Constructing NE relationship graphs G = (N, E)
        • Extract NEs using an entity tagger for each document in each corpus
        • Regard NEs that appear more than δ times as nodes
        • Connect two nodes when they co-occur more than δ times
      • Initializing R0
        • Computing entity similarity matrix $S^{E}$
          • Use Edit-Distance (ED) between $e_i$ and the Pinyin representation of $c_j$
          • Ex) ED(“Obama”, “奥巴马”) = ED(“Obama”, “Aobama”)
          • $S^{E}_{ij} = 1 - \frac{ED(e_i, PY_{c_j})}{Len(e_i) + Len(PY_{c_j})}$
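The graph construction and the entity name similarity on slide 12 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, NE occurrences are counted once per document (an assumption, since the slide does not say how "appears δ times" is counted), the Pinyin string is assumed to be produced elsewhere, and the sum of lengths in the denominator is an assumed reading of the slide formula.

```python
from collections import Counter
from itertools import combinations


def build_ne_graph(docs, delta):
    """Build an NE relationship graph from tagged documents.

    docs: list of lists of NE strings, one list per document (assumed to be
    the output of some entity tagger). The same threshold delta is used for
    node frequency and for co-occurrence counts.
    """
    # Assumption: an NE is counted once per document it appears in.
    node_counts = Counter(ne for doc in docs for ne in set(doc))
    nodes = {ne for ne, count in node_counts.items() if count > delta}

    pair_counts = Counter()
    for doc in docs:
        present = sorted(set(doc) & nodes)
        pair_counts.update(combinations(present, 2))
    edges = {pair for pair, count in pair_counts.items() if count > delta}
    return nodes, edges


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def name_similarity(english_name, pinyin_name):
    """1 - ED / (sum of lengths); the sum is an assumed normalizer."""
    ed = edit_distance(english_name.lower(), pinyin_name.lower())
    return 1.0 - ed / (len(english_name) + len(pinyin_name))
```

For the slide's example, `name_similarity("Obama", "Aobama")` compares the English name with the Pinyin form of “奥巴马”, which differs by a single inserted letter.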
  13. Initialization
      • Initializing R0
        • Computing entity context similarity matrix $S^{EC}$
          • Context word: ex) “As president, Obama signed economic stimulus legislation …”
          • Context window: $CW(NE, d) = \{w_{i-l/2}, w_{i-l/2+1}, \ldots, w_i(NE), \ldots, w_{i+l/2-1}, w_{i+l/2}\}$
          • Correlation between an NE and a context word: log-odds ratio
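The context window on slide 13 can be illustrated with a short sketch. The function name and the clipping at document boundaries are assumptions; the log-odds correlation score is not shown because the slide does not give its formula.

```python
def context_window(tokens, ne_index, l):
    """Return the tokens within l/2 positions on each side of the NE.

    tokens: the tokenized document d.
    ne_index: position i of the NE in tokens.
    l: window size; the NE itself is included in the result.
    Assumption: the window is clipped at the start and end of the document.
    """
    half = l // 2
    start = max(0, ne_index - half)
    end = min(len(tokens), ne_index + half + 1)
    return tokens[start:end]
```

With the slide's sentence tokenized as `["As", "president", ",", "Obama", "signed", "economic", "stimulus", "legislation"]`, a window of size 4 around "Obama" covers the tokens from "president" to "economic".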
  14. Initialization
      • Initializing R0
        • Computing entity context similarity matrix $S^{EC}$
          • Projected Context Association Vector
      [Figure: context association vectors for Obama (e.g., President 0.9) and 奥巴马 (e.g., 总统 0.85), projected through a bilingual dictionary with entries such as (President, 总统); other labels shown: USA, 美國]
  15. Initialization
      • Initializing R0
        • Computing entity context similarity matrix $S^{EC}$
          • Context similarity between $e_i$ and $c_j$
          • Compute cosine similarity between the two vectors: $S^{EC}_{ij} = \frac{CA_{e_i} \cdot CA_{c_j}}{\|CA_{e_i}\| \, \|CA_{c_j}\|}$
        • Merging $S^{E}$ and $S^{EC}$
          • Min-max normalization to the range [0, 1]
          • Merge $S^{E}_{ij}$ and $S^{EC}_{ij}$ into $R_{ij}$
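The projection, cosine similarity, and merging steps on slides 14 and 15 might look like the sketch below. All names are hypothetical. Two details are assumptions rather than something the slides state: when several English words map to the same Chinese word the larger score is kept, and the two normalized matrices are merged by an element-wise sum.

```python
import math


def project(english_vector, dictionary):
    """Map an English context vector {word: score} into Chinese word space.

    dictionary: {english_word: [chinese_word, ...]} from a bilingual lexicon.
    Assumption: colliding translations keep the larger score.
    """
    projected = {}
    for word, score in english_vector.items():
        for translation in dictionary.get(word, []):
            projected[translation] = max(projected.get(translation, 0.0), score)
    return projected


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(score * v[word] for word, score in u.items() if word in v)
    norm_u = math.sqrt(sum(s * s for s in u.values()))
    norm_v = math.sqrt(sum(s * s for s in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def min_max(matrix):
    """Rescale all entries of a matrix (list of rows) to [0, 1]."""
    values = [x for row in matrix for x in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        return [[0.0 for _ in row] for row in matrix]
    return [[(x - lo) / (hi - lo) for x in row] for row in matrix]


def merge(s_e, s_ec):
    """Assumption: element-wise sum of the two normalized matrices."""
    a, b = min_max(s_e), min_max(s_ec)
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]
```

Using the slide's figure values, projecting `{"president": 0.9}` through `{"president": ["总统"]}` and comparing with `{"总统": 0.85}` gives a cosine close to 1, since both vectors point along the same single dimension.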
  16. Reinforcement
      • Intuition
        • Two NEs with a strong relationship
          • Co-occur frequently → have an edge
          • Share similar context → have similar relationship context
      [Figure: NE pair X with its context in the English NE graph and NE pair Y with its context in the Chinese NE graph]
      1. Align neighbors using relationship (R) and relationship context (RC) similarity
      2. Update the similarity score
  17. Reinforcement
      • Iterative approach
        $R^{t+1}_{ij} = \lambda \sum_{(u,v)_k \in B^{t}(i,j,\theta)} \frac{R^{t}_{uv} \left(S^{RC}_{iu,jv}\right)^{\Upsilon}}{2^{k}} + (1-\lambda) R^{0}_{ij}$
        • First term: relationship-based similarity (R & RC)
        • Second term: entity-based similarity (E & EC)
        • $B^{t}(i,j,\theta)$: ordered set of aligned neighbor pairs of (i, j) at iteration t
        • $R^{t}_{uv}$: relationship (R) similarity of i's neighbor u and j's neighbor v
        • $S^{RC}_{iu,jv}$: relationship context (RC) similarity between relation pairs (i, u) and (j, v)
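One reinforcement iteration from slide 17 could be sketched as below. The slide formula is garbled in this transcript, so several points are assumptions and are marked as such in the comments: the RC similarity enters as a factor raised to a weight `gamma`, neighbor pairs are aligned greedily one-to-one above a threshold `theta`, and the k-th aligned pair is damped by 2^k. The names `aligned_neighbors`, `reinforce_step`, and `s_rc` are hypothetical.

```python
def aligned_neighbors(R, nbrs_i, nbrs_j, theta):
    """Ordered list of aligned neighbor pairs (u, v), best score first.

    Assumption: pairs with R[u][v] >= theta are aligned greedily so that
    each neighbor is used at most once.
    """
    candidates = sorted(
        ((R[u][v], u, v) for u in nbrs_i for v in nbrs_j if R[u][v] >= theta),
        reverse=True,
    )
    used_u, used_v, aligned = set(), set(), []
    for _, u, v in candidates:
        if u not in used_u and v not in used_v:
            aligned.append((u, v))
            used_u.add(u)
            used_v.add(v)
    return aligned


def reinforce_step(R_t, R_0, nbrs_e, nbrs_c, s_rc, lam, gamma, theta):
    """Compute R^{t+1} from R^t and the initial matrix R^0.

    nbrs_e[i] / nbrs_c[j]: neighbor index lists in the two NE graphs.
    s_rc(i, u, j, v): RC similarity between relation pairs (i, u) and (j, v).
    Assumption: RC enters as s_rc ** gamma, so gamma = 0 ignores RC.
    """
    R_next = [row[:] for row in R_0]
    for i in range(len(R_0)):
        for j in range(len(R_0[0])):
            total = 0.0
            pairs = aligned_neighbors(R_t, nbrs_e[i], nbrs_c[j], theta)
            for k, (u, v) in enumerate(pairs, start=1):
                total += R_t[u][v] * s_rc(i, u, j, v) ** gamma / 2 ** k
            R_next[i][j] = lam * total + (1 - lam) * R_0[i][j]
    return R_next
```

In practice the step would be repeated, feeding each result back in as `R_t`, until the matrix stops changing noticeably; the stopping rule is not given on the slide.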
  18. Matching
      • Finding 1:1 matching using a greedy algorithm
      • Steps
        1. Find the translation pair with the highest final similarity score in R∞
        2. Select the pair and remove the corresponding row and column from R∞
        3. Repeat 1 and 2 until the similarity score < threshold
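The greedy 1:1 matching steps on slide 18 can be sketched as follows. Sorting all cells once and skipping used rows and columns gives the same result as repeatedly taking the maximum and deleting its row and column. The function name is hypothetical, and ties are broken by row and column order, which the slide does not specify.

```python
def greedy_match(R, threshold):
    """Greedy 1:1 matching over a similarity matrix R (list of rows).

    Returns a list of (row, column) index pairs, best score first.
    """
    # Visit cells from highest to lowest score; the sort is stable, so ties
    # keep row-major order (an assumption about tie-breaking).
    cells = sorted(
        ((score, i, j) for i, row in enumerate(R) for j, score in enumerate(row)),
        key=lambda cell: -cell[0],
    )
    used_rows, used_cols, matches = set(), set(), []
    for score, i, j in cells:
        if score < threshold:
            break  # every remaining cell is below the threshold
        if i not in used_rows and j not in used_cols:
            matches.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return matches
```

The binary decision matrix R* from slide 11 would then have a 1 at each returned `(row, column)` pair and 0 elsewhere.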
  19. Experiments
      • Dataset
        • English Gigaword Corpus
          • Xinhua News Agency 2008.01~2008.12
          • 100,746 news documents
        • Chinese Gigaword Corpus
          • Xinhua News Agency 2008.01~2008.12
          • 88,029 news documents
      • Approaches
        • EC: consider Entity Context similarity feature only
        • E: consider Entity name similarity feature only
        • Shao (E+EC): combine Entity name & Entity Context similarities
        • You (E+R): combine Entity name & Relationship similarities
        • Ours
          • E+EC+R (when ϒ = 0)
          • E+EC+R+RC
      • Measure
        • Precision, Recall, and F1-score
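The measures named on slide 19 can be computed from a set of predicted translation pairs and a set of gold pairs, as in this small sketch. The function name and the representation of pairs as tuples are assumptions; the slides do not describe the evaluation code.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 over sets of (source NE, target NE) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A matcher that stops early (a high threshold) tends to trade recall for precision, which is why the threshold is tuned by cross-validation on slide 20.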
  20. Experiments
      • Effectiveness of overall framework
        • 500 person named entities
        • Set λ = 0.15
        • 5-fold cross-validation for threshold parameter learning
      • Other types of NE (100 location named entities)
  21. Directions
      • Graph matching
      • Graph cleansing [VLDB11]
      • Scalable entity search
      [Figure: example entity group "US Presidents" with name variants Bill Clinton, William J Clinton, George W. Bush, George H.W. Bush, Dubya]
  22. Thanks
      • Questions?
      Visit: www.postech.ac.kr/~swhwang for these papers