Information & Database Systems Lab




                                     Entity Graph Mining and Matching
                                                                          Seung-won Hwang
                                                                         Associate Professor
                                             Department of Computer Science and Engineering
                                                                            POSTECH, Korea
Mining Human Intelligence from the Web: Click Graph
                                      Language-agnostic/data-intensive: e.g., arabic Corpus?
                                                                  Are q1 and q2 similar?




                                                                  Are u3 and u4 similar?
Mining at Finer Granularity: Named Entity (NE) Graph
                                      Person name, Place name, Organization name, Product name
                                        Newspapers, Web sites, TV programs, …
                                      [Figure: example NE graph with the labels Apple, MS, jobs, gates, Mac, tenure, Co-founder, complicated]
Case I: Matching names with twitter accounts [EDBT11]
Case II: Entity Translation [EMNLP10,CIKM11]
                                      What are the features?
                                      How are the features combined?
                                     (using translation as an application scenario)
                                      [Figure: two NE relationship graphs, Ge=(Ve, Ee) built from the English corpus and Gc=(Vc, Ec) built from the Chinese corpus]
NE Translation
                                      Goal
                                        Map an NE in the source language to its counterpart NE in the target language
                                        Ex) “Obama” (English) → “奥巴马” (Chinese)
                                      Resources: comparable corpora
                                        [Figure: NEs and their features extracted from Xinhua News Agency (English) and Xinhua News Agency (Chinese); the task is to find the matching (NE_E, NE_C) pairs]
NE Translation Similarity Features
                                      Entity Name Similarity (E): S.Wan [1], L. Haizhou [2], K. Knight [3]
                                          Pronunciation similarity between named entities
                                          Ex) “Obama” and “奥巴马” (pronounced Aobama)
                                      Entity Context Similarity (EC): M. Diab [4], H. Ji [5], K. Yu [6]
                                          Contextual word similarity between named entities
                                          Ex) The president (总统) Obama (奥巴马)
                                              “As president, Obama signed economic stimulus legislation …”



                                      Relationship Similarity (R): G.-w.You [7]
                                          Co-occurrence similarity between pairs of named entities
                                          Ex) (“Jackie Chan”, “Bill Gates” ) vs. (“成龙”, “比尔·盖茨 ”)
Motivation
                                      Taxonomy Table

                                                                      Entity        Relationship
                                        Using Entity Names            E [1,2,3]     R (You [7])
                                        Using Textual Context         EC [4,5,6]    ?
                                        E and EC combined: Shao [8]




                                     Research questions:
                                        Why is RC not used?
                                        Can all four categories be combined?
In this paper…
                                      We propose a new NE translation similarity feature
                                         Relationship Context similarity (RC)
                                            Contextual word similarity between pairs of named entities
                                            Ex) pair (“Barack”, “Michelle”) → Spouse
                                      We propose new holistic approaches
                                            Combining all E, EC, R, and RC




                                      We validate our proposed approach using extensive
                                       experiments
Our Framework
                                      We abstract this problem as…
                                      Graph Matching of two NE relationship graphs extracted from
                                       comparable corpora
                                      Populate a decision matrix R, a |Ve|-by-|Vc| matrix
                                      [Figure: NE relationship graphs Ge=(Ve, Ee) from the English corpus and Gc=(Vc, Ec) from the Chinese corpus]
Our Framework
                                      Overview – 3 Steps
                                        Initialization
                                            Construct NE relationship graphs
                                            Build an initial pairwise similarity matrix R0
                                            Use Entity (E) and Entity Context (EC) similarities
                                        Iterative reinforcement
                                            Build a final pairwise similarity matrix R∞
                                            Use Relationship (R) and Relationship Context (RC) similarities
                                        Matching
                                            Find 1:1 matching from R∞
                                            Build a binary hard decision matrix R*
                                        [Figure: two example similarity matrices with rows Obama, Jackie chan and columns including 奥巴马, 成龙. Beside the R0 step: Obama row .99, .1, .2 and Jackie chan .1. Beside the matching step: Obama row .99, .1, .2 and Jackie chan .99.]
Initialization
                                      Constructing NE relationship graphs G = (N, E)
                                         Extract NEs using entity tagger for each document in each corpus
                                         Regard NEs that appear more than δ times as Nodes
                                         Connect two Nodes when they co-occur more than δ times
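
The graph construction steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the entity tagger has already produced one list of NEs per document, that "appears" and "co-occur" are counted once per document, and that the same δ is applied to nodes and edges. The name `build_ne_graph` and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_ne_graph(docs, delta):
    """Build an NE relationship graph from per-document NE lists.

    docs  : list of lists; each inner list holds the NEs tagged in one document
    delta : frequency threshold (the slide's δ), used for nodes and edges alike
    """
    # Count in how many documents each NE appears (assumed counting unit).
    node_counts = Counter(ne for doc in docs for ne in set(doc))
    nodes = {ne for ne, count in node_counts.items() if count > delta}

    # Count document-level co-occurrences between the surviving nodes.
    edge_counts = Counter()
    for doc in docs:
        present = sorted(set(doc) & nodes)
        edge_counts.update(combinations(present, 2))
    edges = {pair for pair, count in edge_counts.items() if count > delta}
    return nodes, edges


# Hypothetical tagger output for four documents.
docs = [
    ["Obama", "Michelle", "Clinton"],
    ["Obama", "Michelle"],
    ["Obama", "Clinton"],
    ["Obama", "Michelle", "Gates"],
]
nodes, edges = build_ne_graph(docs, delta=1)
```

Sorting the NEs inside each document keeps every undirected edge under a single key, so ("Michelle", "Obama") and ("Obama", "Michelle") are never counted separately.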
                                      Initializing R0
                                         Computing entity similarity matrix SE
                                             Use Edit-Distance (ED) between ‘ei’ and Pinyin representation of ‘cj’
                                             Ex) ED(“Obama”, “奥巴马”) = ED(“Obama”, “Aobama”)


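                                             S^{E}_{ij} = 1 - \frac{ED(e_i, PYC_j)}{Len(e_i) \circ Len(PYC_j)}
                                             (∘ marks the operator between the two lengths, which is missing from the slide text)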
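
A minimal sketch of the name similarity, under stated assumptions: the operator joining the two lengths in the denominator is not legible in the slide, so this example assumes their sum, which keeps the score in [0, 1]. The Pinyin string is taken as given input, and the lowercasing step and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def name_similarity(english, pinyin):
    """S^E under the ASSUMED denominator Len(e) + Len(PYC)."""
    e, p = english.lower(), pinyin.lower()  # case folding is an assumption
    return 1.0 - edit_distance(e, p) / (len(e) + len(p))


# "Aobama" is the Pinyin form of 奥巴马 given in the slides.
score = name_similarity("Obama", "Aobama")
```

Under the assumed sum, the pair from the slides differs by a single inserted letter over 11 characters in total, so the score stays close to 1.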
Initialization
                                      Initializing R0
                                         Computing entity context similarity matrix SEC
                                             Context word
                                               ex) “As president, Obama signed economic stimulus legislation …”




                                             Context window

                                               CW(NE, d) = \{ w_{i-l/2}, w_{i-l/2+1}, \ldots, w_i(NE), \ldots, w_{i+l/2-1}, w_{i+l/2} \}




                                             Correlation between an NE and a context word: log-odds ratio
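
The context window can be illustrated with a short sketch. It assumes a pre-tokenized document with punctuation already stripped, a single-token NE, and a window of l/2 words on each side. The NE itself is left out of the returned list because only the surrounding words serve as context; that choice and the name `context_window` are assumptions.

```python
def context_window(tokens, ne_index, l):
    """Return the words within l/2 positions on each side of the NE at ne_index."""
    half = l // 2
    start = max(0, ne_index - half)               # clip at the document start
    end = min(len(tokens), ne_index + half + 1)   # clip at the document end
    # Skip the NE itself; keep only the surrounding context words.
    return tokens[start:ne_index] + tokens[ne_index + 1:end]


tokens = "As president Obama signed economic stimulus legislation".split()
window = context_window(tokens, tokens.index("Obama"), l=4)
```

With l = 4 the window holds two words before "Obama" and two after it.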
Initialization
                                      Initializing R0
                                         Computing entity context similarity matrix SEC
                                             Projected Context Association Vector
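                                               [Figure: context association vectors. For Obama, the context word President scores 0.9; for 奥巴马, the context word 总统 scores 0.85. A bilingual dictionary with entries such as (President, 总统) and (USA, 美國) projects one vector into the other language.]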
Initialization
                                      Initializing R0
                                         Computing entity context similarity matrix SEC
                                             Context Similarity between ‘ei’ and ‘cj’
                                             Compute cosine similarity between two vectors
                                             S^{EC}_{ij} = \frac{CA_{e_i} \cdot CA_{c_j}}{\lVert CA_{e_i} \rVert \, \lVert CA_{c_j} \rVert}


                                         Merging SE and SEC
                                             Min-max normalization to the range [0, 1]
                                             Merge

                                             R^{0}_{ij} = S^{E}_{ij} \circ S^{EC}_{ij}
                                             (∘ marks the combining operator, which is missing from the slide text)
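
The context similarity and merging steps can be sketched as below, under stated assumptions. Context association vectors are stored as sparse dicts, the bilingual dictionary is a plain word-to-word mapping, and the merge operator is passed in by the caller because it is not legible in the slide; the product used in the example is only an assumption. All function names and all numbers other than the President 0.9 and 总统 0.85 scores are hypothetical.

```python
import math


def project(vector, dictionary):
    """Map an English context vector onto Chinese words via a bilingual dictionary."""
    projected = {}
    for word, score in vector.items():
        if word in dictionary:  # words without a dictionary entry are dropped
            target = dictionary[word]
            projected[target] = projected.get(target, 0.0) + score
    return projected


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(score * v.get(word, 0.0) for word, score in u.items())
    norm_u = math.sqrt(sum(s * s for s in u.values()))
    norm_v = math.sqrt(sum(s * s for s in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def min_max(matrix):
    """Rescale all entries of a matrix (list of rows) into [0, 1]."""
    values = [x for row in matrix for x in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        return [[0.0 for _ in row] for row in matrix]
    return [[(x - lo) / (hi - lo) for x in row] for row in matrix]


def merge(s_e, s_ec, combine):
    """Combine two normalized matrices entry by entry with a caller-chosen operator."""
    return [[combine(a, b) for a, b in zip(row_e, row_ec)]
            for row_e, row_ec in zip(s_e, s_ec)]


dictionary = {"president": "总统", "usa": "美國"}
ca_obama = {"president": 0.9, "usa": 0.4, "signed": 0.2}
ca_aobama = {"总统": 0.85, "美國": 0.3}
similarity = cosine(project(ca_obama, dictionary), ca_aobama)

# Toy raw score matrices; the product merge is an ASSUMED operator.
s_e = min_max([[2.0, 4.0], [6.0, 10.0]])
s_ec = min_max([[0.0, 1.0], [3.0, 4.0]])
r0 = merge(s_e, s_ec, lambda a, b: a * b)
```

Passing the operator as `combine` keeps the sketch usable whichever merge the original formula specifies.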
Reinforcement
                                      Intuition
                                         Two NEs with a strong relationship
                                            Co-occur frequently → have an edge
                                            Share similar context → have similar relationship context
                                           [Figure: an English NE graph and a Chinese NE graph; a pair of NEs X in the English graph and a pair of NEs Y in the Chinese graph, each shown with its surrounding context]
                                           1. Align neighbors
                                               using relationship (R) and relationship context (RC) similarity
                                           2. Update the similarity score
Reinforcement
                                      Iterative Approach

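                                         R^{t+1}_{ij} = \lambda \sum_{(u,v)_k \in B^{t}(i,j,\theta)} \frac{R^{t}_{uv} \, (S^{RC}_{(i,u),(j,v)})^{\gamma}}{2^{k}} + (1 - \lambda) R^{0}_{ij}

                                         First term: Relationship-based Similarity (R & RC)
                                         Second term: Entity-based Similarity (E & EC)
                                         S^{RC}_{(i,u),(j,v)}: Relationship Context (RC) Similarity between relation pair (i, u) and (j, v)
                                         B^{t}(i,j,\theta): ordered set of aligned neighbor pairs of (i, j) at iteration t
                                         R^{t}_{uv}: Relationship (R) Similarity of i’s neighbor u and j’s neighbor v
                                         (Greek symbols are missing from the slide text: λ is placed to match the λ set in the experiments, while θ and the exponent γ are assumed placements)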
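
One reinforcement pass can be sketched as follows. Several details are assumptions rather than the authors' method: neighbors are aligned greedily and 1:1 by the combined score R^t_uv · (S^RC)^γ, pairs scoring below a threshold θ are discarded, the rank k starts at 1, and the RC similarity is supplied as a caller function `s_rc`. The toy graph and its scores are hypothetical; only λ = 0.15 comes from the slides.

```python
def align_neighbors(i, j, r_t, s_rc, nbrs_e, nbrs_c, theta, gamma):
    """Greedy 1:1 alignment of i's neighbors with j's neighbors, best pair first."""
    candidates = []
    for u in nbrs_e[i]:
        for v in nbrs_c[j]:
            score = r_t[u][v] * s_rc(i, u, j, v) ** gamma
            if score >= theta:  # assumed role of the threshold
                candidates.append((score, u, v))
    candidates.sort(key=lambda c: -c[0])
    used_u, used_v, aligned = set(), set(), []
    for score, u, v in candidates:
        if u not in used_u and v not in used_v:
            aligned.append(score)
            used_u.add(u)
            used_v.add(v)
    return aligned  # scores in descending order; rank k is position + 1


def reinforce_once(r_t, r_0, s_rc, nbrs_e, nbrs_c, lam, theta, gamma):
    """One pass of R^{t+1}_ij = lam * sum_k score_k / 2^k + (1 - lam) * R^0_ij."""
    r_next = {}
    for i in r_t:
        r_next[i] = {}
        for j in r_t[i]:
            aligned = align_neighbors(i, j, r_t, s_rc, nbrs_e, nbrs_c, theta, gamma)
            relational = sum(s / 2 ** k for k, s in enumerate(aligned, start=1))
            r_next[i][j] = lam * relational + (1 - lam) * r_0[i][j]
    return r_next


# Hypothetical two-node graphs in each language.
nbrs_e = {"Obama": ["Michelle"], "Michelle": ["Obama"]}
nbrs_c = {"奥巴马": ["米歇尔"], "米歇尔": ["奥巴马"]}
r_0 = {
    "Obama": {"奥巴马": 0.9, "米歇尔": 0.1},
    "Michelle": {"奥巴马": 0.1, "米歇尔": 0.3},
}


def constant_rc(i, u, j, v):
    """Placeholder RC similarity that treats every relation pair as fully similar."""
    return 1.0


r_1 = reinforce_once(r_0, r_0, constant_rc, nbrs_e, nbrs_c,
                     lam=0.15, theta=0.0, gamma=1.0)
```

In this toy run the weak initial score for (Michelle, 米歇尔) rises because their neighbors Obama and 奥巴马 already match well, while the score for the mismatched pair (Michelle, 奥巴马) falls.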
Matching
                                      Finding 1:1 matching using greedy algorithm

                                      Steps
                                       1.    Find a translation pair with the highest final similarity score
                                       2.    Select the pair and remove the corresponding row and column from R∞
                                       3.    Repeat 1. and 2. until the similarity score < threshold
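
The three steps above can be sketched as a single pass over the cells of R∞ sorted by score, which is equivalent to repeatedly taking the best remaining cell and deleting its row and column. The nested-dict matrix layout, the function name, and the example scores and threshold are assumptions for illustration.

```python
def greedy_matching(r, threshold):
    """Pick 1:1 translation pairs from a nested-dict similarity matrix r."""
    # All (score, row, column) cells, best score first.
    cells = sorted(((s, i, j) for i, row in r.items() for j, s in row.items()),
                   key=lambda c: -c[0])
    used_rows, used_cols, pairs = set(), set(), []
    for score, i, j in cells:
        if score < threshold:
            break  # every remaining cell scores lower still
        if i in used_rows or j in used_cols:
            continue  # this row or column was already removed
        pairs.append((i, j))
        used_rows.add(i)
        used_cols.add(j)
    return pairs


# Hypothetical final similarity matrix.
r_final = {
    "Obama": {"奥巴马": 0.99, "成龙": 0.2},
    "Jackie Chan": {"奥巴马": 0.1, "成龙": 0.99},
}
pairs = greedy_matching(r_final, threshold=0.5)
```

Greedy selection is simple and fast but does not guarantee the assignment with the highest total score.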




Experiments
                                      Dataset
                                        English Gigaword Corpus
                                            Xinhua News Agency 2008.01~2008.12
                                            100,746 news documents
                                        Chinese Gigaword Corpus
                                            Xinhua News Agency 2008.01~2008.12
                                            88,029 news documents


                                      Approaches
                                          EC                              : consider Entity context similarity feature only
                                          E                               : consider Entity name similarity feature only
                                          Shao (E+EC)                     : combine Entity name & Entity Context similarities
                                          You (E+R)                       : combine Entity name & Relationship similarities
                                          Ours
                                            E+EC+R (when γ = 0)
                                            E+EC+R+RC


                                      Measure
                                        Precision, Recall, and F1-score
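
For reference, the three measures can be computed over sets of translation pairs as in the sketch below. The set-of-tuples representation, the function name, and the example pairs are assumptions; they are not results from the experiments.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted translation pairs against gold pairs (both sets of tuples)."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical system output and gold standard.
predicted = {("Obama", "奥巴马"), ("Jackie Chan", "成龙"), ("Bill Gates", "克林顿")}
gold = {("Obama", "奥巴马"), ("Jackie Chan", "成龙"),
        ("Bill Gates", "比尔·盖茨"), ("Clinton", "克林顿")}
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

Because matching stops at a similarity threshold, raising the threshold tends to trade recall for precision.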
Experiments
                                      Effectiveness of overall framework
                                         500 person named entities
                                         Set λ = 0.15
                                         5-fold cross-validation for threshold parameter learning
                                      Other types of NE (100 location named entities)
Directions
                                      Graph matching
                                      Graph cleansing [VLDB11]
                                      Scalable entity search
                                                                  US Presidents
                                                                  Bill Clinton
                                                                  William J Clinton
                                                                  George W. Bush
                                                                  George H.W. Bush
                                                                  Dubya
Thanks
                                      Questions?
                                     Visit: www.postech.ac.kr/~swhwang for these papers

/:Call Girls In Jaypee Siddharth - 5 Star Hotel New Delhi ➥9990211544 Top Esc...
 
Call Girls in Gomti Nagar - 7388211116 - With room Service
Call Girls in Gomti Nagar - 7388211116  - With room ServiceCall Girls in Gomti Nagar - 7388211116  - With room Service
Call Girls in Gomti Nagar - 7388211116 - With room Service
 
VIP Kolkata Call Girl Howrah 👉 8250192130 Available With Room
VIP Kolkata Call Girl Howrah 👉 8250192130  Available With RoomVIP Kolkata Call Girl Howrah 👉 8250192130  Available With Room
VIP Kolkata Call Girl Howrah 👉 8250192130 Available With Room
 
Insurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageInsurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usage
 
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
 
Eni 2024 1Q Results - 24.04.24 business.
Eni 2024 1Q Results - 24.04.24 business.Eni 2024 1Q Results - 24.04.24 business.
Eni 2024 1Q Results - 24.04.24 business.
 
2024 Numerator Consumer Study of Cannabis Usage
2024 Numerator Consumer Study of Cannabis Usage2024 Numerator Consumer Study of Cannabis Usage
2024 Numerator Consumer Study of Cannabis Usage
 

Seungwon Hwang: Entity Graph Mining and Matching

  • 1. Entity Graph Mining and Matching. Seung-won Hwang, Associate Professor, Department of Computer Science and Engineering, POSTECH, Korea (Information & Database Systems Lab)
  • 2. Mining Human Intelligence from the Web: Click Graph
      • Language-agnostic and data-intensive: e.g., an Arabic corpus?
      • Figure: click graph. Are q1 and q2 similar? Are u3 and u4 similar?
  • 3. Mining at Finer Granularity: Named Entity (NE) Graph
      • Person name, place name, organization name, product name
      • Newspapers, Web sites, TV programs, …
      • Figure: example NE graph with the labels Apple, MS, tenure, Co-founder, jobs, gates, complicated, Mac.
  • 4. Case I: Matching names with Twitter accounts [EDBT11]
  • 5. Case II: Entity Translation [EMNLP10, CIKM11]
      • What are the features?
      • How are the features combined? (using translation as an application scenario)
      • Figure: an English NE graph Ge = (Ve, Ee) built from the English corpus and a Chinese NE graph Gc = (Vc, Ec) built from the Chinese corpus.
  • 6. NE Translation
      • Goal: for an NE in the source language, find the corresponding NE in the target language
      • Ex) “Obama” (English) → “奥巴马” (Chinese)
      • Resources: comparable corpora
      • Figure: English NEs (NE_E) and their features from Xinhua News Agency (English) are matched against Chinese NEs (NE_C) and their features from Xinhua News Agency (Chinese).
  • 7. NE Translation Similarity Features
      • Entity Name Similarity (E): S. Wan [1], L. Haizhou [2], K. Knight [3]
          • Pronunciation similarity between named entities
          • Ex) “Obama” and “奥巴马” (pronounced “Aobama”)
      • Entity Context Similarity (EC): M. Diab [4], H. Ji [5], K. Yu [6]
          • Contextual word similarity between named entities
          • Ex) “president” (总统) in the context of “Obama” (奥巴马): “As president, Obama signed economic stimulus legislation …”
      • Relationship Similarity (R): G.-w. You [7]
          • Co-occurrence similarity between pairs of named entities
          • Ex) (“Jackie Chan”, “Bill Gates”) vs. (“成龙”, “比尔·盖茨”)
  • 8. Motivation
      • Taxonomy table:
                                    Entity        Relationship
          Using entity names        E [1,2,3]     R (You [7])
          Using textual context     EC [4,5,6]    ?
      • Shao [8] combines E and EC.
      • Research questions:
          • Why is RC not used?
          • Can all four categories be combined?
  • 9. In this paper…
      • We propose a new NE translation similarity feature
          • Relationship Context similarity (RC)
          • Contextual word similarity between pairs of named entities
          • Ex) pair (“Barack”, “Michelle”) → Spouse
      • We propose new holistic approaches
          • Combining all of E, EC, R, and RC
      • We validate our proposed approach using extensive experiments
  • 10. Our Framework
      • We abstract this problem as graph matching of two NE relationship graphs extracted from comparable corpora
      • Populate a decision matrix R, a |Ve|-by-|Vc| matrix
      • Figure: English NE graph Ge = (Ve, Ee) and Chinese NE graph Gc = (Vc, Ec).
  • 11. Our Framework: Overview in 3 Steps
      • Initialization
          • Construct NE relationship graphs
          • Build an initial pairwise similarity matrix R0
          • Use Entity (E) and Entity Context (EC) similarities
      • Iterative reinforcement
          • Build a final pairwise similarity matrix R∞
          • Use Relationship (R) and Relationship Context (RC) similarities
      • Matching
          • Find a 1:1 matching from R∞
          • Build a binary hard decision matrix R*
      • Figure: example similarity matrices over English NEs (Obama, Jackie Chan) and Chinese NEs (奥巴马, 成龙), with scores such as .99, .1, and .2.
  • 12. Initialization
      • Constructing NE relationship graphs G = (N, E)
          • Extract NEs using an entity tagger for each document in each corpus
          • Regard NEs that appear more than δ times as nodes
          • Connect two nodes when they co-occur more than δ times
      • Initializing R0
          • Computing entity similarity matrix S^E
          • Use Edit-Distance (ED) between ‘ei’ and the Pinyin representation of ‘cj’
          • Ex) ED(“Obama”, “奥巴马”) = ED(“Obama”, “Aobama”)
          • $S^{E}_{ij} = 1 - \frac{ED(e_i, PY_{c_j})}{Len(e_i) + Len(PY_{c_j})}$
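The entity name similarity on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the Pinyin string is assumed to be supplied by the caller, lowercasing both names is an added assumption, and the denominator is taken to be the sum of the two lengths.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def name_similarity(english_name: str, pinyin_name: str) -> float:
    """S^E = 1 - ED / (Len(e) + Len(pinyin)); the '+' is an assumption."""
    e = english_name.lower()   # assumption: case-insensitive comparison
    p = pinyin_name.lower()
    if not e and not p:
        return 1.0
    return 1.0 - edit_distance(e, p) / (len(e) + len(p))


score = name_similarity("Obama", "Aobama")
```

For the slide's example, "Obama" and "Aobama" differ by a single inserted character, so the score is close to 1.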
  • 13. Initialization
      • Initializing R0: computing entity context similarity matrix S^EC
          • Context word: ex) “As president, Obama signed economic stimulus legislation …”
          • Context window: $CW(NE, d) = \{w_{i-l/2}, w_{i-l/2+1}, \ldots, w_i (= NE), \ldots, w_{i+l/2-1}, w_{i+l/2}\}$
          • Correlation between an NE and a context word: log-odds ratio
  • 14. Initialization
      • Initializing R0: computing entity context similarity matrix S^EC
          • Projected Context Association Vector
      • Figure: context association vectors for “Obama” (e.g., President: 0.9) and “奥巴马” (e.g., 总统: 0.85), projected through a bilingual dictionary with entries such as (President, 总统) and (USA, 美國).
  • 15. Initialization
      • Initializing R0: computing entity context similarity matrix S^EC
          • Context similarity between ‘ei’ and ‘cj’: cosine similarity between the two vectors
          • $S^{EC}_{ij} = \frac{CA_{e_i} \cdot CA_{c_j}}{\lVert CA_{e_i} \rVert \, \lVert CA_{c_j} \rVert}$
      • Merging S^E and S^EC
          • Min-max normalization to the range [0, 1]
          • Merge the normalized S^E_ij and S^EC_ij into R_ij
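The context similarity and merging steps of the initialization can be sketched as below. This is an illustration under stated assumptions: context association vectors are modeled as plain dictionaries from word to score, `project` and `merge` are hypothetical helper names, the bilingual dictionary is a one-to-one word map, and the merge is assumed to be a plain sum of the two normalized matrices (the slide does not show the operator).

```python
import math


def project(vec: dict, dictionary: dict) -> dict:
    """Map English context words into Chinese; words missing from the dictionary are dropped."""
    out = {}
    for word, score in vec.items():
        if word in dictionary:
            zh = dictionary[word]
            out[zh] = out.get(zh, 0.0) + score
    return out


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def min_max(matrix):
    """Rescale all entries of a matrix (list of rows) into [0, 1]."""
    flat = [x for row in matrix for x in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in matrix]
    return [[(x - lo) / (hi - lo) for x in row] for row in matrix]


def merge(s_e, s_ec):
    """Assumed merge: element-wise sum of the normalized S^E and S^EC."""
    n_e, n_ec = min_max(s_e), min_max(s_ec)
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(n_e, n_ec)]


obama = project({"president": 0.9, "signed": 0.2}, {"president": "总统"})
s_ec = cosine(obama, {"总统": 0.85})
```

Projecting before taking the cosine keeps both vectors in the same vocabulary, which is what makes the comparison across languages meaningful.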
  • 16. Reinforcement
      • Intuition: two NEs with a strong relationship
          • Co-occur frequently → have an edge
          • Share similar context → have similar relationship context
      • Figure: NE pairs and their relationship contexts in the English NE graph and the Chinese NE graph.
      • Steps:
          1. Align neighbors using relationship (R) and relationship context (RC) similarity
          2. Update the similarity score
  • 17. Reinforcement: Iterative Approach
      • $R^{t+1}_{ij} = \lambda \sum_{(u,v)_k \in B^{t}(i,j,\theta)} \frac{R^{t}_{uv} \, (S^{RC}_{iu,jv})^{\gamma}}{2^{k}} + (1 - \lambda) R^{0}_{ij}$
      • First term: relationship-based similarity (R & RC); second term: entity-based similarity (E & EC)
      • $B^{t}(i,j,\theta)$: ordered set of aligned neighbor pairs of (i, j) at iteration t
      • $R^{t}_{uv}$: relationship (R) similarity of i’s neighbor u and j’s neighbor v
      • $S^{RC}_{iu,jv}$: relationship context (RC) similarity between relation pairs (i, u) and (j, v)
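One reinforcement iteration might look like the following sketch. Several details are assumptions rather than the authors' method: neighbors are aligned greedily by descending current score, only pairs scoring at least θ are kept, the rank k starts at 1, the RC similarity enters as a factor raised to the power γ, and `s_rc` is a hypothetical callable standing in for the RC similarity.

```python
def aligned_neighbors(R, nbrs_e, nbrs_c, i, j, theta):
    """Assumed B^t(i, j, theta): greedy 1:1 alignment of i's and j's neighbors, best score first."""
    candidates = sorted(
        ((R[u][v], u, v) for u in nbrs_e[i] for v in nbrs_c[j] if R[u][v] >= theta),
        key=lambda item: (-item[0], item[1], item[2]),
    )
    used_u, used_v, aligned = set(), set(), []
    for _, u, v in candidates:
        if u not in used_u and v not in used_v:
            used_u.add(u)
            used_v.add(v)
            aligned.append((u, v))
    return aligned


def reinforce_step(R_t, R_0, nbrs_e, nbrs_c, s_rc, lam, gamma, theta):
    """Compute R^{t+1} from R^t and R^0 for every (English, Chinese) node pair."""
    n_e, n_c = len(R_0), len(R_0[0])
    R_next = [[0.0] * n_c for _ in range(n_e)]
    for i in range(n_e):
        for j in range(n_c):
            total = 0.0
            pairs = aligned_neighbors(R_t, nbrs_e, nbrs_c, i, j, theta)
            for k, (u, v) in enumerate(pairs, start=1):
                # The k-th aligned neighbor pair is discounted by 2^k.
                total += R_t[u][v] * s_rc(i, u, j, v) ** gamma / 2 ** k
            R_next[i][j] = lam * total + (1 - lam) * R_0[i][j]
    return R_next


# Toy graphs: node 0 and node 1 are neighbors in both languages.
R0 = [[0.9, 0.1], [0.1, 0.5]]
neighbors = {0: [1], 1: [0]}
R1 = reinforce_step(R0, R0, neighbors, neighbors,
                    lambda i, u, j, v: 1.0, lam=0.15, gamma=1, theta=0.3)
```

In the toy example the pair (1, 1) gains score because its neighbors (0, 0) already match well, which is the intuition behind the reinforcement. Setting `gamma=0` makes the RC factor equal to 1, leaving only the R term.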
  • 18. Matching
      • Finding a 1:1 matching using a greedy algorithm
      • Steps:
          1. Find the translation pair with the highest final similarity score in R∞
          2. Select the pair and remove the corresponding row and column from R∞
          3. Repeat steps 1 and 2 until the highest remaining similarity score falls below the threshold
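The greedy matching steps above can be sketched as follows. The function name and the list-of-rows matrix representation are illustrative assumptions. Sorting all cells once and skipping used rows and columns selects the same pairs as repeatedly taking the maximum and deleting its row and column.

```python
def greedy_match(R, threshold):
    """Return (row, column, score) triples of a greedy 1:1 matching over matrix R."""
    cells = sorted(
        ((score, i, j) for i, row in enumerate(R) for j, score in enumerate(row)),
        key=lambda item: (-item[0], item[1], item[2]),
    )
    used_rows, used_cols, matches = set(), set(), []
    for score, i, j in cells:
        if score < threshold:
            break  # every remaining cell scores lower still
        if i in used_rows or j in used_cols:
            continue  # this row or column was "removed" by an earlier selection
        used_rows.add(i)
        used_cols.add(j)
        matches.append((i, j, score))
    return matches


R_final = [[0.99, 0.1, 0.2],
           [0.1, 0.99, 0.3]]
matches = greedy_match(R_final, threshold=0.5)
```

The binary decision matrix R* follows directly: set entry (i, j) to 1 for each returned pair and 0 elsewhere.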
  • 19. Experiments
      • Dataset
          • English Gigaword Corpus: Xinhua News Agency, 2008.01~2008.12, 100,746 news documents
          • Chinese Gigaword Corpus: Xinhua News Agency, 2008.01~2008.12, 88,029 news documents
      • Approaches
          • EC: consider the entity context similarity feature only
          • E: consider the entity name similarity feature only
          • Shao (E+EC): combine entity name and entity context similarities
          • You (E+R): combine entity name and relationship similarities
          • Ours: E+EC+R (when γ = 0) and E+EC+R+RC
      • Measure: precision, recall, and F1-score
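The evaluation measures named on this slide are the standard ones. A minimal sketch follows, assuming predicted and gold translations are given as sets of (English, Chinese) pairs; the function name and the sample data are purely illustrative, not results from the talk.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 of predicted translation pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative pairs only.
gold_pairs = {("Obama", "奥巴马"), ("Jackie Chan", "成龙"), ("Bill Gates", "比尔·盖茨")}
predicted_pairs = {("Obama", "奥巴马"), ("Jackie Chan", "比尔·盖茨")}
p, r, f1 = precision_recall_f1(predicted_pairs, gold_pairs)
```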
  • 20. Experiments
      • Effectiveness of the overall framework
          • 500 person named entities
          • Set λ = 0.15
          • 5-fold cross-validation for threshold parameter learning
      • Other types of NE (100 location named entities)
  • 21. Directions
      • Graph matching
      • Graph cleansing [VLDB11]
      • Scalable entity search
      • Figure: “US Presidents” example with the entity names Bill Clinton, William J Clinton, George W. Bush, George H.W. Bush, and Dubya.
  • 22. Thanks
      • Questions?
      • Visit www.postech.ac.kr/~swhwang for these papers