
Wikification of Concept Mentions within Spoken Dialogues Using Domain Constraints from Wikipedia



EMNLP 2015

Published in: Engineering


Seokhwan Kim, Rafael E. Banchs, Haizhou Li
Human Language Technology Department, Institute for Infocomm Research (I2R), Singapore

Wikification on Spoken Dialogues
  • Linking mentions to the relevant concepts in Wikipedia
  • Differences between spoken dialogues and written texts
      • Number of speakers
      • Dependencies to background knowledge
      • Degree of informal and noisy expressions

Examples of Wikification on Singapore tour guide dialogues

Guide: How can I help you?
Tourist: Can you recommend some good places to visit in Singapore?
Guide: Well if you like to visit an icon of Singapore, Merlion park will be a nice place to visit.
Tourist: That is a symbol for your country, right?
Guide: Yes, we use that to symbolise Singapore.
Tourist: Okay.
Guide: The lion head symbolised the founding of the island and the fish body just symbolised the humble fishing village.
Tourist: How can I get there from Orchard Road?
Guide: You can take the red line train from Orchard and stop at Raffles Place.
Tourist: Is this walking distance from the station to the destination?
Guide: Yes, it’ll take only ten minutes on foot.
Tourist: Alright.
Guide: Well, you can also enjoy some seafoods at the riverside near the place.
Tourist: What food do you have any recommendations to try there?
Guide: If you like spicy foods, you must try chilli crab which is one of our favourite dishes here.
Tourist: Great! I’ll try that.
Linked Wikipedia concepts: Singapore, Merlion Park, Orchard Road, North South MRT Line, Raffles Place MRT Station, Singapore River, Chilli crab

Three-step Approach for Wikification on Dialogues
  • Input: mention m_i
  • Step 1: Linking Validity Analysis, In-dialogue Reference Analysis, Domain Relevance Analysis, Speaker Relatedness Analysis
  • Step 2: Candidate Generation
  • Step 3: Candidate Ranking
  • Resources: Wikipedia Concepts; History <m_j, f(m_j)>, j = 0..(i-1)
  • Output: concept f(m_i)

Step 1: Mention Analysis
  • Analyzing four binary properties of a given mention
  • Linking validity, In-dialogue reference, Domain relevance, Speaker relatedness

Guide: In the morning I suggest to you to go to Botanical Garden.
    LV  ID  DR  SRG  SRT
    -   -   -   -    -
    +   -   +   +    -
Tourist: Oh, we also have Botanical Garden.
    LV  ID  DR  SRG  SRT
    +   -   -   -    +
Tourist: That is actually one of my favourite places here.
    LV  ID  DR  SRG  SRT
    +   +   -   -    +
    +   -   -   -    +
Guide: If so, you might like this place also.
    LV  ID  DR  SRG  SRT
    +   +   +   +    -

Step 2: Candidate Generation
  • Candidates retrieval from a Lucene index on the Wikipedia collection
  • With filtering constraints based on the analyzed properties in step 1
  • Combination of multiple constraints: Intersection or Union

Step 3: Candidate Ranking
  • Ranking SVM: Supervised learning to rank algorithm

\[
s(m, c) =
\begin{cases}
4 & \text{if } c \text{ is exactly the same as } g(m), \\
3 & \text{if } c \text{ is the parent article of } g(m), \\
2 & \text{if } c \text{ belongs to the same article but a different section of } g(m), \\
1 & \text{otherwise.}
\end{cases}
\]
  • m: a mention
  • c: a candidate concept
  • g(m): the manual annotation for the most relevant concept of m

Datasets
  • Singapore tour guide dialogues
      • Human-human mixed initiative dialogues
      • 35 sessions, 21 hours, 31,034 utterances
      • Manually annotated with relevant Wikipedia concepts
      • Preprocessed by Stanford CoreNLP toolkit
  • Wikipedia collection
      • 4,797,927 articles and 25,577,464 sections in total
      • Collected from Wikipedia database dump as of January 2015
      • Indexed into a Lucene index

Evaluation: Mention Analysis
  • SVMlight was used for training four mention analyzers
  • With four sets of features: mention (M), utterance (U), dialogue (D), and Wikipedia-based (W) features
  • Five-fold cross validation with F-measure

    Features   LV     ID     SRG    SRT
    M          86.29  69.15  71.10  72.94
    M+U        86.90  70.43  70.43  68.85
    M+D        86.17  71.09  70.56  71.52
    M+W        86.21  68.96  70.66  71.86
    M+U+D      86.82  72.37  70.12  68.30
    M+U+W      86.84  70.13  70.19  68.78
    M+U+D+W    86.77  72.20  69.94  68.10

Evaluation: Candidate Generation
  • Four sets of candidates were prepared for each mention
      • Baseline: Retrieved with no filtering
      • Intersection: Filtered with intersection of analyzed properties
      • Union: Filtered with union of analyzed properties
      • Oracle: Filtered with manually annotated properties
  • Top 100 candidates were retrieved from a Lucene index for each set

Evaluation: Candidate Ranking
  • SVMrank was used for training ranking functions
  • The top-ranked item in the list is considered as the result of Wikification
  • Five-fold cross validation with Precision/Recall/F-measure

    Method            P      R      F
    Baseline          26.85  22.52  21.24
    Intersection      44.37  27.35  33.84
    Union             38.04  31.97  34.74
    Manual Filtering  39.90  34.72  37.13

1 Fusionopolis Way, #21-01 Connexis (South Tower), Singapore 138632
Email:
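
The graded relevance function s(m, c) from Step 3 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Concept` class, the `relevance_label` function, and the idea of identifying a concept by an article title plus an optional section name are all assumptions made here for the example. The sketch also assumes that a candidate section scores 2 only when the annotation g(m) is itself a section of the same article; any case not named on the poster falls through to 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Concept:
    """Hypothetical concept identifier: an article, optionally narrowed to one section."""
    article: str
    section: Optional[str] = None


def relevance_label(candidate: Concept, gold: Concept) -> int:
    """Return the graded label s(m, c) for a candidate c given the annotation g(m)."""
    if candidate == gold:
        return 4  # c is exactly the same as g(m)
    same_article = candidate.article == gold.article
    if same_article and gold.section is not None and candidate.section is None:
        return 3  # c is the parent article of the annotated section
    if same_article and gold.section is not None and candidate.section is not None:
        return 2  # same article, different section (equality was handled above)
    return 1  # otherwise


# Illustrative usage; the section names "History" and "Design" are made up.
gold = Concept("Merlion Park", "History")
labels = [
    relevance_label(Concept("Merlion Park", "History"), gold),
    relevance_label(Concept("Merlion Park"), gold),
    relevance_label(Concept("Merlion Park", "Design"), gold),
    relevance_label(Concept("Chilli crab"), gold),
]
```

Labels of this kind would serve as the target ordering when training a learning-to-rank model over the retrieved candidates for each mention.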