A semi-supervised method for efficient construction of statistical spoken language understanding resources
A Semi-supervised Method for Efficient Construction of Statistical Spoken Language Understanding Resources Seokhwan Kim, Minwoo Jeong, and Gary Geunbae Lee Pohang University of Science and Technology (POSTECH), South Korea ABSTRACT EXTRACTING CONTEXT PATTERNS SCORING CANDIDATES We present a semi-supervised framework to construct spoken ● The score of the alignment between a raw utterance and a To overcome the context sparseness problem of spoken utterances, we make use of not sub-phrases of an utterance, but the full context patternlanguage understanding resources with very low cost. We utterance itself as a context pattern for extracting named entities in the utterance. First, we assume that each entry in the entity listgenerate context patterns with a few seed entities and a large is absolutely precise and uniquely belong to only a category whether it is a seed entity or an extended entity as an intermediate ofamount of unlabeled utterances. Using these context patterns, the overall procedure. For each entry in the entity list, we find out utterances containing it, and make an utterance template bywe extract new entities from the unlabeled utterances. The replacing the part of entity in the utterance with the defined entity label. In this replacing task, we exclude the entities which are where ref is a context pattern, tar is a raw utterance, n is theextracted entities are appended to the seed entities, and we can located at the beginning or end of the utterance, because context patterns containing the entities located in such positions can lead number of words in ref, m is the number of words in tar, t is theobtain the extended entity list by repeating these steps. Our number of aligned entity labels, and e is the number of words to confusion of determining the boundaries of each entity in the later procedure. extracted as entity candidatesmethod is based on an utterance alignment algorithm which is avariant of the biological sequence alignment algorithm. Using ● The score of an entity candidate ej which is extracted by athis method, we can obtain precise entity lists with high context pattern reficoverage, which is of help to reduce the cost of building ALIGNMENT-BASED NAMED ENTITY RECOGNITIONresources for statistical spoken language understanding systems. We firstly align a raw utterance with a context pattern containing entity labels. Then, from the result of the best alignment between them, we extract the parts of the raw utterance which are aligned to the entity labels in the context pattern as an entity ● The final score of an entity candidate ej MOTIVATION candidate belonging to the category of the corresponding entity label. Spoken Language Understanding (SLU) is a problem ofextracting semantic meanings from recognized utterances andfilling the correct values into a semantic frame structure. Mostof the statistical SLU approaches require a sufficient amount of EXPERIMENTStraining data which consist of labeled utterances with We evaluated our method on the CU-Communicator corpus,corresponding semantic concepts. The task of building the which consists of 13,983 utterances. We chose the three mosttraining data for the SLU problem is one of the most important frequently occurring semantic categories in the corpus, CITYand high-cost required tasks for managing the spoken dialog MATRIX COMPUTATION TRACE BACK NAME, MONTH, and DAY NUMBER. we empirically set the entity selection threshold value to 0.3.systems. We concentrate on utilizing a semi-supervisedinformation extraction method for building resources for The traceback step is started at the position with ● Result of automatic entity list extensionstatistical SLU modules in spoken dialog systems. maximum score from among the first column and the first row. Then, the next position of the position [i, j] is # of # of determined by following policies. # of Category extended total Precision Recall OVERALL PROCEDURE • If tar(i) and ref(j) are identical, then the next position seeds entities entities is [i + 1, j + 1].1. Prepare seed entity list E and unlabeled corpus C • Otherwise, the position with maximum score from CITY_NAME 20 123 209 65.04% 37.91%2. Find utterances containing lexical of entities in E in the among [i + 1, j + 1] ~ [i + 1, n] and [i + 1, j + 1] ~ [m, j +corpus C, and replace the parts of matched entities in the 1] is the next position. MONTH 1 10 12 100% 83.33%found utterances with a label which indicates the location DAY_NUMBER 3 27 34 100% 79.41%of entities. Add partially labeled utterances to thecontext pattern set P. ● Result of corpus labeling experiment3. Align each utterance in the corpus C with each contextpattern in P, and extract new entity candidates in the utterance Category Precision Recall F-measurewhich is matched with the entity label in the contextpattern. CITY_NAME 91.30 86.83 89.014. Compute the score of extracted entity candidates in step MONTH 98.98 87.24 92.743, and add only high-scored candidates to E. DAY_NUMBER 92.00 82.03 86.735. If there is no additional entities to E in step 4, terminate Overall 93.24 85.53 89.22the process with entity list E, context pattern set P, andpartially labeled corpus C as results. Otherwise, returnto step 2 and repeat the process.