Olga Khomitsevich - Flexible Context Extraction for Keywords in Russian Automatic Speech Recognition Results

1
FLEXIBLE CONTEXT EXTRACTION FOR
KEYWORDS IN RUSSIAN AUTOMATIC
SPEECH RECOGNITION RESULTS
O. Khomitsevich, K. Boyarsky, E. Kanevsky, A. Bulusheva, V. Mendelev
bulusheva@speechpro.com
Financially supported by the Ministry of Education and Science
of the Russian Federation, Contract 14.579.21.0008, ID RFMEFI57914X0008.
AIST 2016

2
CONTENTS
Introduction
The proposed method
The SemSin system
Rules for context extraction
Examples for context extraction
Experiments and results
Discussion and future developments

3
INTRODUCTION
Issues
 Keyword search tasks
 Thematic clustering tasks
Existing methods
 Output the whole sentence
 Output a window of n words to the right and left of the
keyword
Problems
 The sentence may be very long
 Poorly punctuated recognizer output
 The window may miss important information
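
For illustration, the window baseline can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `window_context`, whitespace tokenization and exact surface-form matching are all assumptions (real Russian text would need lemma matching). The sentence is the English translation of the news example used later in the deck.

```python
def window_context(words, keyword, n=4):
    """Return, for every occurrence of keyword, the n words on each side.

    Matching is by exact surface form (an assumption made for brevity).
    """
    contexts = []
    for i, word in enumerate(words):
        if word == keyword:
            start = max(0, i - n)  # do not run past the sentence start
            contexts.append(" ".join(words[start:i + n + 1]))
    return contexts


# English translation of the subordinate clause from the news example.
WORDS = ("the USA serviceman who went missing in the Afghan Paktika province "
         "at the end of June is in the hands of militants").split()

print(window_context(WORDS, "USA", n=4))
```

With n=4 the window stops at "went missing" and never reaches the predicate "is in the hands of militants", which is the kind of information loss listed under Problems.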

4
THE PROPOSED METHOD

5
THE SEMSIN SYSTEM
SemSin is based on three databases:
 Morphological database
 Database of idioms
 Database of prepositions
SemSin is a system for syntactic and semantic analysis of Russian text. It
combines the functions of a part-of-speech tagger, an ontology and a syntactic parser.

6
THE SEMSIN SYSTEM
The SemSin system analyses text paragraph by paragraph in the following steps:
Each word is processed by the morphological analyser (lemma, POS, grammatical form, semantic class and syntactic dependents).
The text is tokenized and divided into sentences by the pre-syntax module.
Syntactic parse trees are constructed for each sentence by applying about 400 rules.

7
THE SEMSIN SYSTEM
The following features are represented in the resulting XML file:
 Id is the unique ID of the token inside the sentence
 lemma is the base form of the word
 morph contains information about the POS and grammatical features of the word (animacy, gender, number, case, tense, etc.)
 class refers to the number of the semantic class of the word
 rel is the tag containing information about relations between words in the sentence
 id_head contains the Id of the parent node
 type indicates the type of the dependency relation between the two words

8
THE SEMSIN SYSTEM
A fragment of a resulting file
“Саудовская Аравия предпочитает” (Translation: “Saudi Arabia prefers”)
<w Id="1" lemma="САУДОВСКИЙ" morph="ПРИЛ жр,ед,им" class="$715"> <rel id_head="2" type="Часть_Назв"/> Саудовская </w>
<w Id="2" lemma="АРАВИЯ" morph="СУЩ но,жр,ед,им" class="$1231000"> <rel id_head="3" type="Субъект"/> Аравия </w>
<w Id="3" lemma="ПРЕДПОЧИТАТЬ" morph="Г пе,нс,дст,нст,3л, ед" class="$1241/41561"> <rel id_head="" type=""/> предпочитает </w>
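
A fragment like this can be read with Python's standard `xml.etree.ElementTree`. The sketch below is only an illustration: the wrapping `<s>` root element and the function name `read_tokens` are assumptions, since the slide shows only the `<w>` elements and not the surrounding file structure.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<w Id="1" lemma="САУДОВСКИЙ" morph="ПРИЛ жр,ед,им" class="$715"> <rel id_head="2" type="Часть_Назв"/> Саудовская </w>
<w Id="2" lemma="АРАВИЯ" morph="СУЩ но,жр,ед,им" class="$1231000"> <rel id_head="3" type="Субъект"/> Аравия </w>
<w Id="3" lemma="ПРЕДПОЧИТАТЬ" morph="Г пе,нс,дст,нст,3л, ед" class="$1241/41561"> <rel id_head="" type=""/> предпочитает </w>
"""


def read_tokens(fragment):
    """Parse <w> elements into dictionaries, one per token."""
    # The <s> wrapper is an assumption: it only gives the parser a single root.
    root = ET.fromstring(f"<s>{fragment}</s>")
    tokens = []
    for w in root.iter("w"):
        rel = w.find("rel")
        head = rel.get("id_head")
        tokens.append({
            "id": int(w.get("Id")),
            "form": "".join(w.itertext()).strip(),  # surface word after <rel/>
            "lemma": w.get("lemma"),
            "morph": w.get("morph"),
            "class": w.get("class"),
            "head": int(head) if head else None,    # empty id_head marks the root
            "type": rel.get("type") or None,
        })
    return tokens


for token in read_tokens(FRAGMENT):
    print(token["id"], token["form"], token["head"], token["type"])
```

The empty `id_head` on the verb is mapped to `None`, so the root of the parse tree is easy to find.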

9
RULES FOR CONTEXT EXTRACTION
The algorithm extracts:
 all words immediately dependent on the keyword;
 the topmost node of the clause (normally the predicate) and all the nodes between it and the keyword;
 the subject of the predicate (unless it is already extracted or coincides with the keyword);
 the direct object of the predicate, and, for verbs of the class “speech/information/reporting”, the object denoting the content of the report;
 prepositional and other groups linked to the predicate by a “where?”-type link;
 all the words in the genitive case that depend on those already extracted.
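
The rules above can be sketched over a small dependency tree. This is a simplified illustration under stated assumptions, not the SemSin-based implementation: the `Token` fields, the English relation labels (`subject`, `direct_object`, `content`, `where`, `prep`, `attr`) and the hand-built parse are all hypothetical, and a "group" is reduced to its head word plus its preposition. The tokens are an abridged version of the news sentence shown in the example that follows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    idx: int                   # position in the sentence (1-based)
    form: str                  # surface form
    head: int                  # idx of the parent token, 0 for the root
    rel: str                   # hypothetical label of the link to the parent
    case: str = ""             # grammatical case, "gen" for genitive
    predicate: bool = False    # True for the top node of a clause
    speech_verb: bool = False  # True for speech/information/reporting verbs


def extract_context(tokens, keyword_idx):
    by_idx = {t.idx: t for t in tokens}
    children = {}
    for t in tokens:
        children.setdefault(t.head, []).append(t)

    keep = {keyword_idx}
    # Rule 1: words immediately dependent on the keyword.
    keep.update(c.idx for c in children.get(keyword_idx, []))
    # Rule 2: climb to the top node of the clause, keeping the whole path.
    node = by_idx[keyword_idx]
    while not node.predicate and node.head != 0:
        node = by_idx[node.head]
        keep.add(node.idx)
    pred = node
    for c in children.get(pred.idx, []):
        if c.rel in ("subject", "direct_object"):        # Rules 3 and 4
            keep.add(c.idx)
        elif c.rel == "content" and pred.speech_verb:    # Rule 4, speech verbs
            keep.add(c.idx)
        elif c.rel == "where":                           # Rule 5
            keep.add(c.idx)
            keep.update(g.idx for g in children.get(c.idx, []) if g.rel == "prep")
    # Rule 6: genitive dependents of extracted words, repeated until stable.
    changed = True
    while changed:
        changed = False
        for t in tokens:
            if t.case == "gen" and t.head in keep and t.idx not in keep:
                keep.add(t.idx)
                changed = True
    return " ".join(by_idx[i].form for i in sorted(keep))


# Hand-built parse of an abridged sentence (an assumption, not SemSin output).
TOKENS = [
    Token(1, "командир", 3, "subject"),
    Token(2, "талибов", 1, "attr", case="gen"),
    Token(3, "сообщил", 0, "root", predicate=True, speech_verb=True),
    Token(4, "что", 10, "conj"),
    Token(5, "военнослужащий", 10, "subject"),
    Token(6, "США", 5, "attr", case="gen"),
    Token(7, "пропавший", 5, "participle"),
    Token(8, "в", 9, "prep"),
    Token(9, "провинции", 7, "where"),
    Token(10, "находится", 3, "content", predicate=True),
    Token(11, "в", 12, "prep"),
    Token(12, "руках", 10, "where"),
    Token(13, "боевиков", 12, "attr", case="gen"),
]

print(extract_context(TOKENS, keyword_idx=6))
```

On this hand-built parse the sketch returns "военнослужащий США находится в руках боевиков". The participial phrase is dropped because its "where?" link attaches to the participle and not to the predicate.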

10
EXAMPLE FOR CONTEXT EXTRACTION
News articles:
Keyword:
США (Translation: “USA”).
The original transcript:
Полевой командир талибов Маулави Сангин сообщил в четверг западным
информационным агентствам, что военнослужащий США, пропавший в
афганской провинции Пактика в конце июня, находится в руках боевиков.
(Translation: “Taliban field commander Mawlawi Sangin informed Western news agencies on Thursday
that the USA serviceman who went missing in the Afghan province of Paktika at the end of June is in the hands of
militants”).
Context:
военнослужащий США находится в руках боевиков
(Translation: “the USA serviceman is in the hands of militants”)

11

12
Recognition output:
Keyword:
льготный (Translation: “relating to benefits”).
The recognized transcript:
меня очень интересует, почему у нас так плохо стало с с
лекарством бы льготным лекарствам.
(Approximate translation: “I’m really interested why for us it has become so bad with with
medicine to benefit medicines”).
The original transcript:
меня очень интересует, почему у нас так плохо стало с
лекарством, льготным лекарством
(Translation: “I’m really interested why for us it has become so bad with a medicine, a benefit medicine”).
Context:
у нас плохо стало льготным лекарствам
(Translation: “for us it has become bad to benefit medicines”)

13

14
EXPERIMENTS AND RESULTS
News articles
20 human experts
2 context quality measures (from 1 to 10, higher is better): completeness and
conciseness
Test-case:
500 sentences from news articles, 55 keywords.
237 contexts were extracted.
Algorithm                      Avg. completeness   Avg. conciseness
Context with window n=4        6.2                 7.64
Flexible context extraction    7.34                8.5

15
EXPERIMENTS AND RESULTS
Recognition output
Test-case:
2000 sentences on social topics, produced by a Russian ASR system with 80% accuracy.
23 keywords.
223 contexts were extracted.
Algorithm                      Avg. completeness   Avg. conciseness
Flexible context extraction    7.41                8.28

16
DISCUSSION AND FUTURE DEVELOPMENTS
We are going to
add new syntactic dependencies;
make the context shorter or longer according to the user’s needs;
include more advanced NLP methods;
make the syntactic parser more robust for spontaneous speech
recognition results;
test the use of the extracted contexts in a clustering task.

17
THANK YOU
CONTACTS
Russia
4 Krasutskogo street, St. Petersburg,
196084
Tel.: +7 812 325-8848
Fax: +7 812 327 9297
Email: info@speechpro.com
USA
Suite 316, 369 Lexington ave
New York, NY, 10017
Tel.: +1 646 237 7895
Email: sales-usa@speechpro.com
ABOUT THE COMPANY
STC-Innovations is a leader in the multimodal biometric
market. STC-Innovations develops multimodal biometric
solutions based on person-identifying technologies via voice,
face and other noncontact biometric features.
STC-Innovations is a spin-off company of the Speech
Technologies Center, a leading global provider of innovative
systems in high-quality recording, audio and video processing
and analysis, speech synthesis and recognition, and real-time,
high-accuracy voice and facial biometrics solutions with over
20 years of research, development and implementation
experience in Russia and internationally.
STC is ISO 9001:2008 certified.