The document presents a method for flexible context extraction for keywords in Russian automatic speech recognition results. It describes rules for extracting a context that includes dependent words, the clause predicate, subjects, objects, and other grammatical elements. In expert evaluation, the method is more concise than fixed-window approaches on both news text and ASR output and more complete on news text, while on ASR output its completeness is slightly below the window baselines. Future work will improve the syntactic parser and test the contexts in clustering tasks.
Olga Khomitsevich - Flexible Context Extraction for Keywords in Russian Automatic Speech Recognition Results
1. 1
FLEXIBLE CONTEXT EXTRACTION FOR
KEYWORDS IN RUSSIAN AUTOMATIC
SPEECH RECOGNITION RESULTS
O. Khomitsevich, K. Boyarsky, E. Kanevsky, A. Bulusheva, V. Mendelev
bulusheva@speechpro.com
Financially supported by the Ministry of Education and Science
of the Russian Federation, Contract 14.579.21.0008, ID RFMEFI57914X0008.
AIST 2016
2. 2
CONTENTS
Introduction
The proposed method
The SemSin system
Rules for context extraction
Examples for context extraction
Experiments and results
Discussion and future developments
3. 3
INTRODUCTION
Issues
Keyword search tasks
Thematic clustering tasks
Existing methods
Output the whole sentence
Output a window of n words to the right and left of the
keyword
Problems
The sentence may be very long
Poorly punctuated recognizer output
The window may miss important information
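The fixed-window baseline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `window_context` is hypothetical, and the sentence is the news example from this presentation with punctuation stripped and split on whitespace.

```python
def window_context(tokens, keyword_index, n):
    """Return the keyword with up to n tokens on each side of it."""
    start = max(0, keyword_index - n)
    end = min(len(tokens), keyword_index + n + 1)
    return tokens[start:end]


# News example from the slides, punctuation removed (a simplification).
tokens = (
    "Полевой командир талибов Маулави Сангин сообщил в четверг западным "
    "информационным агентствам что военнослужащий США пропавший в "
    "афганской провинции Пактика в конце июня находится в руках боевиков"
).split()

keyword_index = tokens.index("США")
context = window_context(tokens, keyword_index, n=4)
print(" ".join(context))
```

With n=4 the window stops well before the predicate "находится" ("is"), which is the kind of missing information listed under Problems.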
4. 4
THE PROPOSED METHOD
5. 5
THE SEMSIN SYSTEM
SemSin is based on three databases:
Morphological database
Database of idioms
Database of prepositions
SemSin is a system for syntactic and semantic analysis of Russian text. It
combines the functions of a Part-of-Speech tagger, ontology and syntactic parser.
6. 6
THE SEMSIN SYSTEM
The SemSin system analyses text by paragraph, involving the following steps:
Each word is processed by the morphological analyser (lemma, POS, grammatical
form, semantic class and syntactic dependents).
The text is tokenized and divided into sentences by the pre-syntax module.
Syntactic parse trees are constructed for each sentence by means of the application
of about 400 rules.
7. 7
THE SEMSIN SYSTEM
The following features are represented in the resulting XML file:
Id is the unique ID of the token within the sentence
lemma is the base form of the word
morph contains information about the POS and grammatical features of
the word (animacy, gender, number, case, tense, etc.)
class is the number of the semantic class of the word
rel is the tag containing information about the relations between words in the
sentence
id_head contains the Id of the parent node
type indicates the type of the dependency relation between the two words
8. 8
THE SEMSIN SYSTEM
A fragment of a resulting file
“Саудовская Аравия предпочитает” (Translation: “Saudi Arabia prefers”)
<w Id="1" lemma="САУДОВСКИЙ" morph="ПРИЛ жр,ед,им" class="$715"> <rel id_head="2" type="Часть_Назв"/> Саудовская </w>
<w Id="2" lemma="АРАВИЯ" morph="СУЩ но,жр,ед,им" class="$1231000"> <rel id_head="3" type="Субъект"/> Аравия </w>
<w Id="3" lemma="ПРЕДПОЧИТАТЬ" morph="Г пе,нс,дст,нст,3л, ед" class="$1241/41561"> <rel id_head="" type=""/> предпочитает </w>
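A fragment like this can be read with Python's standard `xml.etree.ElementTree`. The sketch below makes two assumptions that are not stated in the slides: the `<w>` elements are wrapped in a root element (here a hypothetical `<s>`), and the surface form is the text that follows the `<rel/>` tag. The function name `read_tokens` is illustrative only.

```python
import xml.etree.ElementTree as ET

# The <s> wrapper is an assumption; the slides show only the <w> elements.
FRAGMENT = """<s>
<w Id="1" lemma="САУДОВСКИЙ" morph="ПРИЛ жр,ед,им" class="$715"> <rel id_head="2" type="Часть_Назв"/> Саудовская </w>
<w Id="2" lemma="АРАВИЯ" morph="СУЩ но,жр,ед,им" class="$1231000"> <rel id_head="3" type="Субъект"/> Аравия </w>
<w Id="3" lemma="ПРЕДПОЧИТАТЬ" morph="Г пе,нс,дст,нст,3л, ед" class="$1241/41561"> <rel id_head="" type=""/> предпочитает </w>
</s>"""


def read_tokens(xml_text):
    """Turn each <w> element into a dict with its form, lemma and head."""
    tokens = []
    for w in ET.fromstring(xml_text).iter("w"):
        rel = w.find("rel")
        head = rel.get("id_head")
        tokens.append({
            "id": int(w.get("Id")),
            # The word form follows the <rel/> tag, so it is rel's tail text.
            "form": (rel.tail or "").strip(),
            "lemma": w.get("lemma"),
            "morph": w.get("morph"),
            # An empty id_head marks the root of the sentence.
            "head": int(head) if head else None,
            "type": rel.get("type") or None,
        })
    return tokens


for token in read_tokens(FRAGMENT):
    print(token["id"], token["form"], token["head"], token["type"])
```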
9. 9
RULES FOR CONTEXT EXTRACTION
The algorithm extracts:
all words immediately dependent on the keyword;
the topmost node of the clause (normally the predicate) and all the
nodes between it and the target word;
the subject of the predicate (unless it is already extracted or
coincides with the target word);
the direct object of the predicate, and, for verbs of the class
“speech/information/reporting”, the object denoting the content of
the report;
prepositional and other groups linked to the predicate by a
“where?”-type link;
all the words in the genitive case that depend on those already
extracted.
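The rules above can be sketched on a toy dependency tree. This is a simplified illustration under stated assumptions, not the SemSin-based implementation: the relation labels ("subject", "object", "where", "clause", "root"), the case tag "gen", the `Token` class and the hand-built tree are all hypothetical, a "group" is approximated as the linked node plus its immediate dependents, and the special handling of reporting verbs is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    id: int
    form: str
    head: Optional[int]  # id of the parent node, None for the sentence root
    rel: str             # hypothetical relation label
    case: str = ""       # hypothetical case tag, "gen" for genitive


CLAUSE_RELS = {"root", "clause"}  # assumed labels for the top of a clause


def extract_context(tokens, keyword_id):
    by_id = {t.id: t for t in tokens}
    children = {}
    for t in tokens:
        children.setdefault(t.head, []).append(t)

    picked = {keyword_id}
    # Words immediately dependent on the keyword.
    picked.update(c.id for c in children.get(keyword_id, []))

    # Climb to the topmost node of the clause, keeping the nodes on the path.
    node = by_id[keyword_id]
    while node.rel not in CLAUSE_RELS and node.head is not None:
        node = by_id[node.head]
        picked.add(node.id)
    predicate = node

    # Subject, direct object and "where?"-type groups of the predicate.
    for child in children.get(predicate.id, []):
        if child.rel in {"subject", "object"}:
            picked.add(child.id)
        elif child.rel == "where":
            picked.add(child.id)
            picked.update(g.id for g in children.get(child.id, []))

    # Genitive words depending on anything already extracted, to a fixpoint.
    changed = True
    while changed:
        changed = False
        for t in tokens:
            if t.case == "gen" and t.head in picked and t.id not in picked:
                picked.add(t.id)
                changed = True

    return " ".join(t.form for t in tokens if t.id in picked)


# Hand-built, shortened tree for the news example (an assumption, not parser output).
SENTENCE = [
    Token(1, "Сангин", 2, "subject"),
    Token(2, "сообщил", None, "root"),
    Token(3, "военнослужащий", 6, "subject"),
    Token(4, "США", 3, "attribute", "gen"),
    Token(5, "пропавший", 3, "participle"),
    Token(6, "находится", 2, "clause"),
    Token(7, "в", 6, "where"),
    Token(8, "руках", 7, "prep_obj"),
    Token(9, "боевиков", 8, "attribute", "gen"),
]

print(extract_context(SENTENCE, keyword_id=4))
```

On this toy tree the participle "пропавший" is dropped because it depends on the subject rather than on the keyword, while the predicate and its "where?" group are kept.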
10. 10
EXAMPLE FOR CONTEXT EXTRACTION
News articles:
Keyword:
США (Translation: “USA”).
The original transcript:
Полевой командир талибов Маулави Сангин сообщил в четверг западным
информационным агентствам, что военнослужащий США, пропавший в
афганской провинции Пактика в конце июня, находится в руках боевиков.
(Translation: “Taliban field commander Mawlawi Sangin informed Western news agencies on Thursday
that the US serviceman who went missing in the Afghan province of Paktika at the end of June is in the hands of
militants”).
Context:
военнослужащий США находится в руках боевиков
(Translation: “the US serviceman is in the hands of militants”)
11. 11
EXAMPLE FOR CONTEXT EXTRACTION
12. 12
EXAMPLE FOR CONTEXT EXTRACTION
Recognition output:
Keyword:
льготный (Translation: “relating to benefits”).
The recognized transcript:
меня очень интересует, почему у нас так плохо стало с с
лекарством бы льготным лекарствам.
(Approximate translation: “I’m really interested why for us it has become so bad with with
medicine to benefit medicines”).
The original transcript:
меня очень интересует, почему у нас так плохо стало с
лекарством, льготным лекарством
(Translation: “I’m really interested why for us it has become so bad with a medicine, a benefit medicine”).
Context:
у нас плохо стало льготным лекарствам
(Translation: “for us it has become bad to benefit medicines”)
13. 13
EXAMPLE FOR CONTEXT EXTRACTION
14. 14
EXPERIMENTS AND RESULTS
News articles
20 human experts
2 context quality measures (rated from 1 to 10, higher is better): completeness and
conciseness
Test-case:
500 sentences from news articles, 55 keywords.
237 contexts were extracted.
Algorithm                      Avg. completeness   Avg. conciseness
Context with window n=4        6.2                 7.64
Context with window n=5        6.74                7.3
Flexible context extraction    7.34                8.5
15. 15
EXPERIMENTS AND RESULTS
Recognition output
Test-case:
2000 sentences on social topics, produced by a Russian ASR system with 80% accuracy.
23 keywords.
223 contexts were extracted.
Algorithm                      Avg. completeness   Avg. conciseness
Context with window n=4        7.59                7.44
Context with window n=5        7.92                7.24
Flexible context extraction    7.41                8.28
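To make the comparison across the two test sets explicit, the short sketch below computes how far the flexible method is above or below the better of the two window baselines on each measure. The scores are copied from the two results tables in this presentation; the dictionary layout and the function name `gain_over_best_window` are illustrative choices.

```python
# (avg. completeness, avg. conciseness) as reported in the two results tables.
SCORES = {
    "news": {
        "window n=4": (6.2, 7.64),
        "window n=5": (6.74, 7.3),
        "flexible": (7.34, 8.5),
    },
    "asr": {
        "window n=4": (7.59, 7.44),
        "window n=5": (7.92, 7.24),
        "flexible": (7.41, 8.28),
    },
}


def gain_over_best_window(scores):
    """Flexible score minus the best window score, per measure."""
    flexible = scores["flexible"]
    windows = [v for name, v in scores.items() if name != "flexible"]
    return tuple(
        round(flexible[i] - max(w[i] for w in windows), 2) for i in range(2)
    )


for test_set, scores in SCORES.items():
    completeness_gain, conciseness_gain = gain_over_best_window(scores)
    print(test_set, completeness_gain, conciseness_gain)
```

A positive value means the flexible method scored higher than both window settings on that measure, and a negative value means at least one window setting scored higher.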
16. 16
DISCUSSION AND FUTURE DEVELOPMENTS
We are going to
add new syntactic dependencies;
make the context shorter or longer according to the user’s needs;
include more advanced NLP methods;
make the syntactic parser more robust for spontaneous speech
recognition results;
test the use of the extracted contexts in a clustering task.
17. 17
THANK YOU
CONTACTS
Russia
4 Krasutskogo street, St. Petersburg,
196084
Tel.: +7 812 325-8848
Fax: +7 812 327 9297
Email: info@speechpro.com
USA
Suite 316, 369 Lexington ave
New York, NY, 10017
Tel.: +1 646 237 7895
Email: sales-usa@speechpro.com
ABOUT THE COMPANY
STC-Innovations is a leader in the multimodal biometric
market. STC-Innovations develops multimodal biometric
solutions based on person-identifying technologies via voice,
face and other noncontact biometric features.
STC-Innovations is a spin-off company of the Speech
Technologies Center, a leading global provider of innovative
systems in high-quality recording, audio and video processing
and analysis, speech synthesis and recognition, and real-time,
high-accuracy voice and facial biometrics solutions with over
20 years of research, development and implementation
experience in Russia and internationally.
STC is ISO 9001:2008 certified.