This paper reports on experiments with the first available test collection for searching spontaneous Czech speech. The collection consists of automatic transcripts of interviews with Holocaust survivors, with a word error rate of around 35% owing to the emotional, accented nature of the speech. The authors transformed the continuous transcripts into overlapping 3-minute "documents" and compared retrieval using word, stem, and lemma forms of queries and documents. Lemmatization and stemming each yielded statistically significant improvements over raw word matching, nearly doubling mean Generalized Average Precision. Performance was limited, however, by rare named entities from the topics that were missing from the ASR lexicon and transcripts, which accounted for poor results on several topics.
First Experiments Searching Spontaneous Czech Speech
Pavel Ircing
Department of Cybernetics
University of West Bohemia
Plzeň, Czech Republic
ircing@kky.zcu.cz
Douglas W. Oard
College of Information
Studies/UMIACS
University of Maryland
College Park, Maryland
oard@glue.umd.edu
Jan Hoidekr
Department of Cybernetics
University of West Bohemia
Plzeň, Czech Republic
hoidekr@kky.zcu.cz
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing
General Terms
Experimentation
Keywords
Speech retrieval; Spontaneous speech
1. INTRODUCTION
This paper reports on experiments with the first available Czech IR test collection. The collection consists of a continuous stream from automatic transcription of spontaneous speech (see [3] for details), and the task of the IR system is to identify appropriate replay points where the discussion about the queried topic starts. The collection thus lacks clearly defined document boundaries. Moreover, the accuracy of the transcription is limited (around 35% word error rate), mostly due to the nature of the speech: interviews with Holocaust survivors, which are sometimes emotional, accented, and marked by age-related speech impediments. This collection therefore offers an excellent opportunity to explore both effects present in Czech (e.g., morphology) and effects that result from processing spontaneous speech. It was also used in the CL-SR track at the CLEF 2006 evaluation campaign (http://www.clef-campaign.org/).
2. METHODS
Retrieval from a speech stream with unknown topic boundaries is an interesting challenge, but that is not our principal focus in these experiments. We therefore transformed the collection into an artificially defined set of "documents" by removing all recognized pauses between words and then sliding a 3-minute window over the transcripts with a 1-minute step size. This resulted in a collection of 11,377 overlapping passages, each containing an average of 390 recognized words (denoted as the asr field) and a set of automatically produced Czech translations (using techniques described in [2]) for 20 automatically assigned thesaurus keywords (using techniques described in [4]) (the ak field). Each field was indexed separately, and a unified index (asr.ak) was also constructed.

Table 1: Mean GAP, long queries.

          word     stem     lemma
  asr     0.0256   0.0494   0.0506
  ak      0.0018   0.0022   0.0023
  asr.ak  0.0241   0.0447   0.0467
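The passage construction step is straightforward to reproduce. The following is a minimal sketch, not our actual implementation, assuming the transcript is available as time-stamped tokens; the data layout and function name are illustrative only.

```python
# Hypothetical sketch: slice a pause-stripped ASR word stream into
# overlapping 3-minute passages with a 1-minute step, as described above.
# Each word is assumed to be a (start_time_in_seconds, token) pair.

def make_passages(words, window=180.0, step=60.0):
    """words: list of (start_time, token) pairs, sorted by time."""
    if not words:
        return []
    passages = []
    start = words[0][0]
    t_end = words[-1][0]
    while start <= t_end:
        # Collect every recognized word falling inside the current window.
        passage = [tok for t, tok in words if start <= t < start + window]
        if passage:
            passages.append((start, passage))
        start += step
    return passages
```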
Twenty-nine topics were initially created in English in the usual TREC-style format (<title>, <desc> and <narr> fields), translated into Czech by a native speaker, and then checked for natural expression by a second native speaker. We performed monolingual experiments with "long" queries constructed by concatenating the words from all three topic fields.
A morphological analyser was used to obtain information about the lemma (linguistic root form), stem (approximation to that root form using truncation alone) and part-of-speech for each Czech word [1]. Three variants of the collection were indexed: one with only words, one with only lemmas, and one with only stems. Part-of-speech tags were used as the basis for stopword removal; as we could not find any decent stoplist for Czech, we simply removed all words that were tagged as prepositions, conjunctions, particles or interjections. In each case, identical processing was done for the queries.
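A sketch of this POS-based filter appears below; the tag names are assumptions standing in for the analyser's actual tagset.

```python
# Minimal sketch of POS-based stopword removal: drop every token whose
# morphological tag marks it as a preposition, conjunction, particle, or
# interjection. The tag values here are hypothetical; the real tagset
# comes from the Czech morphological analyser [1].

DROP_POS = {"PREP", "CONJ", "PART", "INTJ"}  # assumed tag names

def remove_stopwords(tagged_tokens):
    """tagged_tokens: list of (token, pos_tag) pairs."""
    return [tok for tok, pos in tagged_tokens if pos not in DROP_POS]

# The same filter is applied to documents and queries alike.
```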
We used Lemur to implement a simple tf.idf model with blind feedback (using Lemur's standard parameters). Length normalization was not performed because the collection preprocessing resulted in documents with nearly identical lengths.
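To make the retrieval setup concrete, the sketch below illustrates tf.idf scoring with blind (pseudo-relevance) feedback in the general spirit of such a run; Lemur's actual weighting formula and feedback parameters differ, and the cutoffs chosen here (top 5 documents, 10 expansion terms) are assumptions.

```python
# Illustrative only: bare-bones tf.idf ranking with blind feedback.
import math
from collections import Counter

def tfidf_scores(query_terms, docs):
    """docs: list of token lists. Returns (doc_index, score), best first."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency counts each doc once
    scores = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        s = sum(tf[t] * math.log((N + 1) / (df[t] + 1)) for t in query_terms)
        scores.append((i, s))
    return sorted(scores, key=lambda x: -x[1])

def blind_feedback(query_terms, docs, k=5, n=10):
    """Expand the query with n frequent terms from the top-k documents,
    then re-rank. k and n are assumed values, not Lemur's defaults."""
    top = tfidf_scores(query_terms, docs)[:k]
    pool = Counter()
    for i, _ in top:
        pool.update(docs[i])
    expansion = [t for t, _ in pool.most_common(n)]
    return tfidf_scores(list(query_terms) + expansion, docs)
```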
3. EVALUATION
Relevance assessors identified appropriate start times by interactively searching using manually assigned English thesaurus terms and the same automatically transcribed content, ultimately confirming their decisions by listening to the audio when the automatically produced transcripts were not sufficiently accurate to make a definitive judgment. Table 1 reports the mean Generalized Average Precision (mGAP), which is computed in a manner similar to mean average precision (for details see [3]).
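As a rough illustration of the intuition behind GAP (the authoritative definition is in [3]), the sketch below gives partial credit that decays with the distance between a retrieved replay point and the nearest relevant start time; the linear decay and the 5-minute tolerance are assumptions made purely for illustration.

```python
# Illustration only: partial-credit analogue of average precision for
# replay-point retrieval. Not the official mGAP definition (see [3]).

def gap(ranked_starts, relevant_starts, tolerance=300.0):
    """ranked_starts: system replay points (seconds), best first.
    relevant_starts: assessor-identified start times (seconds), non-empty."""
    credit_sum, weighted_sum = 0.0, 0.0
    for rank, t in enumerate(ranked_starts, start=1):
        dist = min(abs(t - r) for r in relevant_starts)
        # Full credit at the true start, none beyond the tolerance (assumed).
        credit = max(0.0, 1.0 - dist / tolerance)
        credit_sum += credit
        weighted_sum += credit * (credit_sum / rank)
    return weighted_sum / len(relevant_starts)
```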
Indexing the ak field, alone or in combination with asr, proved not to be helpful (although the apparent reduction when indexed together is not statistically significant (p > 0.05)). Manual examination of a few ak fields indeed indicates a low density of terms that appear as if they match the content of the passage, but additional analysis will be needed before we can ascribe blame between the transcription, classification and translation stages in the cascade that produced those keyword assignments. We therefore focus on results obtained using the asr field alone for the remainder of our analysis.

[Figure 1: GAP by topic, asr field, long queries. Two panels of per-topic bars comparing the word, stem, and lemma runs.]
It is apparent that some form of linguistic preprocessing is indeed crucial for Czech. Both lemmatization and stemming boosted the performance almost by a factor of two in comparison with the word runs, and a Wilcoxon signed-rank test shows that difference to be statistically significant (p < 0.005). The slight apparent advantage of the lemma run over the stem run is not statistically significant (p > 0.05).
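Such a per-topic significance test can be reproduced with SciPy's implementation of the Wilcoxon signed-rank test; the paired GAP arrays below are placeholders, not our actual per-topic scores.

```python
# Paired significance test over per-topic GAP scores.
from scipy.stats import wilcoxon

word_gap  = [0.01, 0.00, 0.05, 0.02, 0.00, 0.03]   # placeholder values
lemma_gap = [0.04, 0.01, 0.09, 0.05, 0.02, 0.06]   # placeholder values

stat, p = wilcoxon(word_gap, lemma_gap)
print(f"Wilcoxon statistic={stat}, p-value={p:.4f}")
```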
As Figure 1 shows, substantial variation in GAP is evident across topics. The four topics with the highest GAP values (1225, 1630, 2198, 3014) each contain highly discriminative terms that were correctly transcribed. Topic 1630 exhibits an enormous difference between word matching and matching either stems or lemmas, a vivid reminder of how the recall-enhancing effect of linguistic analysis can dominate averaged measures (a similar effect is also apparent for topic 1310). While a few cases of adverse effects from linguistic analysis are visible (most notably with topics 1225 and 1181), these effects are generally relatively small. The occasional differences between stems and lemmas suggest that combining evidence from both might help in some cases.
Unsuccessful topics generally either asked about abstract concepts without using many discriminative terms (e.g., topic 1288: "strengthening faith during the Holocaust"), or the discriminative terms for the topic happened to be missing from the collection. For example, topic 3018 contained a single discriminative term that was simply spelled differently in the ASR lexicon (and consequently in the transcripts). Manually conforming the spelling in the topic to that found in the lexicon would have increased the GAP for that topic (with lemma) from 0.0026 to 0.1175.
Interestingly, it turned out that every term that we (manually) judged to be highly discriminating in our analysis of successful and unsuccessful topics was a named entity (NE). This prompted us to perform a more systematic analysis of the vocabulary coverage for the NEs present in all 29 topics. If we leave out the NEs that are widespread in the collection and thus useless for IR (Jew, Holocaust, Hitler, etc.), there are 42 NEs in the topic set; only 13 of them are present in the ASR lexicon, only 11 of those 13 actually appeared anywhere in the transcripts, and only 5 of those 11 substantially contributed to successful IR (or, if we manually conform the spelling in topic 3018, 6 of 12). The overall "query rare named entity error rate" for this collection is therefore (42-5)/42 = 88%, more than double the overall word error rate.
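The coverage arithmetic reduces to a few counts; the snippet below restates it (the counts are from our analysis, the variable names are merely illustrative).

```python
# Named-entity coverage funnel for the 29-topic set.
topic_nes        = 42   # discriminative NEs across all topics
in_asr_lexicon   = 13   # of those, present in the ASR lexicon
in_transcripts   = 11   # of those, actually appearing in the transcripts
helped_retrieval = 5    # of those, substantially contributing to IR

ne_error_rate = (topic_nes - helped_retrieval) / topic_nes
print(f"query rare named entity error rate: {ne_error_rate:.0%}")  # 88%
```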
Rare NEs are quite naturally not well represented in the materials from which ASR systems are trained; integrating phone-lattice term detection with large-vocabulary recognition offers one promising research direction. Inconsistent spelling is probably the more easily rectified problem; annotators of ASR training materials are typically not domain experts, and in some cases valid alternate transliterations (e.g., from Yiddish roots) result in disagreement even among experts. One useful approach would be to adjust the topics to conform to the ASR lexicon, thus simulating a similar process an interactive searcher could perform if notified that one of their query terms is outside the known vocabulary.
4. NEXT STEPS
In addition to the ideas above for dealing with rare terms, another obvious next step would be to optimize our system design to better reflect the task characteristics that motivated the design of the mean GAP measure. We have shown that passage retrieval can indeed sometimes get us in the right neighborhood, but overlapping passages may not be the best way of identifying optimal replay start times. Another question that we need to explore is whether some other retrieval model might be more effective. Finally, extending our work to the far larger CLEF 2007 Czech news test collection will allow us to enrich our comparison between lemmas and stems for Czech indexing.
5. ACKNOWLEDGMENTS
This work was supported in part by projects MSMT LC536,
GACR 1ET101470416 and NSF IIS-0122466.
6. REFERENCES
[1] J. Hajič. Disambiguation of Rich Inflection (Computational Morphology of Czech). Karolinum, Prague, 2004.
[2] C. Murray et al. Leveraging Reusability: Cost-effective Lexical Acquisition for Large-scale Ontology Translation. In Proceedings of ACL 2006, pages 945-952, Sydney, Australia, 2006.
[3] D. Oard et al. Overview of the CLEF-2006 Cross-Language Speech Retrieval Track. In CLEF 2006: Revised Selected Papers, Springer LNCS, 2007.
[4] S. Olsson, D. Oard, and J. Hajič. Cross-Language Text Classification. In Proceedings of SIGIR 2005, pages 645-646, Salvador, Brazil, 2005.