The scope of the current thesis lies within the Natural Language Understanding sub-field of Natural Language Processing. From the many possible tasks in this domain, we focused on Discourse Analysis. We analyzed the main existing approaches in this field and identified the flaws of each. Starting from them, we proposed an adaptation of an existing framework (the Polyphonic framework) using ideas derived from the theory of a well-known linguist (Tannen) regarding the importance of repetition in discourse. After presenting our adaptation, we showed how it would solve most of the problems indicated for the other approaches. To verify the effectiveness of the adapted framework, we presented several developed applications meant to demonstrate its utility for discourse visualization, for the identification and classification of the important moments of a discourse, for the assessment of chat conversations based on repetition and rhythmicity, for malapropism detection and correction, and for text recovery.
Extraction of Socio-Semantic Data from Chat Conversations in Collaborative Learning Communities, by Traian Rebedea
The document summarizes research on extracting socio-semantic data from chat conversations in collaborative learning communities. The goals are to automatically determine relationships between utterances, assess learners' competencies, and visualize the conversation graph. Key techniques include detecting topics, discovering implicit references between utterances, and representing the conversation as a directed acyclic graph to identify important utterances and discussion threads. The work integrates ideas from sociocultural learning theory, natural language processing, and machine learning.
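As a minimal sketch of that graph representation (the utterance ids and reference pairs below are invented, and networkx merely stands in for whatever graph machinery the original work used), utterances become nodes, references become edges pointing back in time, and in-degree gives a crude importance score:

```python
# Sketch: a conversation as a directed acyclic graph (assumed input format).
import networkx as nx

# Hypothetical data: (utterance id, id of the earlier utterance it refers to).
references = [(2, 1), (3, 1), (4, 2), (5, 4), (6, 4), (7, 5)]

graph = nx.DiGraph()
graph.add_edges_from((src, dst) for src, dst in references)

assert nx.is_directed_acyclic_graph(graph)  # replies only point backwards in time

# A crude importance measure: how many later utterances point at each one.
importance = {node: graph.in_degree(node) for node in graph.nodes}
threads = list(nx.weakly_connected_components(graph))  # discussion threads
print(sorted(importance.items(), key=lambda kv: -kv[1]))
print(threads)
```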
Discourse analysis involves analyzing language in its social context. It analyzes real texts, not artificial ones, and looks at utterances rather than isolated sentences. There are several approaches to discourse analysis, including sociology, ethnography, variation theory, and systemic functional linguistics. Spoken and written discourse differ in aspects like lexical density, grammar use, and repetition of words. Corpus linguistics uses large text databases to quantitatively and qualitatively analyze patterns of language use and variation in discourse. Discourse analysis can inform language pedagogy by helping teachers delineate genres, explain text features, evaluate student performance, and teach discourse structures.
Shallow parsing is a technique that divides text such as sentences into constituent parts and describes the syntactic relationships between those parts, but does not fully analyze internal structure or function. It aims to infer as much structure as possible from morphological and word order information. Typical modules include part-of-speech tagging, chunking of phrases, and relation finding between chunks. Shallow parsers are useful for processing large texts and are more robust to noise than deep parsers.
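As one possible concrete rendering of that pipeline (NLTK is our choice here, and the chunk grammar is an illustrative assumption rather than a prescribed one), part-of-speech tagging followed by regular-expression chunking looks like this:

```python
# Sketch of a shallow-parsing pipeline: POS tagging, then NP chunking.
import nltk

# One-time model downloads (resource names may differ across NLTK versions).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)  # [('The', 'DT'), ('quick', 'JJ'), ...]

# An illustrative chunk grammar: optional determiner, adjectives, then nouns.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)
tree.pretty_print()  # chunks appear as NP subtrees; the rest stays flat
```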
This document discusses various approaches to discourse analysis, including:
1) Speech act theory, which views language as performing actions and analyzes utterances based on their illocutionary force.
2) Interactional sociolinguistics, which examines how context influences the production and interpretation of discourse through cues like intonation.
3) Ethnography of communication, which analyzes speech events within their social and cultural contexts using Hymes' SPEAKING framework.
4) Conversation analysis, which identifies turn-taking and adjacency pairs as fundamental units and examines how conversation is achieved interactively.
This document provides an introduction to critical applied linguistics. It begins by defining critical applied linguistics as a critical approach to applied linguistics. It then outlines some key concerns of critical applied linguistics, including relating the micro-level of language to the macro-level of society, understanding the relationship between theory and practice, and different understandings of what it means to be "critical." The introduction discusses critical applied linguistics as a constant questioning of assumptions within applied linguistics from a perspective focused on social inequality.
This document provides an overview of discourse analysis including definitions, approaches, and how it relates to other fields. It defines discourse analysis as the study of language use beyond the sentence level, including how language functions in social and cultural contexts. Three main approaches are discussed: speech act theory which examines communicative acts, ethnography of communication which analyzes patterns of communication in cultures, and pragmatics which studies how context informs meaning. The document also explains how discourse analysis relates to other fields like sociolinguistics, psycholinguistics, and pragmatics through their shared interests but different data sources.
1. The document provides an overview of discourse analysis, which is the study of language use in context. It discusses the historical development of the field and various approaches to analyzing both spoken and written discourse.
2. Key aspects covered include speech acts, discourse structures, models for analyzing classroom conversations and casual talk, cohesion in written texts, interpretation of meaning, and patterns in larger text structures.
3. Discourse analysis examines both form and function in language and how language is used for social purposes. It draws from various related fields and has applications for language teaching.
The document discusses textual relations in the Quran from the perspectives of early Muslim scholars and modern linguistic theories of coherence and relevance. It covers Zarkashi's theory of explicit and implicit relations between verses. It also examines relations between verses through techniques like parallelism, subject shifts, and invocations. Inter-surah relations are discussed through the lens of the opening surah Al-Fatiha and connections between the four long surahs. The conclusion reiterates that the Quran exhibits organic unity through various textual relations.
Factors Responsible for Poor English Reading Comprehension at Secondary Level, by Bahram Kazemian
The present study examines the factors responsible for poor English reading comprehension among secondary school students. The purpose of this study is to explore those factors and to suggest remedies for strengthening students' English reading comprehension. English is the second language of Pakistani students, and Kachru (1996) places it in the outer circle. Tests and interviews were conducted to collect the data. Factors such as a poor command of vocabulary, the habit of cramming, and a lack of interest in reading creatively (the sole goal being simply to pass the examination) were found responsible for poor English reading comprehension. Motivation to read can develop students' reading comprehension skills.
The outlined approach allows a common philosophical viewpoint on the physical world, language, and some mathematical structures, therefore calling for the universe to be understood as a joint physical, linguistic, and mathematical universum, in which physical motion and metaphor are one and the same rather than merely similar in a sense.
The document discusses several theories of second language acquisition (SLA) including behaviorism, acculturation, universal grammar hypothesis, comprehension hypothesis, interaction hypothesis, output hypothesis, sociocultural theory, and connectionism. It argues that previous SLA theories should not be disregarded but viewed as explanations of parts of the whole acquisition process. Finally, it claims that SLA should be seen as a chaotic/complex system based on principles of emergence and complexity theory.
This document discusses discourse analysis and vocabulary. It summarizes Halliday and Hasan's description of lexical cohesion, which refers to related vocabulary items occurring across clause and sentence boundaries to create coherence. There are two principal kinds of lexical cohesion: reiteration, which restates an item through repetition, synonymy or hyponymy; and collocation, the probability that lexical items will co-occur. The document also discusses how speakers reiterate vocabulary in conversation through relexicalisation and how vocabulary helps organize texts into predictable patterns.
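As a toy illustration of the reiteration side of lexical cohesion, under the simplifying assumption that repetition can be approximated by shared lowercase words recurring across sentence boundaries (synonymy and hyponymy would require a lexical resource such as WordNet):

```python
# Sketch: detect reiteration (lexical repetition) across sentence boundaries.
import re
from collections import defaultdict

text = (
    "The committee rejected the proposal. "
    "A revised proposal will be submitted next week. "
    "Committee members remain skeptical."
)

sentences = re.split(r"(?<=[.!?])\s+", text.strip())
positions = defaultdict(list)
for i, sentence in enumerate(sentences):
    for word in re.findall(r"[a-z]+", sentence.lower()):
        positions[word].append(i)

# Words recurring in more than one sentence form simple cohesive ties.
ties = {w: s for w, s in positions.items() if len(set(s)) > 1 and len(w) > 3}
print(ties)  # e.g. {'committee': [0, 2], 'proposal': [0, 1]}
```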
Metadiscourse refers to discourse about discourse that helps guide a discussion. It includes words and phrases used to discuss the structure and purpose of a text, as well as comments on ideas and the reader's understanding. Metadiscourse serves as formative evaluation that helps assess progress and plan future directions for knowledge building communities. It reveals the writer's awareness of the reader's needs and can help students advance ideas, set goals, and connect knowledge.
The document discusses discourse analysis and related linguistic concepts. It defines discourse as language above the sentence level, including stretches of spoken language that are coherent and meaningful. It describes two approaches to analyzing discourse: structural, which looks at grammatical relationships between units, and functional, which examines how language performs different social functions. Recent approaches view discourse as a social practice shaped by and having implications for social structures. The document also discusses speech act theory, which proposes that utterances in dialogue perform actions, such as asking a question or making a promise.
This document discusses discourse analysis and vocabulary. It explains that discourse-organizing words help signal larger textual patterns and parcel up phrases and sentences. Examples of discourse patterns include problem-solution, claim-counterclaim, and doubt/uncertainty. Register and idioms also help organize discourse. Modality expresses certainty, possibility, volition, permission and obligation and conveys stance. Studying vocabulary in discourse looks at patterns across clauses/sentences and how certain words organize structure and register. Collecting vocabulary along discourse-functional lines can motivate word lists beyond traditional semantic fields.
The document describes a computational model of psycholinguistics called INSOMNet that incrementally constructs explicit semantic representations as natural language is processed. The model scales up to large corpora while demonstrating on-line behaviors observed in human sentence comprehension like anticipating upcoming words based on linguistic and visual context cues. Simulations show the model can accurately interpret sentences when provided a matching visual scene and begin disambiguating ambiguities early.
Conversation analysis is a research tradition that examines recorded, naturally occurring conversations to understand how participants organize turn-taking and negotiate relationships. It believes interaction determines social dynamics. Researchers analyze transcripts of audio/video recordings without hypotheses, focusing on patterns across contexts. The goal is describing competencies that enable intelligible social interaction. Reports provide context, describe phenomena through examples from data, and interpret underlying organizational patterns.
This document discusses discourse analysis and its relationship to culture and pragmatics. It defines pragmatics as the study of contextual meaning and discourse as spoken or written language use within a context. Discourse analysis investigates the form and function of language and how it relates to society. Culture is depicted through language and influences pragmatics through cultural schemata and cross-cultural differences in interactive strategies between languages. Discourse analysis focuses on how language is used in context and what speakers intend through assumptions of coherence and background knowledge.
Bell proposes including translation theory within applied linguistics, specifically within human communication. He considers a translator to be a communicator who processes both information and texts, requiring procedural and factual knowledge. According to Bell, a translator should possess communicative competence, including grammatical, sociolinguistic, discourse, and strategic abilities. A translator also needs linguistic competence in both the source and target languages as well as communicative competence in both cultures. Bell outlines steps for translation involving analysis of syntax, semantics, and pragmatics, followed by synthesis of pragmatics, semantics, and syntax.
This document outlines areas of research in translation studies, including text analysis and translation quality assessment, genre translation, multimedia translation, translation history, and the translation process. It discusses both conceptual and empirical research. Empirical research uses methodology like quantitative and qualitative methods, case studies, corpus studies, text analysis, and interviews. Research questions can be exploratory to understand what is happening, or descriptive to analyze translations and understand patterns. Hypotheses are used if researchers want to generalize findings.
Discourse analysis is the study of the relationship between language and context. It examines both the form and the function of language. Form refers to structure and appearance, realized through grammatical tools like pronouns, determiners, conjunctions, prepositions, and auxiliary verbs, while function looks at what language is used to do. Speech act theory considers what language is doing and how listeners are supposed to react. Discourse analysts are interested in both spoken and written interactions, and in how teaching materials and classroom language are structured. Models of spoken discourse analysis examine conversation patterns both in and out of classroom settings. Written discourse allows more time for composition compared to spontaneous speech.
Pragmatics is the study of how context contributes to meaning in language. It includes speech act theory, conversational implicature, and other approaches to understanding language use. Pragmatics examines how the actual meaning of an utterance is understood based on the context, including who is speaking, their shared knowledge and assumptions, and conversational implicatures, which are meanings implied but not directly stated. Conversation analysis is one approach used in pragmatics to study how participants construct turns in conversation and repair problems. Discourse refers to an instance of language that can be classified based on factors like grammar, lexicon, themes, and the knowledge framework of the listener. An implicature is any inferred meaning from an utterance that is not essential to
The document summarizes Roger Bell's work on translation theory and the abilities of a translator. It discusses Bell's view that a translator is a communicator who processes information and texts. Bell proposes including translation theory in applied linguistics as part of human communication. According to Bell, a translator must have communicative competence including grammatical, sociolinguistic, discourse, and strategic competence. A translator also needs linguistic competence in both the source and target languages as well as communicative competence in both cultures. The document outlines Bell's model of the cognitive process of translation and the steps of analysis, synthesis, and processing texts.
The document discusses the role of a translator as a mediator between two cultures. It explains that a translator must have in-depth knowledge of both the source and target cultures in order to adequately translate texts while accounting for cultural codes and differences. Specifically, the document states that without knowledge of cultural codes, it is better not to translate at all. It also outlines four key abilities - abstraction, decision, transfer, and criticism - that translators must develop to translate effectively.
Crafting Astrological Advertisements in Pakistan: A Systemic Functional Analysis, by Azam Almubarki
The document analyzes three astrological advertisements from Pakistani newspapers using Systemic Functional Linguistics. It examines how the advertisers use language to persuade readers by invoking superstitious beliefs and claiming to have solutions for problems. The analysis explores the three metafunctions of meaning - textual, interpersonal, and ideational. Key aspects analyzed include themes, modality, and language choices that enact social roles between the advertisers and readers.
The document describes a tool for discourse analysis and visualization that was developed to analyze different types of discourses. The tool combines cognitive and socio-cultural paradigms using the concept of voice from polyphony theory. It identifies important voices in a text through lexical chains and displays the discourse through different views including word-level representations that show voice frequency and distribution, sentence-level representations that show voice distribution across sentences, and identification of pivotal moments where voices intersect. The tool was evaluated on collaborative learning chats and showed potential to accurately assess discussion quality and compare different discourses.
Lecture 1st-Introduction to Discourse Analysis._023928.pptx, by Google
Introduction to discourse analysis
What is discourse?
What is discourse Analysis?
Paradigms in linguistics
Cohesion and Coherence
Types of written discourse
Types of spoken discourse
Text and discourse
Scope of discourse analysis
Models of Parsing: Two-Stage Models
Models of Parsing: Constraint-Based Models
Story context effects
Subcategory frequency effects
Cross-linguistic frequency data
Semantic effects
Prosody
Visual context effects
Interim Summary
Argument Structure Hypothesis
Limitations, Criticisms, and Some Alternative Parsing Theories
Construal
Race-based parsing
Good-enough parsing
Parsing Long-Distance Dependencies
Summary and Conclusions
Test Yourself
When people speak, they produce sequences of words. When people listen or read, they also deal with sequences of words. Speakers systematically organize those sequences of words into phrases, clauses, and sentences.
The study of syntax involves discovering the cues that languages provide that show how words in sentences relate to one another.
The study of syntactic parsing involves discovering how comprehenders use those cues to determine how words in sentences relate to one another during the process of interpreting sentences.
Parsing means breaking down a sentence into its component parts so that the meaning of the sentence can be understood.
These component parts can be word categories (nouns, pronouns, verbs, adjectives, etc.) or other elements such as verb tense (present, past, future).
In a phrase structure tree, the labels, like NP, VP, and S, are called nodes and the connections between the different nodes form branches.
The patterns of nodes and branches show how the words in the sentence are grouped together to form phrases and clauses.
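As a small, hedged illustration of this node-and-branch vocabulary, NLTK can build and display a phrase structure tree from bracketed notation; the sentence and its bracketing below are our own example:

```python
# Sketch: a phrase structure tree with S, NP, and VP nodes.
from nltk import Tree

# Bracketed notation: each (LABEL ...) pair is a node; nesting forms branches.
tree = Tree.fromstring(
    "(S (NP (DT the) (NN dog)) (VP (VBD chased) (NP (DT a) (NN cat))))"
)

tree.pretty_print()   # draws the nodes and branches as ASCII art
print(tree.label())   # 'S' is the root node
print(tree.leaves())  # the words: ['the', 'dog', 'chased', 'a', 'cat']
```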
We all do our research and put effort into making a clear and accurate presentation, but I'd be glad if this could help, especially those majoring in English and the like. Good luck!
Proper credit would be appreciated.
• Jay-ar A. Padernal, BSEd Major in English, University of Mindanao
Discourse analysis examines language use beyond the sentence level and how language is used in social contexts, while text analysis focuses on formal linguistic cohesive devices within written texts. Some researchers use the terms interchangeably, but most agree the distinction is unclear. Discourse analysis is broader in investigating language in use with consideration of context, while text analysis concentrates on linguistic features linking sentences. The field would benefit from abandoning the term "text" in favor of discourse analysis to avoid confusion.
The document presents a holistic model of language that incorporates four main theories: formal, functional, systemic, and relativistic models. It argues that considering language from only one of these angles provides an incomplete picture. The proposed holistic model incorporates insights from all four theories and suggests that excluding any single theory would render the model incomplete. The document reviews the key aspects and components of each theory and proposes that taking a multi-angled perspective that includes formalism, functionalism, systemic approaches, and relativism can lead to a more integrated understanding of language.
Discourse analysis considers language use beyond the sentence level and in its full social context. It examines how texts are structured through cohesion and coherence. Cohesion refers to linguistic connections between parts of a text, while coherence is the meaningful unity created in the reader's mind. Discourse analysis also looks at spoken and written styles, genres, and conversation structure through phenomena like turn-taking, adjacency pairs, and back-channeling. Background knowledge and expectations also influence how a text is understood.
The document discusses key concepts and definitions in discourse analysis including:
- Reality is constructed through language and discourse shapes how we understand the world.
- Discourse analysis examines patterns in language use and how these patterns maintain and transform understandings of social realities.
- A discourse analyst's toolkit includes analyzing deixis, fillers, reframing language to identify taken-for-granted assumptions, and cognitive linguistics concepts like figure-ground asymmetries.
Discourse Analysis Weeks 1, 2, 3 and 4.pdf, by AmadStrongman
This document provides an introduction to the course "Introduction to Discourse Analysis" taught by Abdelmalek El Kadoussi. It discusses key topics that will be covered in the course including defining discourse and discourse analysis, examining language use in context, relationships between discourse and knowledge/society/genres/conversation, and approaches like critical discourse analysis. The course outline lists the weekly topics to be covered over 12 weeks. It emphasizes that discourse analysis considers how language varies based on factors like subject area, social context, culture, and participant identities.
A Text Analysis Of A Newspaper Article About Konglish Taken From The Korea Herald, by Lori Moore
The document analyzes a newspaper article about Konglish (Korean-influenced English) from The Korea Herald. It follows an overall general-specific pattern. This is signaled by words like "recently" and "increasing amount of attention" in the introduction, establishing a general context. The body then provides more specific details about the study of written discourse and textual patterns. It concludes with another general statement about the implications for language teaching. Within this overall pattern, subordinate patterns include problem-solution, as signaled by words such as "problem" and "solution" when discussing analyzing textual patterns.
This document discusses the difference between form and function in discourse analysis. Form refers to syntactic structure like words and sentences, while function refers to the purpose words and structures serve. While form and structure can predict function, context is also important, as the same form can take on different functions. Two approaches to discourse analysis are described: structural, which looks at linguistic units and their relationships, and functional, which analyzes language use and Jakobson's six functions of language.
The document summarizes Norman Fairclough's dialectical-relational approach to critical discourse analysis (CDA). It outlines Fairclough's three-dimensional framework for analyzing discourse as text, discursive practice, and social practice. For each dimension, Fairclough proposes specific analytical categories and concepts, including textual analysis of vocabulary, grammar, cohesion and structure; discursive analysis of utterance force, text coherence and intertextuality; and social analysis of the relationship between discourse and power/ideology. The document provides an overview of Fairclough's influential work developing CDA and his dialectical theory of discourse.
Semiotics and conceptual modeling gv 2015, by Guido Vetere
- Conceptual modeling in computer science often uses concepts that require interpretation, such as linguistic concepts, which challenges the model-theoretic semantics approach of formal logic.
- Semiotics, as the study of signs and their interpretation, can provide a theoretical foundation for more formally and transparently addressing interpretation in conceptual models.
- Ongoing research explores applying semiotic perspectives to linking ontologies and lexical resources to systematically represent the relationships between concepts, senses, and interpretations.
This document discusses sentence comprehension and some of the key theories about how it works. It defines sentence comprehension as understanding the meaning derived from words based on linguistic structures and constraints. Some important factors in comprehension are grammatical roles, sentence structure, and identifying constituents. Sentence comprehension must deal with ambiguities. Theories discuss modular vs. interactive processing, serial vs. parallel construction of interpretations, and models like the Garden Path model and constraint-based models that integrate probabilistic information.
This document discusses different perspectives on analyzing discourse. It argues that discourse is best analyzed as a process rather than a structured entity. It proposes that procedural pragmatics, which aims to operationalize cognitive pragmatics, can provide a model for tracking the step-by-step processes of contextualization that underlie discourse interpretation. Discourse can be viewed as the dynamic modification of representations through successive utterances, rather than as a singular object with its own structural properties.
Systemic Functional Linguistics: An approach to analyzing written academic discourse, by Clément Ndoricimpa
Written academic discourse refers to the ways of thinking and using language that exist in the academy. Writers demonstrate knowledge and negotiate social relations with readers by means of written discourse. In order to understand these characteristics of written discourse, different approaches are followed. Some follow a linguistic approach to uncover the linguistic devices associated with coherence in a written text. Others follow a social approach to analyze the socio-cultural context in which a written text occurs. However, it has been demonstrated that the linguistic and the socio-cultural elements in a written text cannot be disassociated and that an approach combining the two is required. Such an approach is Systemic Functional Linguistics (SFL). Therefore, this paper discusses the way in which SFL is used as an approach to analyzing the linguistic features of academic discourses and how those features relate to the socio-cultural context. It is shown that SFL provides the means to analyze not only the linguistic resources employed in a written text but also the context in which the text is used. These linguistic resources are associated with the creation of ideational, interpersonal, and textual meaning at the levels of lexicogrammar and discourse semantics. The context is modelled through register and genre theory.
- The document describes using time series analysis models like ARIMA to forecast daily sales quantities of products like paintings for an online retailer.
- The best model was found to be an ARIMA(7,0,2) model (a fitting sketch follows after this list), which uses the previous 7 days' values to predict future values without differencing the data.
- This model provided more accurate predictions than the Facebook Prophet model based on error metrics, while converging during both training and testing.
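Below is a minimal fitting sketch with statsmodels, assuming a daily sales series is already available as an array; the (7, 0, 2) order comes from the summary above, while the synthetic series and the two-week horizon are purely illustrative:

```python
# Sketch: fit an ARIMA(7,0,2) model to a daily sales series and forecast ahead.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
# Stand-in for the real daily sales quantities (a weekly pattern plus noise).
sales = 50 + 10 * np.sin(np.arange(365) * 2 * np.pi / 7) + rng.normal(0, 3, 365)

# order=(7, 0, 2): 7 autoregressive lags, no differencing, 2 moving-average lags.
model = ARIMA(sales, order=(7, 0, 2))
fitted = model.fit()

forecast = fitted.forecast(steps=14)  # predict the next two weeks
print(fitted.aic, forecast[:3])
```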
The document describes improvements made to an existing application used to identify important moments in student collaborative chats. The improvements include: 1) Implementing a redirection system to analyze utterance timestamps to identify intense discussion periods, 2) Overlapping graphics to correlate concepts with disputed chat parts to identify more important concepts, 3) Increasing availability by creating a web application and avoiding user intervention for moment detection. The improved application can better identify important moments by considering both concept distribution and dialogue intensity over time.
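A small sketch of the timestamp-based redirection idea, under the assumption that utterance times are available in seconds and that "intense" simply means many utterances inside a sliding window; the window size and threshold below are invented:

```python
# Sketch: flag intense discussion periods from utterance timestamps.
def intense_periods(timestamps, window=60.0, min_utterances=8):
    """Return (start, end) windows containing at least min_utterances messages."""
    timestamps = sorted(timestamps)
    periods = []
    for i, start in enumerate(timestamps):
        # Count utterances falling inside [start, start + window).
        count = sum(1 for t in timestamps[i:] if t < start + window)
        if count >= min_utterances:
            periods.append((start, start + window))
    return periods

# Hypothetical chat: a burst of messages around t=300s.
times = [10, 95, 180, 300, 305, 308, 312, 315, 319, 322, 330, 340, 600]
print(intense_periods(times, window=60, min_utterances=8))
```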
The document summarizes a research paper on developing digital services to emphasize pollution phenomena using statistics and time series analysis. The paper, presented at the 8th International Conference on Exploring Services Science, extracts concepts related to pollution from the literature, analyzes the frequency of concepts over time, and identifies peaks that correspond to pollution events. It finds that awareness of pollution threats increased in the late 1960s and discusses limitations such as delays in reporting events and the difficulty of identifying all factors influencing the time series. The methodology could be improved by better distinguishing yearly events and developing predictive models.
These slides present an application for identifying English words whose use is cyclic or regularly varies in time. The purpose of the developed application was to build a cross-platform system for indexing and analyzing the graphs of words usage over time. For words indexing, we used the data provided by the Google Books N-grams Corpus, which was afterwards filtered using the WordNet lexical database. For identifying the cyclic or regularly varying words, we used two different algorithms: autocorrelation and dynamic time warping. The results of the analysis can be visualized using a web interface. The application also offers the possibility to view the evolution of the use frequency of different words in time.
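A sketch of the autocorrelation test for cyclic usage (the dynamic time warping alternative would need an extra library, e.g. dtaidistance): a series that correlates strongly with a lagged copy of itself suggests a cycle of that lag. The yearly usage series here is synthetic:

```python
# Sketch: detect cyclic word usage with autocorrelation.
import numpy as np

def autocorrelation(series, lag):
    """Pearson correlation between the series and itself shifted by `lag`."""
    a, b = series[:-lag], series[lag:]
    return np.corrcoef(a, b)[0, 1]

# Synthetic stand-in for a word's yearly usage frequency with a 10-year cycle.
years = np.arange(1900, 2000)
usage = np.sin(years * 2 * np.pi / 10) + np.random.default_rng(1).normal(0, 0.2, 100)

# A high correlation at some lag > 0 points to a cycle of that length.
best_lag = max(range(2, 50), key=lambda k: autocorrelation(usage, k))
print(best_lag, autocorrelation(usage, best_lag))  # expect near 10, or a multiple
```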
These slides present an application designed to analyze news articles from Romanian mass media and extract opinions about political entities relevant to the major political stage. The application was created with the desire to study media polarization around important political events, such as legislative or presidential elections. The application uses different crawlers to extract the data from online newspapers and save it in the database. Then, it uses several Machine Learning techniques for identifying and classifying opinions about given entities over a long span of time. Based on this classification, it generates reports and charts that could be used not only to study political polarization, but also to identify partisan media.
Language is a living corpus, with words tending to be created or to disappear over time. Even the degree of certain words' usage tends to fluctuate due to historical events, cultural movements or scientific discoveries. The changes in the language are reflected in written texts and thus, by tracking them, one can determine the moment when these texts were written. In this paper, we present an application that uses time series analysis built on top of the Google Books N-gram corpus to determine the time period during which a text was written. The application is based on word fingerprinting, to find the time interval when each word was most probably used, and on word importance for the given text. Combining the fingerprints of all the text's words according to their importance allows the time stamping of that text.
These slides address the issue of predicting the reselling price of cars based on ads extracted from popular car-reselling websites. To obtain the most accurate predictions, we used two machine learning algorithms (multiple linear regression and random forest) to build multiple models that reflect the importance of different combinations of features in the final price of the cars. The predictions are generated based on the models trained on the ads extracted from such sites. The developed system provides the user with an interface that allows navigation through ads to assess the fairness of prices compared to the predicted ones.
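A hedged sketch of the two-model comparison with scikit-learn; the feature set (age, mileage, engine size) and the synthetic prices are invented for illustration, since the summary names the algorithms but not the exact features:

```python
# Sketch: compare linear regression and random forest on car-price data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
# Invented features: age (years), mileage (km), engine size (cm^3).
X = np.column_stack([
    rng.uniform(0, 15, n),
    rng.uniform(0, 250_000, n),
    rng.uniform(1000, 3000, n),
])
# Synthetic price: newer, low-mileage, bigger-engine cars cost more.
y = 20_000 - 800 * X[:, 0] - 0.03 * X[:, 1] + 2 * X[:, 2] + rng.normal(0, 500, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_train, y_train)
    err = mean_absolute_error(y_test, model.predict(X_test))
    print(type(model).__name__, round(err, 1))
```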
These slides address the problem of capturing, processing and analyzing images from the video stream of the Hearthstone game in order to obtain relevant information on the conduct of matches in this game. Since the information needs to be presented to the user in real time, we needed to find the most suitable methods of extracting it. Therefore, techniques such as background subtraction, histogram comparisons, key point matching, and optical character recognition were investigated. Driven by the required processing speed, we ended up using optical character recognition on limited areas of interest from the captured image. After developing the application, we tested it in a real-world context, while real games were played, and presented the obtained results. In the end, we also provided two examples where the application would prove useful for better decision making during the game.
These slides present Movie Recommender, a system which provides movie recommendations based on the information known about the users. These recommendations are made using the analysis of the users' psychological profile, their watching history and the movies' scores from other websites. They are based on an aggregate similarity calculation. The system uses both collaborative filtering and content filtering (using an approach based on different features of the movies from the database). Although there are similar applications available, they tend to ignore the data specific to the user, which in our opinion is essential for his/her behavior.
Language suffers an everlasting process of change, both at a semantic level, where existing words acquire new meanings, and at a lexical level, where new concepts appear and old ones disappear or are used less frequently. New words (terms/concepts) may be added as a result of scientific discoveries or socio-cultural influences, while other words are "forgotten" or are assigned alternative meanings. These changes in a vocabulary usually characterize important shifts in the environment or the domain they are used in. For experts there is an evident connection between a new concept and some of the existing ones, but for regular people these relations remain hidden and need to be identified. In particular, in the medical domain new terms appear as a result of new discoveries, and it becomes an important challenge to establish the connections between different concepts. Moreover, it is important to detect whether such a relation even exists. In this paper, we present a graph-based approach to identify the semantic path (a chain of semantically related words) between the concepts that appeared in the bio-medicine publications available in the PubMed corpus over a time period of 20 years.
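A toy sketch of the graph-based idea with networkx, where a small hand-made graph stands in for the relatedness links mined from PubMed and the semantic path is simply a shortest path between two concept nodes:

```python
# Sketch: find a semantic path between two concepts in a word-relation graph.
import networkx as nx

graph = nx.Graph()
# Toy stand-in for semantic relatedness links mined from a corpus.
graph.add_edges_from([
    ("aspirin", "inflammation"),
    ("inflammation", "cytokine"),
    ("cytokine", "interleukin-6"),
    ("aspirin", "platelet"),
    ("platelet", "thrombosis"),
])

# The semantic path is the chain of related words between two concepts,
# provided one exists at all.
if nx.has_path(graph, "aspirin", "interleukin-6"):
    print(nx.shortest_path(graph, "aspirin", "interleukin-6"))
    # ['aspirin', 'inflammation', 'cytokine', 'interleukin-6']
```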
Public data can be considered a large and important source of data that can be used for different purposes. In this paper we present a method for collecting and analyzing data within urban settlements. For a more focused analysis and the gathering of a large amount of data, we considered a case study of Bucharest. The main purpose of this analysis is to pick up important information about different streets, points of interest, details about urban planning, etc., with the goal of facilitating a quick and correct evaluation of specific areas and identifying suitable locations for adding new points of interest. The prediction of suitable locations involves using heuristics and data mining techniques such as clustering algorithms and association rules.
These slides present an application for identifying archaisms and neologisms in texts. The application also provides the ability to view graphically the evolution trends of these words for a better interpretation of the results. The presented solution consists of two phases: the learning phase, in which we identify the general evolution trends of three categories of words (archaisms, neologisms and common words), and the classification phase, in which we label new words with their corresponding category. For both phases, the application requires Internet access because it uses the Google Books N-gram Viewer to generate the images that back up the decisions.
These slides present an automatic system used for the evaluation of the Bachelor and Master theses of Computer Science students. In order to fulfill this task, we used text complexity measures along with other factors to evaluate the students' theses. Text complexity has mainly been used to predict the grade level to which a specific reading passage or text should be assigned. It has also been used in evaluating students' writings in language classes. We decided to try to use text complexity measures for evaluating students' graduation theses. The main challenges of this task are to select the best features that accurately reflect a student's performance in a specific domain, and to identify the optimal classifier to predict the student's score. Firstly, we investigated four sets of text complexity measures (lexical, syntactic, semantic, and character measures), cohesion metrics and a couple of features related to the thesis organization and to the references and bibliography. Secondly, we computed the correlation between the proposed features and excluded the highly inter-correlated ones. After that, we used several classifiers to predict the students' grade levels and compared their performances. Finally, we tested our work on a corpus of Bachelor and Master theses written in English by students of the Computer Science Department of the University Politehnica of Bucharest (English was chosen because of the high availability of open-source tools for natural language processing). We evaluated the quality of the presented application using Pearson's Rank Correlation to compare our results with the grades assigned to the students' theses by the evaluation committee.
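As a hedged sketch of the feature-screening step just described (computing pairwise correlations and excluding highly inter-correlated features), with invented feature names and synthetic data standing in for the real complexity measures:

```python
# Sketch: drop one feature from each highly inter-correlated pair.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
n = 200
features = {
    "avg_word_length": rng.normal(5, 1, n),
    "avg_sentence_length": rng.normal(20, 5, n),
}
# 'lexical_diversity' made nearly redundant with word length, on purpose.
features["lexical_diversity"] = features["avg_word_length"] * 0.1 + rng.normal(0, 0.01, n)

names = list(features)
kept = set(names)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r, _ = pearsonr(features[a], features[b])
        if abs(r) > 0.9 and a in kept and b in kept:
            kept.discard(b)  # keep the first of the pair, drop the other
print(sorted(kept))  # 'lexical_diversity' should be gone
```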
Every country has its own topics of interest and its hot topics at different moments in time. In this paper we present a system that helps to understand and compare different countries, starting from the topics that are debated among their members. In order to do that, we recorded and analyzed the content of the messages that are sent on Twitter by people living in several countries, hoping that this way we will be able to capture the topics of interest for each culture and predict their hot topics. We did our analysis on English-written tweets only, based on the fact that English has become a global language, being spoken even by Internet users from non-English-speaking countries when they want to share their thoughts and have a global audience for their messages. Our study tries to capture the topic models both for the tweets and for the URLs shared in them. Then we compare the distribution of topics across different countries, both for the tweets and for the URLs, to check how consistent these models are. For the topic modelling task, we designed a specialized way of developing the models that is adapted for tweets (which have a maximum of 140 characters, being too short for classical topic modelling methods). Our system has been tested on a corpus consisting of English tweets, collected using the Twitter streaming API, that have a location attached to them and that also contain a URL. In order to eliminate our bias, we extracted tweets without any restrictions (including tweets written in other languages, tweets without URLs, and tweets without a location attached) and then checked the percentage of our targeted tweets for each country. As a consequence, we extended the period of collecting the tweets to decrease the risk of dealing with abnormal events occurring in a certain country.
These slides present a text segmentation system based on the sentiments expressed in the text. The system takes plain text as input (a product review, for instance) and uses two different resources for tagging the sentiment words: a sentiment-word dictionary and SentiWordNet. Once the sentiment words are identified, the initial text is annotated with segmentation markers at the points where polarity shifts (a sketch of this idea follows below). The system also outputs the counts of positive and negative sentiment words found in the text and can optionally annotate them with their valence.
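A minimal sketch of polarity-shift segmentation, with a toy lexicon standing in for the sentiment dictionary and SentiWordNet resources mentioned above:

LEXICON = {"great": 1, "love": 1, "bad": -1, "awful": -1}  # toy valence lexicon

def segment_on_polarity_shift(tokens):
    segments, current, prev_polarity = [], [], 0
    pos = neg = 0
    for tok in tokens:
        polarity = LEXICON.get(tok.lower(), 0)
        pos += polarity > 0
        neg += polarity < 0
        if polarity and prev_polarity and polarity != prev_polarity:
            segments.append(current)      # polarity shift -> segmentation marker
            current = []
        current.append(tok)
        prev_polarity = polarity or prev_polarity
    segments.append(current)
    return segments, pos, neg

print(segment_on_polarity_shift("I love the screen but the battery is awful".split()))
# -> ([['I', 'love', 'the', 'screen', 'but', 'the', 'battery', 'is'], ['awful']], 1, 1)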
In these slides we present a model intended to discriminate creative from non-creative news articles. To build the classifier, we combined nine different measures using a stepwise logistic regression model (a sketch of this selection scheme follows below). The obtained model was tested in two experiments: the first tried to discriminate between news articles about the 2012 US elections from different newspapers and articles on the same subject taken from The Onion (a website providing satirical news), while the second evaluated the model's capacity to generalize over different topics and text genres. The experiments showed that the system achieves 80% accuracy, but the lack of true positives in the second experiment raised the question of whether we really identified creativity or in fact detected satire (as the training corpus was built on the assumption that the satirical news from The Onion was also creative).
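A minimal forward-stepwise sketch of combining several candidate measures in a logistic regression (the slides mention nine); the synthetic data and the cross-validated-accuracy selection criterion are my assumptions:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))                       # nine candidate measures
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(size=200) > 0).astype(int)

selected, remaining, best_score = [], list(range(9)), 0.0
while remaining:
    # Greedily add the measure that most improves cross-validated accuracy.
    scores = {j: cross_val_score(LogisticRegression(), X[:, selected + [j]], y,
                                 cv=5).mean() for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best_score:                # no improvement -> stop
        break
    best_score = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)

print("selected measures:", selected, "accuracy:", round(best_score, 3))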
The document presents a methodology for automatically assessing participants in chat conversations used for computer-supported collaborative learning (CSCL). It uses natural language processing techniques and heuristics to evaluate conversations based on participants' involvement, knowledge, and innovation. The heuristics were tested on a corpus of 7 chat conversations involving 35 students discussing web collaboration technologies. Correlations between the heuristic evaluations and expert human evaluations were generally high, particularly for involvement and innovation. The knowledge heuristic was less reliable. The methodology can help identify effective participation criteria and rank learners and conversations.
In this poster paper we propose a new method for identifying creativity, based on analyzing a corpus of chat conversations on the same topic and extracting the new ideas expressed by the participants. The application is a first step towards supporting creativity in online group discussions, by highlighting the novel concepts present in the conversations (new ideas) and by identifying topics that could have become important had they not been forgotten during the debates (lost ideas).
Because of the ubiquity of metaphors in language, metaphor processing is an important task in the field of natural language processing. The first step towards metaphor processing, and probably the most difficult one, is metaphor detection. In the first part of this paper we review the theoretical background on metaphors and the models and implementations that have been proposed for their detection. We then build corpora for detecting three types of metaphors: IS-A metaphors, metaphors formed with the preposition ‘of’, and metaphors formed with a verb. For the first two tasks we train supervised classifiers using semantic features; for the third we use features commonly employed in text categorization.
The main objective of this paper is to compare the sentiments that prevailed before and after the presidential elections held in both the US and France in 2012. To achieve this objective, we extracted content from a social medium, Twitter, using the tweets of the electoral candidates and of public users (voters), collected by crawling during the course of the elections. To gain useful insights about the US elections, we scored the sentiment of each tweet using different metrics and performed a time-series analysis for candidates and for different topics (identified by specific keywords). In addition, we compared some of the insights obtained from the US election with what we observed for the French election. This deep-dive analysis was done in order to understand the inherent nature of elections and to bring out the influence of social media on them.
2. Purpose of the Thesis
• Design and develop tools that can be used for analyzing discourse in both conversations and monologues:
– To analyze the main directions in the field of discourse analysis.
– To identify the main tools to be used in discourse analysis.
– To analyze the role of repetition in discourse.
– To investigate whether a repetition-based perspective could be useful for discourse analysis.
– To develop a theoretical framework that can be used to analyze both kinds of text: conversations and monologues.
– To build a suite of applications that use the developed theoretical framework.
3. Semantics & Meaning
• Semantics = “the study of how meaning is constructed, interpreted, clarified, obscured, illustrated, simplified, negotiated, contradicted and paraphrased” [SMEL].
• Frege (early 1890s): the meaning of a whole context is constructed from the meanings of its constituent words, also taking into account the sentence's syntactic structure.
• The meaning of a word is determined by the company it keeps: by the relations between that word and the different linguistic units related to it in a semantic network (a network built using semantic means).
• Meaning representations – formal structures that capture meaning: First Order Logic (FOL), Description Logics, Semantic Networks, Frame-Based Systems, Ontologies.
• Available resources: WordNet (WN), SentiWordNet, VerbNet (VN), FrameNet (FN).
4. Discourse Analysis – Main Approaches
• Discourse = “a coherent structured group of sentences” [JuMa, 2009].
• Two types of discourse: conversations and monologues.
• Theories in Discourse Analysis:
– J. Hobbs's Theory – considers a hierarchical organization of discourse meaning, starting from semantic coherence relations identified in the text. His theory treats interpretation as abductive inference in formal logic.
– Grosz et al.'s Theories – also consider discourse meaning to be hierarchically organized, but start from the idea of centrality, using two notions (the backward-looking center Cb and the forward-looking centers Cf) to provide the means of linking the utterances.
– Rhetorical Structure Theory (RST) – developed by Mann and Thompson, suggests that the hierarchical structure of discourse can be obtained using a set of rhetorical relations (such as antithesis, elaboration, etc.) to inter-relate the text spans of the discourse.
– Speech Acts Theory – introduced by Austin and elaborated by Searle, classifies discourse utterances according to the action they fulfill.
– Polyphony Theory – based on the idea that a text contains multiple voices that influence each other.
5. Discourse Analysis – Problems of the Existing Approaches (I)
• In the first theories, discourse meaning is seen as incremental, but discourse tends to be chaotic rather than well organized (topic drifts).
• Problems with Hobbs's Theory:
– Requires a very large database for encoding the needed information → is domain-dependent.
– Uses FOL → not everything can be represented; inference-control issues (which rule should be fired, and in what order); very computationally expensive and time-consuming.
– Uses weighted abduction → very complicated inference mechanisms; unclear what common cultural background the participants share.
– Biased towards understanding and establishing coherence in the text.
• Problems with Grosz et al.'s Theories:
– Assume that, at a given moment, the focus of a discourse is on a single topic and that switches between topics are made smoothly.
– Assume that the intention for a given segment is the intention of the participant who initiates that segment.
– The locality of the backward-looking center Cb: the Cb for an utterance Un is chosen from the set of forward-looking centers of the previous utterance.
– “Each sentence, S, has a single backward-looking center” [GrJW, 1983].
6. Discourse Analysis – Problems of the Existing Approaches (II)
• Problems with Rhetorical Structure Theory (RST):
– Has problems similar to the previous theories'; these were solved by Potter (2008) by relaxing the main assumptions of RST: the tree-like rhetorical structure (a graph, in fact), uniqueness (one utterance can be involved in multiple relations), and adjacency constraints (due to chaotic conversations).
– Disregards the meaning of the words: “Although is better to drink an airplane, the real solution is the basketball game that flavors politicians to make high beds.” – has an RST structure but is meaningless.
• Problems with Polyphony Theory:
– Has not yet been applied to monologues;
– Does not define what can be considered a voice; the existing frameworks therefore take a voice to be either a participant in the conversation (an inconsistency, because the content uttered by one participant is not coherent) or an utterance (which could answer multiple previous utterances, therefore being the echo of different voices → incoherence).
7. Discourse Analysis – My Adaptation
• Starting from the polyphonic framework, we adapted it to also work for monologues, by considering ideas as voices.
• We considered a voice to be an idea, a concept that is rhythmically repeated in the analyzed text → most texts are polyphonic, since a text usually contains multiple ideas (voices) that flow in parallel, influencing each other and providing inter-animation.
• Identification of voices: using repetitions (in a broader sense, presented in the next slides), since, according to Brody (1994), a repetition is the echo of what has been said and provides a new context for the next uses of the repeated concept, giving the discourse both unity and difference (it enhances the unity-difference axis of a discourse, which is very important for inter-animation).
• Following the repetition threads, one can see whether the voices flow in parallel (a polyphonic text) or each appears only when the others are gone (simply multiple monophonic texts). One can also see whether these voices influence each other (providing inter-animation) or not.
8. Repetition
• Tannen (2009) also noticed the importance of repetition in conversations, saying that it has four different functions in discourse: production, comprehension, connection and interaction; when all these functions are fulfilled, one can observe the interpersonal involvement of the participants and a coherent conversation.
• Classification of repetitions:
– Who makes the repetition: self-repetition and repetition by others;
– Scale of fixity in form: from exact repetition, through concept repetition, to paraphrase;
– Scale of temporality: from immediate repetition to delayed (diachronic) repetition, within a discourse or across longer periods of time;
– Position of the repeated words in the phrase: beginning, end, inverse order, etc.;
– Quantity of repeated information: from a phoneme to a whole sentence or an idea;
– Intentionality: intentional (consciously used) or unintentional (deriving from automaticity or from various language or cognitive problems).
• We are interested in the mechanisms that enforce, and are able to capture, the inter-animation in a discourse: the unity-difference axis, corresponding to the criteria of scale of fixity in form and quantity of repeated information.
9. Types of Repetitions (I)
• Lexical chains = “sequence of related words in the text” [CaSt, 2001].
– E.g.: London → City → Capital → UK → Europe
• Paronymy = words that have similar forms but different meanings.
– Usually generated by mistake, by lack of knowledge, or by the desire to induce a specific rhythm in the discourse.
• Collocations = “a sequence of two or more words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components” [Chou, 1988].
– E.g.: strong tea, to make up, to kick the bucket
– Detected using statistics, or by translating into a different language and checking whether the meaning is preserved.
• N-grams = a probabilistic model that attempts to predict the next element of a sequence after the first n-1 elements have been observed.
– Its parameters are computed from large corpora.
– Usually used in combination with decoding algorithms (such as the Viterbi algorithm).
– E.g.: the Google corpus. (A combined sketch of these three mechanisms follows below.)
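To make the three mechanisms concrete, a minimal, self-contained Python sketch; it is my illustration, not the thesis code. chainable() is a toy lexical-chain test over NLTK's WordNet (nltk.download('wordnet') required), pmi_collocations() scores bigrams by pointwise mutual information, and next_word() is a bare bigram predictor; all names and thresholds are hypothetical.

from collections import Counter
from math import log2

from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

# 1) Lexical chains: two nouns are chainable if some of their WordNet senses
#    are close enough in the hypernym hierarchy (toy threshold).
def chainable(word_a, word_b, threshold=0.2):
    sims = [s1.path_similarity(s2) or 0.0
            for s1 in wn.synsets(word_a, pos=wn.NOUN)
            for s2 in wn.synsets(word_b, pos=wn.NOUN)]
    return max(sims, default=0.0) >= threshold

# 2) Collocations: rank bigrams by pointwise mutual information (PMI),
#    one of the standard statistical detection methods.
def pmi_collocations(tokens, min_count=5):
    unigrams, bigrams, n = Counter(tokens), Counter(zip(tokens, tokens[1:])), len(tokens)
    pmi = {bg: log2((c / n) / ((unigrams[bg[0]] / n) * (unigrams[bg[1]] / n)))
           for bg, c in bigrams.items() if c >= min_count}
    return sorted(pmi.items(), key=lambda kv: -kv[1])

# 3) N-grams (here n = 2): predict the next element after the previous one.
def next_word(tokens, context):
    followers = Counter(w2 for w1, w2 in zip(tokens, tokens[1:]) if w1 == context)
    return followers.most_common(1)[0][0] if followers else None

print(chainable("dog", "cat"))  # -> True (close in the hypernym hierarchy)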
10. Repetition and Rhythmicity
• The repetition of words, phrases or longer syntactic units determines a rhythmic pattern that provides musicality and allows a smoother flow of the discussion – “repetition is rhetorical” [John, 1987].
• Rhythmicity analysis could provide information for identifying:
– the most important concepts presented in a discourse;
– the most important moments of a discourse;
– the combinations of concepts that work together;
– the degree of generality of the debated concepts;
– the artifacts that are built in the discourse and the concepts they are related to;
– the right participants (both in number and in person) for a conversation, so as to ensure successful collaboration;
– the right sense of a polysemous word.
11. Applications (I) - Discourse Visualization and the Identification of Its Most Important Moments (I)
• Developed an application for visualizing the polyphonic analysis of any type of discourse (conversation or monologue).
• Each “color” represents a voice (an idea) from the discourse.
• Inter-animation = areas where different voices meet → these are considered the important moments of the discourse; analyzing the different voices, one can see where these important moments are placed in the text and can investigate the file if needed.
• Different types of important moments: pivotal moments, convergence moments, singular moments, divergence moments and meeting points.
• A flexible application: the user can select what information is shown.
• The inter-animation analysis also allows the identification of collocations, syntagms and idioms, as well as of missing links in the database used to build the lexical chains.
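As an illustration of what such a voice visualization might look like (a minimal matplotlib sketch with hypothetical voices and utterance indices, not the actual application shown on the next slide):

import matplotlib.pyplot as plt

# Hypothetical data: voice name -> utterance indices where its concept recurs.
voices = {
    "chat": [0, 2, 3, 7, 8],
    "forum": [1, 4, 5, 8, 12],
    "wiki": [6, 9, 10, 11, 12],
}

fig, ax = plt.subplots(figsize=(8, 2.5))
for row, (voice, positions) in enumerate(voices.items()):
    # One colored row of marks per voice; overlapping columns are the
    # places where voices meet (candidate important moments).
    ax.scatter(positions, [row] * len(positions), s=60, label=voice)
ax.set_yticks(range(len(voices)))
ax.set_yticklabels(list(voices))
ax.set_xlabel("utterance index")
ax.set_title("Repetition threads (voices) across a discourse")
plt.tight_layout()
plt.show()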
12. Applications (I) - Discourse Visualization and the Identification of Its Most Important Moments (II)
13. Applications (II) - Repetition and Rhythmicity-Based Assessment Model for Chat Conversations (I)
• We started from our adaptation of the Polyphonic framework, taking a voice to be either a participant or an idea, and evaluated the quality of the whole conversation in terms of the participants' involvement and of the conversation's effectiveness with respect to some given key concepts.
• We extracted several types of information from the conversations (how interesting the conversation is for the users, user persistence, explicit connections between the users' words, user activity, user absence, staying on topic, repetition, user usefulness, topic rhythmicity) and, based on them, established criteria for evaluating new conversations on the same topic with the same number of participants.
• Analyzing different models for different numbers of participants, we have shown that the models used to evaluate the chats depend on the number of participants: they differ for small (4-5 participants) and medium (6-8 participants) teams, and we expect them to also differ for 2-3 participants and for more than 8 participants.
• We have also computed the correlation between the application's scores and the domain experts' scores at both the validation and verification stages, obtaining 0.8389 and 0.7933 respectively, which recommends it as a reliable application (see the sketch below).
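The reported agreement is a standard Pearson correlation between the two score lists; a minimal sketch with hypothetical scores (the actual validation data are not reproduced here):

from scipy.stats import pearsonr

app_scores = [7.5, 8.0, 6.0, 9.0, 5.5, 8.5, 7.0]      # application's scores (made up)
expert_scores = [7.0, 8.5, 6.5, 9.0, 5.0, 8.0, 7.5]   # experts' scores (made up)

r, p_value = pearsonr(app_scores, expert_scores)
print(f"correlation r = {r:.4f} (p = {p_value:.4f})")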
14. Applications (II) - Repetition and Rhythmicity-Based Assessment Model for Chat Conversations (II)
[Figures: good vs. bad conversation, the application's interface, and the validation results.]
15. Applications (III) - Malapropism Detection and Correction
• Developed an application for the detection and correction of malapropos words (the unintentional misuse of a word by confusion with a similar-sounding one).
• Voices are represented by the important concepts in the text → at some points we observe dissonances, caused by the intervention of a different voice instead of the voices that would fit in that place.
• Automatically identify these dissonances and solve them (see the sketch below):
– Evaluate how likely a combination of words is to be dissonant (using a search engine and different thresholds);
– See what the dissonant voice should have sounded like (inspecting the paronyms of the dissonant voice);
– Replace the dissonant voice with the correct one, if possible.
• Results for English: between 84% and 87% for malapropism detection, between 68% and 80% for malapropism correction, and around a 0.5% rate of introducing new malapropisms into texts.
• Preliminary tests for Romanian lead to good results but longer processing times. We expect that accuracy will be around 70%.
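A minimal sketch of the detect-then-correct loop described above. The paronym dictionary and the count() frequency oracle are hypothetical stand-ins; the application used its own paronym dictionaries and search-engine hit counts.

PARONYMS = {
    "allusion": ["illusion"],
    "illusion": ["allusion"],
}

def count(phrase):
    # Hypothetical frequency oracle standing in for search-engine hit counts.
    corpus_counts = {"optical illusion": 1000, "optical allusion": 2}
    return corpus_counts.get(phrase, 0)

def detect_and_correct(words, dissonance_threshold=10):
    corrected = list(words)
    for i in range(len(words) - 1):
        bigram = f"{words[i]} {words[i + 1]}"
        if count(bigram) < dissonance_threshold:          # likely dissonant
            for paronym in PARONYMS.get(words[i + 1], []):
                candidate = f"{words[i]} {paronym}"
                if count(candidate) >= dissonance_threshold:
                    corrected[i + 1] = paronym            # replace the wrong voice
                    break
    return corrected

print(detect_and_correct(["optical", "allusion"]))  # -> ['optical', 'illusion']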
16. Applications (IV) - Text Recovery
• Improve the quality of OCR by guessing the words missing from the digital form of the document, using a probabilistic method for text recovery and the Google n-gram corpus.
• Reconstruction of damaged documents based on the prediction of the most plausible word sets for filling the missing areas (gaps).
• Two types of voices: voices as concepts and voices as n-grams (since we also needed to capture the functional words).
• Estimate the document model, then start from the gap's edges and fill the gap with the most plausible words: preference is given to word sets that respect the document model, contain echoes of the existing voices and are part of more frequent n-grams in our corpus (more powerful voices). A sketch follows below.
• The application did not achieve the expected results.
• N-grams were not very helpful – coverage rates:
– 5-grams: 15%, 4-grams: 30%, trigrams: 60%, bigrams: 90%
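A minimal sketch of the gap-filling idea, with a toy bigram table standing in for the Google n-gram corpus: score each vocabulary candidate by how well it links the words on both sides of the gap (the counts, vocabulary and scoring are my illustration):

# Hypothetical bigram counts, standing in for the Google n-gram corpus.
BIGRAMS = {
    ("the", "quick"): 500, ("quick", "brown"): 800, ("brown", "fox"): 900,
    ("quick", "red"): 50, ("red", "fox"): 300,
}
VOCAB = {"quick", "brown", "red", "fox"}

def fill_gap(left, right):
    """Pick the candidate w maximizing count(left, w) * count(w, right)."""
    def score(w):
        return BIGRAMS.get((left, w), 0) * BIGRAMS.get((w, right), 0)
    return max(VOCAB, key=score)

print(fill_gap("quick", "fox"))  # -> 'brown' (800 * 900 beats 50 * 300)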
17. Contributions (I)
• We have analyzed the main methods for meaning representation used in NLP.
• We have analyzed the main theories from the field of discourse analysis.
• We have adapted a new theoretical framework, based on the polyphony theory, that is domain-independent and can be used to analyze both types of discourse, unlike previous approaches to discourse analysis.
• We have presented an analysis of repetitions in conversations, including their main functions and a classification of repetitions according to multiple criteria.
• We have investigated the building of lexical chains.
• We have developed a new method for building such lexical chains that also includes a disambiguation process.
18. Contributions (II)
• We have described in detail the concepts of paronymy, collocations and n-grams.
• We have built two paronym dictionaries, one for English and one for Romanian.
• We have described how the rhythmicity of repetitions can be used in discourse.
• We have designed and implemented an application for discourse visualization that works on both conversations and monologues.
• We have proposed a classification of the important moments of a discourse and a visual method for identifying them in discourse. We have also explained how this method can be used for tasks like collocation identification or the detection of missing links in the lexical database used.
19. Contributions (III)
• We have designed an application that evaluates the quality of a conversation in terms of the participants' involvement in that conversation and of the conversation's effectiveness with respect to some given key concepts.
• We have built an application for malapropism detection and correction that works very well for the English language.
• We have derived an adaptation of the above application to make it work for Romanian, with very good initial results.
• We have described a framework, similar to the one used for malapropism detection and correction, that could be used for metaphor identification.
• Starting from the Polyphony Theory, we have described a framework that can help reconstruct damaged documents, defining another type of voice that can be used in the polyphonic framework: the voice as pattern repetition.
20. Conclusions
• This thesis has addressed several problems from the discourse analysis domain and has proposed novel solutions for tasks like discourse visualization, identification of a discourse's important moments, assessment of the participants' contributions to a CSCL conversation using a rhythmicity-based solution, detection and correction of malapropisms, and generation of text to fill the gaps in damaged documents.
• From a theoretical point of view:
– We have proposed a modification of the polyphonic framework that allows the analysis of any type of discourse (conversation or monologue), starting from an analysis of the existing approaches and tools in the field of discourse analysis.
– We have presented the advantages of, and possible uses for, rhythmicity analysis.
– We have introduced a new type of voice that can be used in the polyphonic framework: the voice as pattern repetition.
– We have suggested a new method for discourse segmentation.
– We have provided a classification of the important moments of a discourse.
21. Publication list
1. Chiru, C., Trăuşan-Matu, Ş. (2008). Prelucrarea limbajului natural în interacţiunile chat (in Romanian). In: Ştefan Trăuşan-Matu (Ed.), Interacţiunea conversaţională în sistemele colaborative pe Web, ISBN 978-973-755-393-5, Matrix Rom, Bucharest, pp. 117-138.
2. Chiru, C., Cojocaru, V., Rebedea, T., Trausan-Matu, S. (2010). Malapropisms Detection and Correction using a Paronyms Dictionary, a Search Engine and WordNet. ICSOFT 2010, vol. 2, pp. 364-373.
3. Chiru, C., Hanganu, A., Rebedea, T., Trausan-Matu, S. (2010). Filling the Gaps using Google 5-Grams Corpus. ICSOFT 2010, vol. 2, pp. 438-443.
4. Chiru, C., Cojocaru, V., Trausan-Matu, S., Rebedea, T., Mihaila, D. (2011). Repetition and Rhythmicity Based Assessment for Chat Conversation. ISMIS 2011, LNAI 6804, Springer, pp. 513-523.
5. Rebedea, T., Trăuşan-Matu, Ş., Chiru, C. (2008). Extraction of Socio-semantic Data from Chat Conversations in Collaborative Learning Communities. In: Times of Convergence. Technologies Across Learning Contexts, LNCS 5192, pp. 366-377, Springer, Berlin.
6. Rebedea, T., Trausan-Matu, S., Chiru, C. (2010). Automatic Feedback System for Collaborative Learning using Chats and Forums. CSEDU (1): 358-363.
7. Rebedea, T., Dascălu, M., Trăuşan-Matu, Ş., Banica, D., Gartner, A., Chiru, C., Mihaila, D. (2010). Overview and Preliminary Results of Using PolyCAFe for Collaboration Analysis and Feedback Generation. In: Proceedings of ECTEL 2010, LNCS 6283, Springer, pp. 420-425.
8. Rebedea, T., Dascalu, M., Trausan-Matu, S., Armitt, G., Chiru, C. (2011). Automatic Assessment of Collaborative Chat Conversations with PolyCAFe. ECTEL 2011 (accepted).
9. Scheau, C., Rebedea, T., Chiru, C., Trausan-Matu, S. (2010). Improving the Relevance of Search Engine Results by Using Semantic Information from Wikipedia. 9th RoEduNet International Conference, pp. 151-156.
10. Trausan-Matu, S., Posea, V., Rebedea, T., Chiru, C. (2009). Using the Social Web to Supplement Classical Learning. In: Advances in Web Based Learning – ICWL 2009, LNCS 5686, pp. 386-389, Springer.
11. Chiru, C., Trăuşan-Matu, Ş., Rebedea, T. (2008). Algoritmi de generare de paronime pentru corectarea malapropismelor (in Romanian). In: Revista Română de Interacţiune Om-Calculator, Vol. 1, Nr. 1, pp. 57-72.
12. Chiru, C., Trăuşan-Matu, Ş., Rebedea, T. (2008). O îmbunătăţire a performanţelor algoritmului KNN în sistemele de recomandare pe web (in Romanian). In: Revista Română de Interacţiune Om-Calculator, Vol. 1, Număr special: Interacţiune Om-Calculator 2008, pp. 41-48.
13. Rebedea, T., Chiru, C., Trăuşan-Matu, Ş. (2008). Portal Web de stiri autonom bazat pe prelucrarea limbajului natural (in Romanian). In: Revista Română de Interacţiune Om-Calculator, Vol. 1, Număr special: Interacţiune Om-Calculator 2008, pp. 85-92.
14. Scheau, C., Rebedea, T., Chiru, C., Trausan-Matu, S. (2010). Îmbunătăţirea relevanţei rezultatelor motoarelor de căutare folosind informaţii semantice din Wikipedia (in Romanian). In: Revista Română de Interacţiune Om-Calculator, Vol. 3, Număr special: Interacţiune Om-Calculator 2010, pp. 85-90.
15. Chiru, C., Janca, A., Rebedea, T. (2010). Disambiguation and Lexical Chains Construction using WordNet. In: S. Trausan-Matu, P. Dessus (Eds.), Natural Language Processing in Support of Learning: Metrics, Feedback and Connectivity, MatrixRom, pp. 65-71.
16. Chiru, C., Rebedea, T., Ionita, M. (2010). Chat-Adapted POS Tagger for Romanian Language. In: S. Trausan-Matu, P. Dessus (Eds.), Natural Language Processing in Support of Learning: Metrics, Feedback and Connectivity, MatrixRom, pp. 90-96.
17. Trausan-Matu, S., Karatzas, K., Chiru, C. (2007). Environmental Information Perception, Analysis and Communication with the Aid of Natural Language Processing. In: Proceedings of the 21st International Conference on Informatics for Environmental Protection – Environmental Informatics and Systems Research.
18. Trăuşan-Matu, Ş., Dessus, P., Lemaire, B., Mandin, S., Villiot-Leclercq, E., Rebedea, T., Chiru, C., Mihaila, D., Gartner, A., Zampa, V. (2008). LTfLL – D5.1: Writing Support and Feedback Design. [Online] http://dspace.ou.nl/bitstream/1820/1700/1/LTfLL_Project_Deliverable_Report_5%201_Final4EC.pdf
19. Trăuşan-Matu, Ş., Dessus, P., Rebedea, T., Mandin, S., Villiot-Leclercq, E., Dascalu, M., Gartner, A., Chiru, C., Banica, D., Mihaila, D., Lemaire, B., Zampa, V., Graziani, E. (2009). LTfLL – D5.2: Learning Support and Feedback. [Online] http://dspace.ou.nl/bitstream/1820/2251/1/LTfLL_Project_Deliverable_ReportD5%202-final%20EC.pdf
20. Trăuşan-Matu, Ş., Dessus, P., Rebedea, T., Loiseau, M., Dascalu, M., Mihaila, D., Braidman, I., Armitt, G., Smithies, A., Regan, M., Lemaire, B., Stahl, J., Villiot-Leclercq, E., Zampa, V., Chiru, C., Pasov, I., Dulceanu, A. (2010). LTfLL – D5.3: Support and Feedback Services Version 1.5. [Online] http://dspace.ou.nl/bitstream/1820/2802/7/D5.3%20final%20EC.pdf
21. Chiru, C. (2007). Unsupervised Cohesion Based Text Segmentation. In: Proceedings of the EUROLAN 2007 Doctoral Consortium, ISBN 978-973-703-246-1, Publishing House of the “Alexandru Ioan Cuza” University of Iasi, pp. 93-96.
22. Posea, V., Rebedea, T., Chiru, C., Trăuşan-Matu, Ş. (2012). Social Web Technologies to Enhance Teaching and Learning. In: UPB Scientific Bulletin (to be published).
22. Q&A
Thank you for your time!