Author / Scientific advisor
University Politehnica of Bucharest
Faculty of Automatic Control and Computers
Computer Science Department
Chat-Adapted POS Tagger for Romanian Language
Costin-Gabriel Chiru - costin.chiru@cs.pub.ro
Traian Rebedea
Mădălina Ioniță
Contents
• Introduction
• What is a POS tagger
• How does a POS tagger work
• Chat versus novel
• Chat model for POS tagger
• Results
• Conclusions and further development
14.09.2010
The Second Workshop on Natural Language Processing in
Support of Learning: Metrics, Feedback and Connectivity
Introduction
• Purpose: to build a part-of-speech (POS) tagger for
Romanian that can tag the words of a special type
of text: chats.
• Methodology: we used the Hidden Markov Model (HMM)
paradigm to “learn” a model of the POS of the
different words in chats. Tagging is done with the
Viterbi algorithm, a dynamic programming
algorithm.
What is a POS tagger
• “The act of assigning each word in a sentence a
tag that describes how that word is used in the
sentence. Typically, these tags indicate syntactic
categories, such as noun or verb, and
occasionally include additional feature
information, such as number (singular or plural)
and verb tense.” (Thede & Parker, 1999)
• Difficult task: many words are ambiguous and can be
associated with multiple POS (e.g. “book” can be a
noun or a verb).
How does a POS tagger work
• The POS tagger “learns” a model, represented by
a set of probabilities, from an annotated corpus;
– Best solution is an HMM: M = (π, A, B), where π
holds the initial state probabilities, A is the
transition matrix and B is the confusion (emission)
matrix.
• Based on these, the POS tagger decides the most
suitable sequence of tags for a new text that has
to be tagged.
– Best solution is decoding with the Viterbi algorithm.
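As an illustration of the two bullets above, here is a minimal sketch of a first-order HMM tagger: π, A and B are estimated by counting over a tagged corpus, and Viterbi picks the most probable tag sequence. The toy English corpus, the tag names, the additive smoothing constant and the function names `train_hmm` and `viterbi` are all assumptions made for the example; they are not the authors' data or implementation.

```python
import math
from collections import Counter, defaultdict

def train_hmm(tagged_sentences, smoothing=1e-3):
    """Estimate log pi, log A and log B by counting (additive smoothing is an assumption)."""
    tags = sorted({t for sent in tagged_sentences for _, t in sent})
    vocab = {w for sent in tagged_sentences for w, _ in sent}
    init, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sentences:
        if not sent:
            continue
        init[sent[0][1]] += 1                      # counts for pi
        for word, tag in sent:
            emit[tag][word] += 1                   # counts for B
        for (_, t1), (_, t2) in zip(sent, sent[1:]):
            trans[t1][t2] += 1                     # counts for A

    def log_prob(counter, key, n_outcomes):
        total = sum(counter.values())
        return math.log((counter[key] + smoothing) / (total + smoothing * n_outcomes))

    pi = {t: log_prob(init, t, len(tags)) for t in tags}
    A = {s: {t: log_prob(trans[s], t, len(tags)) for t in tags} for s in tags}

    def B(tag, word):
        # One extra outcome reserves some probability mass for unseen words.
        return log_prob(emit[tag], word, len(vocab) + 1)

    return tags, pi, A, B

def viterbi(words, tags, pi, A, B):
    """Return the most probable tag sequence for `words` under the model."""
    if not words:
        return []
    # best[i][t]: log-probability of the best path for words[:i+1] that ends in tag t
    best = [{t: pi[t] + B(t, words[0]) for t in tags}]
    back = []
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda s: best[-1][s] + A[s][t])
            scores[t] = best[-1][prev] + A[prev][t] + B(t, word)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):                # follow the back-pointers
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy corpus (hypothetical): "book" occurs both as a noun and as a verb.
toy_corpus = [
    [("i", "PRON"), ("read", "VERB"), ("the", "DET"), ("book", "NOUN")],
    [("i", "PRON"), ("book", "VERB"), ("the", "DET"), ("room", "NOUN")],
    [("the", "DET"), ("book", "NOUN"), ("is", "VERB"), ("good", "ADJ")],
]
tags, pi, A, B = train_hmm(toy_corpus)
print(viterbi(["i", "book", "the", "book"], tags, pi, A, B))
```

Working in log space avoids numeric underflow on long utterances; the choice of smoothing strongly affects how unseen words are handled and would need tuning on real data.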
How does a POS tagger work (2)
• Three steps:
– Tokenization: pre-processing that splits the input
string into tokens;
– Ambiguity look-up: uses a lexicon and a tag set to
assign each word a list of possible POS tags. Words
that are missing from the lexicon are evaluated
based on the transition matrix of the HMM;
– Disambiguation: chooses a single tag from the
candidate tags found for each word, so as to
maximize the probability of the whole sequence.
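The first two steps above can be sketched as follows. The mini-lexicon, the tag set, the emoticon pattern and the lowercasing are hypothetical choices made only for this example; the slides do not describe the actual tokenizer or lexicon format. Unknown words keep every tag as a candidate, so that the disambiguation step (for instance Viterbi decoding) can decide from context.

```python
import re

# Hypothetical mini-lexicon: word -> set of possible tags.
LEXICON = {
    "i": {"PRON"},
    "book": {"NOUN", "VERB"},
    "the": {"DET"},
    "room": {"NOUN"},
}
TAGSET = {"PRON", "NOUN", "VERB", "DET", "ABBR"}

# Simple emoticons first, then words (with an optional apostrophe part),
# then any single punctuation character.
TOKEN_RE = re.compile(r"[:;][-']?[)(spd]|\w+(?:['’]\w+)?|[^\w\s]")

def tokenize(utterance):
    """Step 1: split an utterance into lowercase tokens."""
    return TOKEN_RE.findall(utterance.lower())

def candidate_tags(token):
    """Step 2: look the token up; unknown words keep all tags open."""
    return LEXICON.get(token, TAGSET)

def ambiguity_lookup(utterance):
    return [(token, candidate_tags(token)) for token in tokenize(utterance)]

for token, candidates in ambiguity_lookup("brb i book the room ;)"):
    print(token, sorted(candidates))
```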
Chat versus Novels
• The POS taggers built so far (en: TnT tagger,
TreeTagger, Stanford Tagger, Qtag; ro:
http://www.cs.ubbcluj.ro/~dtatar/nlp/WebTagger/W
) obtain around 96-97% accuracy on regular
text.
• However, accuracy drops when they are applied to
a different kind of text: chats.
• For good tagging, the text from which the model is
learnt has to be similar to the text to which it is
applied.
Chat versus Novels (2)
• How chats differ from regular text:
– massive use of abbreviations: brb, cu, lol, gtg, afk;
– use of emoticons: :s, ;), :P;
– punctuation marks are often omitted: i told
john i go to him after school i’m sorry i wasn’t able to
do it;
– capital letters at the beginning of a sentence or
of a name are rarely used: i, john;
– many more misspelled words are encountered;
– diacritics are very rarely used;
– utterance ≠ sentence;
– use of foreign-language words, especially English.
Chat model for POS tagger
• An annotated corpus of chats is needed to serve
as the model for chats.
• We started from an annotated corpus for novels (“1984”, by
George Orwell), having 154 tags taken from
http://www.racai.ro/books/awde/tufiscor.html, and built
the new model in a semi-supervised manner:
– The novel model was applied to tag a chat corpus (5 chats, ≈700
utterances, ≈15,000 words);
– The wrong tags were manually corrected;
– The corpus of tagged chats was used as the model for
annotating chats;
– The lexicon contained 2129 words.
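The semi-supervised procedure above can be sketched as a small loop. To keep the example short, a most-frequent-tag model stands in for the HMM used in the slides, and the manual correction is represented by a hypothetical `correct` callback; the names `train_unigram`, `tag` and `bootstrap` are assumptions made for the illustration.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    """Stand-in model: most frequent tag per word (the slides use an HMM instead)."""
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        for word, tag_ in sent:
            counts[word][tag_] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(model, words, default="NOUN"):
    """Tag each word; unknown words get a default tag (an arbitrary choice here)."""
    return [(w, model.get(w, default)) for w in words]

def bootstrap(novel_corpus, chat_utterances, correct):
    """Train on the novel, tag the chats, correct the tags, retrain on the chats."""
    novel_model = train_unigram(novel_corpus)
    auto_tagged = [tag(novel_model, utt) for utt in chat_utterances]
    corrected = [correct(sent) for sent in auto_tagged]   # manual step in the slides
    return train_unigram(corrected)

# Hypothetical data: the "annotator" only fixes the tag of the abbreviation "brb".
novel = [[("the", "DET"), ("book", "NOUN")]]
chats = [["brb", "the", "book"]]
fix_brb = lambda sent: [(w, "ABBR" if w == "brb" else t) for w, t in sent]
chat_model = bootstrap(novel, chats, fix_brb)
print(chat_model)
```

The point of starting from the novel model is that the annotator only has to fix the wrong tags instead of tagging every chat word from scratch.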
Results
• Testing was done on a separate corpus of
chats (4 chats, ≈500 utterances, ≈12,000 words).
• Both the model built from the chat corpus and the
model built from the novel “1984” were tested.
• Using the chat model instead of the novel model
increased precision by 10.6% and recall by 6.7%
(relative increases).
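For reference, per-tag precision and recall can be computed from the gold tags and the predicted tags as in the sketch below. It uses the standard definitions (correct predictions of a tag divided by all predictions of that tag, respectively by all gold occurrences of that tag); the slides do not state the exact evaluation procedure, so this is an assumption, and the tag sequences are made up.

```python
from collections import Counter

def precision_recall(gold, predicted):
    """Per-tag (precision, recall) from two aligned tag sequences."""
    correct, pred_count, gold_count = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        gold_count[g] += 1
        pred_count[p] += 1
        if g == p:
            correct[g] += 1
    scores = {}
    for t in set(gold_count) | set(pred_count):
        precision = correct[t] / pred_count[t] if pred_count[t] else 0.0
        recall = correct[t] / gold_count[t] if gold_count[t] else 0.0
        scores[t] = (precision, recall)
    return scores

# Hypothetical example with four tokens.
gold = ["NOUN", "VERB", "NOUN", "ABBR"]
pred = ["NOUN", "NOUN", "NOUN", "VERB"]
for t, (p, r) in sorted(precision_recall(gold, pred).items()):
    print(f"{t}: precision={p:.2f} recall={r:.2f}")
```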
Precision
Tag            Novel model     Chat model      Improvement
               precision (%)   precision (%)   (points and relative %)
Overall        66              73              7 (10.6 %)
Nouns          54              69              15 (27.7 %)
Numerals       66              69              3 (4.54 %)
Adjectives     59              53              -6 (-10.1 %)
Pronouns       66              67              1 (1.5 %)
Verbs          68              77              9 (13.2 %)
Abbreviations  35              81              46 (131.4 %)
Prepositions   85              82              -3 (-3.5 %)
Conjunctions   68              91              23 (33.8 %)
Prefixes       33              50              17 (51.5 %)
Articles       70              83              13 (18.5 %)
Adverbs        62              86              24 (38.7 %)
Interjections  56              89              33 (58.9 %)
Auxiliary      63              76              13 (20.6 %)
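The improvement column combines an absolute gain in percentage points with a gain relative to the novel model. The sketch below reproduces that arithmetic; the helper name `improvement` is hypothetical, and truncating to one decimal is an assumption inferred from the table values (for example 15/54 = 27.78% is listed as 27.7%).

```python
import math

def improvement(novel, chat):
    """Absolute gain in points and relative gain in %, truncated to one decimal."""
    absolute = chat - novel
    relative = math.trunc(1000 * absolute / novel) / 10
    return absolute, relative

print(improvement(66, 73))   # overall precision: novel model 66, chat model 73
print(improvement(59, 53))   # adjectives: the chat model is worse here
```

A relative figure can look dramatic when the baseline is low: abbreviations gain 46 points, which is 131.4% relative to a baseline of 35.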
Recall
Tag            Novel model     Chat model      Improvement
               recall (%)      recall (%)      (points and relative %)
Overall        59              63              4 (6.7 %)
Nouns          48              54              6 (12.5 %)
Numerals       66              43              -23 (-34.8 %)
Adjectives     48              56              8 (16.6 %)
Pronouns       61              63              2 (3.2 %)
Verbs          60              61              1 (1.6 %)
Abbreviations  26              58              32 (123 %)
Prepositions   82              90              8 (9.7 %)
Conjunctions   68              84              16 (23.5 %)
Prefixes       100             100             0 (0 %)
Articles       63              73              10 (15.8 %)
Adverbs        80              77              -3 (-3.7 %)
Interjections  66              34              -32 (-48.4 %)
Auxiliary      54              52              -2 (-3.7 %)
Evolution of precision and recall
[Figure: the evolution of precision and recall for each part of speech (overall, noun, numeral, adjective, pronoun, verb, abbreviation, preposition, conjunction, prefix, article, adverb, interjection, auxiliary); vertical axis from -100 to 150; series: Precision, Recall.]
Conclusions
• Abbreviation tags gained the most in both precision and recall, because
abbreviations are used far more intensively in chats than in novels.
• Nouns and verbs also improved, which reflects how these POS are used in
chats to ease the exchange of information.
• Conjunctions improved significantly because there are more connections
between utterances than in a novel.
• Interjections show a large gain in precision (although their recall
dropped), which suggests that participants use them very often to transmit
certain signals.
• Parentheses improved because users tend to express themselves with
emoticons. The other punctuation marks did not improve significantly
because they are not used as much in chats.
Possible Improvements
• Annotating more chats to be used for the chat
model.
• Using a 2nd-order HMM instead of the current
1st-order one (trigrams instead of bigrams).
• Building a mapping from our tag set to the one
used by the POS tagger built at the Babeș-Bolyai
University, in order to better evaluate the
performance.
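To illustrate what moving from bigrams to trigrams means for the transition model, the sketch below estimates P(t3 | t1, t2) by relative frequency from tag sequences. The function name, the `<s>` padding symbol and the toy sequences are assumptions made for the example; a practical second-order tagger would also need smoothing or back-off to bigrams, since many tag trigrams never occur in a small corpus.

```python
from collections import Counter, defaultdict

def trigram_transitions(tag_sequences):
    """Relative-frequency estimate of P(t3 | t1, t2) from tag sequences."""
    counts = defaultdict(Counter)
    for tags in tag_sequences:
        padded = ["<s>", "<s>"] + list(tags)      # two start symbols as left context
        for t1, t2, t3 in zip(padded, padded[1:], padded[2:]):
            counts[(t1, t2)][t3] += 1
    return {
        context: {t: n / sum(c.values()) for t, n in c.items()}
        for context, c in counts.items()
    }

# Hypothetical tag sequences.
sequences = [
    ["PRON", "VERB", "DET", "NOUN"],
    ["PRON", "VERB", "DET", "ADJ"],
]
probs = trigram_transitions(sequences)
print(probs[("VERB", "DET")])
```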
Q&A
Thank you for your time!

Nlp based heuristics for assessing participants in cscl chats
 
Detecting discourse creativity in chat conversations
Detecting discourse creativity in chat conversationsDetecting discourse creativity in chat conversations
Detecting discourse creativity in chat conversations
 
Metaphor detection
Metaphor detectionMetaphor detection
Metaphor detection
 

Chat-Adapted POS Tagger for Romanian Language

  • 1. Chat-Adapted POS Tagger for Romanian Language. Costin-Gabriel Chiru (costin.chiru@cs.pub.ro), Traian Rebedea, Mădălina Ioniță. Universitatea Politehnica București, Facultatea de Automatică și Calculatoare, Catedra de Calculatoare.
  • 2. Contents
      • Introduction
      • What is a POS tagger
      • How does a POS tagger work
      • Chat versus novel
      • Chat model for POS tagger
      • Results
      • Conclusions and further development
      The Second Workshop on Natural Language Processing in Support of Learning: Metrics, Feedback and Connectivity, 14.09.2010
  • 3. Introduction
      • Purpose: to build a part-of-speech (POS) tagger for Romanian that can tag the words of a special type of text: chats.
      • Methodology: we used the Hidden Markov Model (HMM) paradigm to "learn" a model of the POS of the words found in chats. Tagging is done with the Viterbi algorithm, a dynamic programming algorithm.
  • 4. What is a POS tagger
      • “The act of assigning each word in a sentence a tag that describes how that word is used in the sentence. Typically, these tags indicate syntactic categories, such as noun or verb, and occasionally include additional feature information, such as number (singular or plural) and verb tense.” (Thede & Parker, 1999)
      • A difficult task: many words are polysemous and can be associated with more than one POS (e.g., "book").
  • 5. How does a POS tagger work
      • The POS tagger "learns" a model, represented by a set of probabilities, from an annotated corpus;
          – Best solution is an HMM: M = (π, A, B), where π holds the initial state probabilities, A is the transition matrix and B is the confusion matrix.
      • Based on these probabilities, the POS tagger decides the most suitable sequence of tags for a new text that has to be tagged.
          – Best solution is decoding with the Viterbi algorithm.
  • 6. How does a POS tagger work (2)
      • 3 steps:
          – Tokenization: a pre-processing step that splits the input string into tokens;
          – Ambiguity look-up: uses a lexicon and a tag set to assign each word a list of possible POS tags. Words that are missing from the lexicon are evaluated based on the transition matrix of the HMM;
          – Disambiguation: "chooses" a single tag from the set of possible tags found for each word, so as to maximize the probability of the whole sequence.
  • 7. Chat versus Novels
      • The POS taggers built so far (en: TnT tagger, TreeTagger, Stanford Tagger, Qtag; ro: http://www.cs.ubbcluj.ro/~dtatar/nlp/WebTagger/W ) obtain around 96-97% accuracy on regular text.
      • However, accuracy drops when they are applied to a different kind of text, such as chats.
      • For good tagging, the text the model is learnt from has to be similar to the text the model is applied to.
  • 8. Chat versus Novels (2)
      • How chats differ from regular text:
          – massive use of abbreviations: brb, cu, lol, gtg, afk;
          – use of emoticons: :s, ;), :P;
          – punctuation is often omitted: i told john i go to him after school i’m sorry i wasn’t able to do it;
          – capital letters at the beginning of a sentence or of a name are rarely used: i, john;
          – many more misspelled words are encountered;
          – diacritics are very rarely used;
          – utterance ≠ sentence;
          – use of foreign-language words, especially English.
  • 9. Chat model for POS tagger
      • An annotated corpus of chats is needed to serve as the model for chats.
      • We start from an annotated corpus for novels ("1984" by George Orwell, with 154 tags, taken from http://www.racai.ro/books/awde/tufiscor.html) and build the new model in a semi-supervised manner:
          – The novel model was applied to tag a chat corpus (5 chats, ≈700 utterances, ≈15,000 words);
          – The wrong tags were corrected manually;
          – The resulting corpus of tagged chats was used as the model for annotating chats;
          – The lexicon contained 2129 words.
  • 11. Results
      • Testing was done on another corpus of chats (4 chats, ≈500 utterances, ≈12,000 words).
      • Both the model built from the chat corpus and the model built from the novel "1984" were tested.
      • Using the chat model instead of the novel model gave a relative increase of 10.6% in precision (from 66% to 73%) and of 6.7% in recall (from 59% to 63%).
  • 12. Precision
      Tag             Novel model (%)   Chat model (%)   Improvement: points (relative %)
      Overall         66                73                7 (10.6)
      Nouns           54                69               15 (27.7)
      Numerals        66                69                3 (4.54)
      Adjectives      59                53               -6 (-10.1)
      Pronouns        66                67                1 (1.5)
      Verbs           68                77                9 (13.2)
      Abbreviations   35                81               46 (131.4)
      Prepositions    85                82               -3 (-3.5)
      Conjunctions    68                91               23 (33.8)
      Prefixes        33                50               17 (51.5)
      Articles        70                83               13 (18.5)
      Adverbs         62                86               24 (38.7)
      Interjections   56                89               33 (58.9)
      Auxiliary       63                76               13 (20.6)
  • 13. Recall
      Tag             Novel model (%)   Chat model (%)   Improvement: points (relative %)
      Overall         59                63                4 (6.7)
      Nouns           48                54                6 (12.5)
      Numerals        66                43              -23 (-34.8)
      Adjectives      48                56                8 (16.6)
      Pronouns        61                63                2 (3.2)
      Verbs           60                61                1 (1.6)
      Abbreviations   26                58               32 (123)
      Prepositions    82                90                8 (9.7)
      Conjunctions    68                84               16 (23.5)
      Prefixes        100               100               0 (0)
      Articles        63                73               10 (15.8)
      Adverbs         80                77               -3 (-3.7)
      Interjections   66                34              -32 (-48.4)
      Auxiliary       54                52               -2 (-3.7)
  • 14. Evolution of precision and recall
      [Chart: the evolution of precision and recall for each part of speech. Series: Precision, Recall. Categories: overall, noun, numeral, adjective, pronoun, verb, abbreviation, preposition, conjunction, prefix, article, adverb, interjection, auxiliary. Vertical axis from -100 to 150.]
  • 15. Conclusions
      • Abbreviation tags saw the largest increase in both precision and recall, because abbreviations are used far more intensively in chats than in novels.
      • Nouns and verbs also improved, which reflects how these parts of speech are used in chats to ease the exchange of information.
      • A significant improvement is registered for conjunctions, because there are more connections between utterances than between the sentences of a novel.
      • Interjections show a large gain in precision, which indicates that participants use them very often to transmit certain signals.
      • Parentheses improved because users tend to use emoticons to express themselves. The other punctuation marks did not improve significantly because they are not used much in chats.
  • 16. Possible Improvements
      • Annotating more chats to be used as the chat model.
      • Using a 2nd-order HMM instead of the 1st-order one used in the current version (trigrams instead of bigrams).
      • Building a mapping from our tag set to the one used by the POS tagger built at the Babeș-Bolyai University, in order to better evaluate performance.
  • 17. Q&A. Thank you for your time!
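
The "Improvement" columns of the precision and recall slides combine an absolute gain in percentage points with a relative gain in percent. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. The function name `improvement` and the selection of rows are illustrative assumptions; the score pairs are copied from the two tables.

```python
def improvement(novel, chat):
    """Return (gain in percentage points, relative gain in %) when moving
    from the novel-model score to the chat-model score."""
    points = chat - novel
    return points, 100.0 * points / novel

# (novel model %, chat model %) pairs copied from the precision and recall slides.
precision = {"Overall": (66, 73), "Abbreviations": (35, 81), "Adjectives": (59, 53)}
recall = {"Overall": (59, 63), "Abbreviations": (26, 58), "Interjections": (66, 34)}

for name, table in (("precision", precision), ("recall", recall)):
    for tag, (novel, chat) in table.items():
        points, relative = improvement(novel, chat)
        print(f"{name:9} {tag:14} {points:+d} points ({relative:+.1f} %)")
```

For overall precision this gives 7 points and about 10.6 %, matching the slide. For overall recall the relative gain is about 6.78 %, which the slide lists as 6.7, so the slide values appear to be truncated rather than rounded.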

Editor's Notes

  1. Ex: book
  2. π represents the initial state probabilities; A (the transition matrix) gives the probability of having tag j after tag i; B (the confusion matrix) gives the probability that word w_i has tag t_j.
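
How π, A and B work together during decoding can be illustrated with a minimal first-order Viterbi sketch. This is not the authors' implementation. The function name `viterbi`, the three-tag tag set, the toy words and every probability below are invented for illustration. B is stored here as P(word | tag), the usual HMM emission form, which is an assumption about how the note's "confusion matrix" is used. The `unk` fallback for unseen words is likewise an assumption.

```python
import math

def viterbi(words, tags, pi, A, B, unk=1e-6):
    """Return the most probable tag sequence for `words` under a first-order HMM.

    pi[t]   : probability that a sequence starts with tag t
    A[s][t] : probability of tag t following tag s
    B[t][w] : probability of word w given tag t (unseen words get `unk`)
    """
    if not words:
        return []

    def lg(p):
        # Work in log space to avoid underflow on long sequences.
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t] = log-probability of the best path that ends in tag t at position i
    best = [{t: lg(pi[t]) + lg(B[t].get(words[0], unk)) for t in tags}]
    back = []
    for w in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda s: best[-1][s] + lg(A[s][t]))
            scores[t] = best[-1][prev] + lg(A[prev][t]) + lg(B[t].get(w, unk))
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final tag to recover the path.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy parameters (invented): "book" can be a noun or a verb.
TAGS = ["DET", "NOUN", "VERB"]
PI = {"DET": 0.5, "NOUN": 0.2, "VERB": 0.3}
A = {
    "DET":  {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.2, "NOUN": 0.2, "VERB": 0.6},
    "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
}
B = {
    "DET":  {"the": 0.7, "a": 0.3},
    "NOUN": {"book": 0.5, "flight": 0.5},
    "VERB": {"book": 0.6, "read": 0.4},
}

print(viterbi(["the", "book"], TAGS, PI, A, B))
print(viterbi(["book", "a", "flight"], TAGS, PI, A, B))
```

With these toy numbers the ambiguous word "book" is tagged as a noun after "the" and as a verb at the start of "book a flight", because the transition probabilities in A outweigh the small difference in its emission probabilities in B.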