Minority Language Twitter: Part-of-Speech Tagging and Analysis of Irish Tweets
1. Introduction Irish Language in Social Media Irish Twitter POS tagging Experiments Analysis and Conclusion
Teresa Lynn (1, 2), Kevin Scannell (3), and Eimear Maguire (1)
(1) ADAPT Centre, School of Computing, Dublin City University, Ireland
(2) Department of Computing, Macquarie University, Sydney, Australia
(3) Department of Mathematics and Computer Science, St. Louis University, USA
31st July 2015
Outline
Introduction
Irish Language in Social Media
Irish Twitter POS tagging
Experiments
Analysis and Conclusion
Minority Languages in the Digital Age
Accessing information in the Digital Age
print resources: newspapers, magazines, books
digital resources: internet, news sites, social media, blogs, educational software
Printing press → extinction of many minority languages
Digital Age → extinction of minority languages?
The Irish Language
Official and National Language of Ireland
Official EU Language
UNESCO-listed endangered language
Celtic language (Indo-European)
VSO word order and morphologically rich
Irish Language use in Ireland
Figure: Irish spoken daily outside the education system, 2011 Census figures (image source: Wikipedia)
Status of Irish Language Technology
Text analysis: state of language technology (LT) support for 30 EU languages
Source: META-NET, “The Irish Language in the Digital Age” (Judge et al., 2012)
Jumping on the Social Media Bandwagon
Resurgence amongst younger generation
Facebook
Blogs/Forums
Twitter
Irish Twitter
According to the Indigenous Tweets project:
over 1 million tweets since Twitter’s launch in 2006
over 8,000 Irish language tweeters
types of users: media, native speakers, language enthusiasts, government bodies
main topics: sports, Irish language promotion, TV, community/public events, news items
www.indigenoustweets.com
Irish Tweets
Freezing i dTra Li,Ciarrai chun cinn le cuilin.
Freezing i dTrá Lí, (tá) Ciarraí chun cinn le cúilín.
‘Freezing in Tralee, Kerry (is) ahead by a point.’
Figure: Example of noisy Irish tweet
Code-switching
Diacritics
Verb drop
Spacing
Phonetic spelling
gowil (go bhfuil) ‘that is’
Abbreviations
grma → go raibh maith agat ‘thank you’
t7ain → tseachtain ‘week’
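The kinds of noise listed above can be partly undone with simple lookups. The following is a minimal sketch, not the normalisation used in the experiments: the lookup table holds only the three examples from this slide, and the function names and the comma-spacing rule are assumptions made for illustration.

```python
import re

# Hypothetical lookup table, built only from the examples on the slide.
SHORTHAND = {
    "grma": "go raibh maith agat",  # abbreviation, 'thank you'
    "t7ain": "tseachtain",          # abbreviation, 'week'
    "gowil": "go bhfuil",           # phonetic spelling, 'that is'
}


def fix_spacing(text: str) -> str:
    """Insert a missing space after a comma (e.g. 'Li,Ciarrai').

    Assumption: commas are never used inside numbers in the input.
    """
    return re.sub(r",(?=\S)", ", ", text)


def expand_token(token: str) -> str:
    """Return the standard form of a known shorthand, else the token unchanged."""
    return SHORTHAND.get(token.lower(), token)


def normalise(text: str) -> str:
    """Apply the spacing fix, then expand shorthand token by token."""
    return " ".join(expand_token(t) for t in fix_spacing(text).split())
```

Restoring missing diacritics and handling code-switching need more than a lookup table and are deliberately left out of this sketch.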
Goals
Build a corpus of POS-tagged Irish tweets
Train a statistical POS tagger for Irish tweets
Assess how we can leverage existing resources
Examine the impact of noisy user-generated (UG) text on existing resources
New Irish Twitter POS tagset
(inspired by Gimpel et al. (2011))
Tag    Description                           PAROLE tags
N      common noun                           Noun, Pron Ref, Subst
^      proper noun                           Prop Noun
O      pronoun                               Pron Pers, Pron Idf, Pron Q, Pron Dem
VN     verbal noun                           Verbal Noun
V      verb                                  Cop, Verb*
A      adjective                             Adj, Verbal Adj, Prop Adj
R      adverb                                Adv*
D      determiner                            Art, Det
P      preposition, prep. pronoun            Prep*, Pron Prep
T      particle                              Part*
,      punctuation                           Punct
&      conjunction                           Conj Coord, Conj Subord
$      numeral, quantifier                   Num
!      interjection                          Itj
G      foreign words, abbreviations, item    Foreign, Abr, Item, Unknown
~      discourse marker
#      hashtag
#MWE   multi-word hashtag
@      at-mention
E      emoticon
U      URL/email address/XML                 Web
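The tagset table reads naturally as a many-to-one mapping from PAROLE labels to the new tags. A minimal sketch of that mapping follows, under stated assumptions: the coarse label strings are copied from the table rather than from the rule-based tagger's real output format, a trailing `*` is read as "any label starting with this prefix", and the fallback to `G` is inferred from `Unknown` being listed under `G`.

```python
# Exact coarse PAROLE labels, as written in the tagset table.
EXACT = {
    "Noun": "N", "Pron Ref": "N", "Subst": "N",
    "Prop Noun": "^",
    "Pron Pers": "O", "Pron Idf": "O", "Pron Q": "O", "Pron Dem": "O",
    "Verbal Noun": "VN",
    "Cop": "V",
    "Adj": "A", "Verbal Adj": "A", "Prop Adj": "A",
    "Art": "D", "Det": "D",
    "Pron Prep": "P",
    "Punct": ",",
    "Conj Coord": "&", "Conj Subord": "&",
    "Num": "$",
    "Itj": "!",
    "Foreign": "G", "Abr": "G", "Item": "G", "Unknown": "G",
    "Web": "U",
}

# Labels marked with "*" in the table, treated here as prefixes (assumption).
PREFIX = {"Verb": "V", "Adv": "R", "Prep": "P", "Part": "T"}


def map_tag(parole_label: str) -> str:
    """Map a coarse PAROLE label to the Irish Twitter tagset."""
    # Exact labels are checked first so that 'Verbal Noun' and 'Verbal Adj'
    # are not swallowed by the 'Verb' prefix rule.
    if parole_label in EXACT:
        return EXACT[parole_label]
    for prefix, tag in PREFIX.items():
        if parole_label.startswith(prefix):
            return tag
    return "G"  # assumed fallback: the table files Unknown under G
```

The Twitter-specific tags (`~`, `#`, `#MWE`, `@`, `E`) have no PAROLE counterpart in the table, so they cannot come out of a mapping like this and have to be assigned separately.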
Building a POS-tagged corpus
new POS tagset (inspired by Gimpel et al. (2011))
1550 random tweets (from set of 950,000 Irish tweets)
tokenised using twokenise (Owoputi et al., 2013)
post-processing
rejoined multi-word units (e.g. go dtí ‘to’)
split tokens with contractions (e.g. b’fhéidir → b’ fhéidir ‘maybe’)
pre-tagged with a rule-based tagger (using PAROLE tags)
mapped to new Irish Twitter POS tagset
hand corrected tags (and lemmas)
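The two post-processing steps can be sketched as token-list rewrites. This is an illustration only: the unit and prefix lists hold just the slide's own examples, and representing a rejoined unit as one space-separated token is an assumption about the corpus format.

```python
# Hypothetical resources containing only the examples from the slide.
MULTIWORD_UNITS = {("go", "dtí")}
CONTRACTION_PREFIXES = ("b’", "b'")  # curly and straight apostrophe variants


def rejoin_multiword(tokens: list[str]) -> list[str]:
    """Merge adjacent tokens that together form a known multi-word unit."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if len(pair) == 2 and pair in MULTIWORD_UNITS:
            out.append(" ".join(tokens[i:i + 2]))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def split_contractions(tokens: list[str]) -> list[str]:
    """Split a contracted prefix from its host word (b’fhéidir -> b’ fhéidir)."""
    out = []
    for tok in tokens:
        for prefix in CONTRACTION_PREFIXES:
            if tok.lower().startswith(prefix) and len(tok) > len(prefix):
                out.extend([tok[:len(prefix)], tok[len(prefix):]])
                break
        else:
            # No known prefix matched: keep the token as it is.
            out.append(tok)
    return out
```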
Experiment Setup
Data setup
Removal of tweets written entirely in English
→ 1537 tweets
Test set: 148 tweets
Dev set: 147 tweets
Training: 1242 tweets
Bootstrapping data:
Gold standard 3198-sentence POS-tagged corpus
grammatical, well-structured
no social media-related tokens (e.g. hashtags, emoticons)
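The three set sizes add up to the full corpus (148 + 147 + 1242 = 1537). A minimal sketch of producing a split of those sizes is shown below. The slides do not say how the split was drawn, so the random shuffle, the seed and the function name are assumptions.

```python
import random


def split_corpus(tweets, n_test=148, n_dev=147, seed=0):
    """Shuffle the tweets and cut them into train, dev and test portions."""
    shuffled = list(tweets)
    # A fixed seed makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test
```

Called on the 1537 remaining tweets, this leaves 1242 for training.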
Experiment Setup
Train 3 POS taggers
Morfette (Chrupala et al., 2008)
uses lemma information
tackles data sparsity
ARK (Owoputi et al., 2013)
developed initially for English tweets
no simple option to include lemma
ran separate form-only vs. lemma-only experiments
Stanford Tagger (Toutanova et al., 2003)
ran separate form-only vs. lemma-only experiments
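For taggers that take a single token column, the form-only and lemma-only experiments amount to choosing which column of the annotated corpus is fed in as the "word". A minimal sketch, assuming each tweet is held as a list of (form, lemma, tag) triples and that training files use one token-TAB-tag pair per line. Both the data layout and the function names are illustrative assumptions, not the taggers' documented input handling.

```python
def to_rows(tweet, use="form"):
    """Project (form, lemma, tag) triples to (token, tag) pairs.

    use="form" keeps the surface form; use="lemma" substitutes the lemma.
    """
    column = {"form": 0, "lemma": 1}[use]
    return [(triple[column], triple[2]) for triple in tweet]


def to_lines(rows):
    """Serialise (token, tag) pairs as token<TAB>tag lines."""
    return "".join(f"{token}\t{tag}\n" for token, tag in rows)


# Hypothetical annotated fragment used only to show the two views.
tweet = [("an", "an", "D"), ("bhfuil", "bí", "V")]
form_view = to_lines(to_rows(tweet, use="form"))
lemma_view = to_lines(to_rows(tweet, use="lemma"))
```

In the lemma view, inflected and mutated forms of the same word collapse to one token, which is one way of reducing sparsity in a small training set.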
Results
Training Data              Dev      Test
Baseline
  Rule-Based Tagger        85.07    83.51
Morfette
  BaseMorf                 86.77    88.67
  NormMorf                 87.94    88.74
  BaseMorf+Dict            87.50    89.27
  NormMorf+Dict            88.47    90.22
ARK
  BaseArkForm              88.39    89.92
  ArkForm#@                89.36    90.94
  ArkForm#URL@             89.32    91.02
  BaseArkLemma#URL         90.74    91.62
  ArkLemma#URL@            91.46    91.89
Stanford
  BestStanForm             82.36    84.08
  BestStanLemma            87.34    88.36
Bootstrapping Best Model
  ArkLemma#URL@+NCII       92.60    93.02
Table: Results of evaluation of POS-taggers on new Irish Twitter corpus
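The table does not name its metric. Assuming the scores are token-level tagging accuracy in percent, they can be computed and compared as sketched below. The relative error reduction helper is an addition for reading the table, not something reported on the slides.

```python
def accuracy(gold, predicted):
    """Percentage of tokens whose predicted tag equals the gold tag."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("tag sequences must be non-empty and of equal length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


def relative_error_reduction(baseline_acc, system_acc):
    """Share of the baseline's errors that the system removes (0 to 1)."""
    return (system_acc - baseline_acc) / (100.0 - baseline_acc)


# Test-set figures from the table: rule-based baseline vs. bootstrapped ARK model.
reduction = relative_error_reduction(83.51, 93.02)
```

On those two test scores the bootstrapped model removes a little over half of the baseline's errors.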
Analysis
Our Highest Scores 91.46 (Dev) 91.89 (Test)
Comparison with English tweet POS tagging scores
(Gimpel et al., 2011) 88.67 (Dev) 89.37 (Test)
1827 tweets. 17 annotators.
tag dictionary (based on PTB)
distributional similarity features
phonetic normalisation (Metaphone)
(Owoputi et al., 2013) 93.2
tag dictionary
also used word clustering on 56 million tweets
Analysis
Our limited resources
No frequently-capitalised token features
No distributional similarity features
No phonetic normalisation
No unsupervised clustering
2 annotators
Analysis
And...
inter- and intra-sentential code-switching
most frequent OOVs are English stop words (‘to’, ‘on’, ‘for’) and words with unmarked diacritics (in English tweets, by contrast, the most frequent OOVs are text-speak)
Possible reason for high scores
tag used for English tokens also used for abbreviations, items and unknowns
more care taken by Irish tweeters?
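An out-of-vocabulary (OOV) profile like the one described above can be obtained by counting evaluation tokens that never occur in the training vocabulary. A minimal sketch follows. Case-folding and the choice of reference vocabulary are assumptions, since the slides do not state what the OOVs were measured against.

```python
from collections import Counter


def oov_counts(train_tokens, eval_tokens):
    """Count evaluation tokens (case-folded) that are absent from training."""
    vocabulary = {token.lower() for token in train_tokens}
    return Counter(
        token.lower() for token in eval_tokens if token.lower() not in vocabulary
    )
```

Sorting the result with `most_common()` puts the most frequent OOV tokens first, which is where English stop words such as ‘to’ would show up in code-switched tweets.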
Conclusion
Our contribution to Irish language NLP
new POS tagset for tweets
gold corpus of 1537 POS-tagged Irish tweets
statistical POS tagging models
first computational analysis of Irish social media text
Future Work
sociolinguistic studies
domain-adapted parsing
further POS-tagging with new tag for English tokens
cross-lingual analysis
Conclusion
#GRMA
Go Raibh Maith Agaibh!
(Thank You!)
New Tags
Verbal Nouns
infinitive phrases (INF-PART + VN)
progressive aspectual phrases (PREP + VN)
Multiword Hashtags
future work on parsing
Particles
relative, surname, infinitive, numeric, comparative, vocative, adverbial