Introduction Irish Language in Social Media Irish Twitter POS tagging Experiments Analysis and Conclusion
Minority Language Twitter: Part-of-Speech Tagging and Analysis of Irish Tweets
Teresa Lynn¹,², Kevin Scannell³, and Eimear Maguire¹
¹ ADAPT Centre, School of Computing, Dublin City University, Ireland
² Department of Computing, Macquarie University, Sydney, Australia
³ Department of Mathematics and Computer Science, St. Louis University, USA
31st July 2015
Outline
Introduction
Irish Language in Social Media
Irish Twitter POS tagging
Experiments
Analysis and Conclusion
Minority Languages in the Digital Age
Accessing information in the Digital Age
print resources: newspapers, magazines, books
digital resources: internet, news sites, social media, blogs,
educational software
Printing press → extinction of many minority languages
Digital Age → extinction of minority languages?
The Irish Language
Official and National Language of Ireland
Official EU Language
UNESCO-listed endangered language
Celtic language (Indo-European)
VSO word order and morphologically rich
Irish Language use in Ireland
Figures from 2011 Census: Irish spoken daily outside education system
Image source: Wikipedia.
Status of Irish Language Technology
Text analysis: state of LT support for 30 EU languages
Source: META-NET: “The Irish Language in the Digital Age” (Judge et al., 2012)
Jumping on the Social Media Bandwagon
Resurgence amongst younger generation
Facebook
Blogs/ Forums
Twitter
Irish Twitter
According to the Indigenous Tweets project:
over 1 million tweets since Twitter’s launch in 2006
over 8,000 Irish language tweeters
types of users: media, native speakers, language enthusiasts, government bodies
main topics: sports, Irish language promotion, TV, community/public events, news items
www.indigenoustweets.com
Irish Tweets
Freezing i dTra Li,Ciarrai chun cinn le cuilin.
Freezing i dTrá Lí, (tá) Ciarraí chun cinn le cúilín.
‘Freezing in Tralee, Kerry (is) ahead by a point.’
Figure: Example of noisy Irish tweet
Code-switching
Diacritics
Verb drop
Spacing
Phonetic spelling
gowil (go bhfuil) ‘that is’
Abbreviations
grma → go raibh maith agat ‘thank you’
t7ain → tseachtain ‘week’
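Two of the noise types listed above (missing spacing and abbreviations) lend themselves to simple rule-based clean-up. The following is a minimal sketch, not the authors' normalisation procedure: the lookup table holds only the two abbreviations shown on the slide, and the function names are hypothetical. Restoring diacritics is not attempted here.

```python
import re

# Hypothetical lookup table: only the two abbreviations from the slide.
ABBREVIATIONS = {
    "grma": "go raibh maith agat",
    "t7ain": "tseachtain",
}

def add_missing_space(text):
    """Insert a space after a comma that is directly followed by a letter."""
    return re.sub(r",(?=[^\W\d_])", ", ", text)

def expand_abbreviations(tokens):
    """Replace known abbreviations with their full forms, token by token."""
    out = []
    for tok in tokens:
        expansion = ABBREVIATIONS.get(tok.lower())
        out.extend(expansion.split() if expansion else [tok])
    return out

tweet = "Freezing i dTra Li,Ciarrai chun cinn le cuilin."
spaced = add_missing_space(tweet)
expanded = expand_abbreviations(["GRMA", "!"])
```

A dictionary lookup only covers abbreviations seen in advance; phonetic spellings such as gowil would need a different strategy.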
Goals
Build a corpus of POS-tagged Irish tweets
Train a statistical POS tagger for Irish tweets
Assess how we can leverage existing resources
Examine the impact of noisy user-generated (UG) text on existing resources
New Irish Twitter POS tagset
(inspired by Gimpel et al. (2011))
Tag    Description                               PAROLE tags
N      common noun                               Noun, Pron Ref, Subst
^      proper noun                               Prop Noun
O      pronoun                                   Pron Pers, Pron Idf, Pron Q, Pron Dem
VN     verbal noun                               Verbal Noun
V      verb                                      Cop, Verb*
A      adjective                                 Adj, Verbal Adj, Prop Adj
R      adverb                                    Adv*
D      determiner                                Art, Det
P      preposition, prepositional pronoun        Prep*, Pron Prep
T      particle                                  Part*
,      punctuation                               Punct
&      conjunction                               Conj Coord, Conj Subord
$      numeral, quantifier                       Num
!      interjection                              Itj
G      foreign words, abbreviations, item        Foreign, Abr, Item, Unknown
~      discourse marker
#      hashtag
#MWE   multi-word hashtag
@      at-mention
E      emoticon
U      URL/email address/XML                     Web
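The correspondence between PAROLE tags and the new tagset can be written down as a lookup. This is a sketch under assumptions: the PAROLE label strings are taken exactly as printed on the slide, the starred entries (Verb*, Adv*, Prep*, Part*) are interpreted as prefix matches, the proper-noun tag is written as ASCII `^`, and the fallback to `G` for unseen labels is a choice made for this example. The name `to_twitter_tag` and the label `Verb PastInd` are hypothetical.

```python
# Exact PAROLE labels (as printed on the slide) mapped to the coarse tags.
EXACT = {
    "Noun": "N", "Pron Ref": "N", "Subst": "N",
    "Prop Noun": "^",
    "Pron Pers": "O", "Pron Idf": "O", "Pron Q": "O", "Pron Dem": "O",
    "Verbal Noun": "VN",
    "Cop": "V",
    "Adj": "A", "Verbal Adj": "A", "Prop Adj": "A",
    "Art": "D", "Det": "D",
    "Pron Prep": "P",
    "Punct": ",",
    "Conj Coord": "&", "Conj Subord": "&",
    "Num": "$",
    "Itj": "!",
    "Foreign": "G", "Abr": "G", "Item": "G", "Unknown": "G",
    "Web": "U",
}

# Starred entries are treated as prefixes (an assumption about the notation).
PREFIX = [("Verb", "V"), ("Adv", "R"), ("Prep", "P"), ("Part", "T")]

def to_twitter_tag(parole):
    """Map one PAROLE label to a coarse Twitter tag."""
    # Exact labels are checked first so that "Verbal Noun" is not
    # swallowed by the "Verb" prefix rule.
    if parole in EXACT:
        return EXACT[parole]
    for prefix, tag in PREFIX:
        if parole.startswith(prefix):
            return tag
    return "G"  # fallback for unseen labels (assumption)

mapped = [to_twitter_tag(t) for t in ["Verbal Noun", "Verb PastInd", "Prop Noun"]]
```

The Twitter-specific tags (~, #, #MWE, @, E) have no PAROLE counterpart, so they cannot come out of a mapping like this and have to be assigned separately.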
Building a POS-tagged corpus
new POS tagset (inspired by Gimpel et al. (2011))
1550 random tweets (from set of 950,000 Irish tweets)
tokenised using twokenise (Owoputi et al., 2013)
post-processing
rejoined multi-word units (e.g. go dtí ‘to’)
split tokens with contractions
(e.g. b’fhéidir → b’ fhéidir ‘maybe’)
pre-tagged with a rule-based tagger (using PAROLE tags)
mapped to new Irish Twitter POS tagset
hand corrected tags (and lemmas)
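The two post-processing steps (rejoining multi-word units and splitting contractions) can be sketched as follows. This is an illustration only: the unit list and the contraction prefixes contain just the slide's examples, and joining a multi-word unit with a space into a single token is an assumption about the output format. All names are hypothetical.

```python
# Hypothetical resources: only the examples given on the slide.
MULTIWORD_UNITS = {("go", "dtí")}
CONTRACTION_PREFIXES = ("b’", "b'")

def rejoin_multiword(tokens):
    """Merge adjacent token pairs that form a known multi-word unit."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if len(pair) == 2 and pair in MULTIWORD_UNITS:
            out.append(" ".join(tokens[i:i + 2]))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def split_contractions(tokens):
    """Split a contracted prefix such as b’ off the word it attaches to."""
    out = []
    for tok in tokens:
        for prefix in CONTRACTION_PREFIXES:
            if tok.lower().startswith(prefix) and len(tok) > len(prefix):
                out.extend([tok[:len(prefix)], tok[len(prefix):]])
                break
        else:
            out.append(tok)
    return out

rejoined = rejoin_multiword(["go", "dtí", "X"])
split = split_contractions(["b’fhéidir"])
```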
Experiment Setup
Data setup
Removal of 100%-English text tweets
→1537 tweets
Test set: 148 tweets
Dev set: 147 tweets
Training: 1242 tweets
Bootstrapping data:
Gold standard 3198-sentence POS-tagged corpus
grammatical, well-structured
no social media-related tokens (e.g. hashtags, emoticons)
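The split sizes add up to the filtered corpus: 148 + 147 + 1242 = 1537. A minimal sketch of producing a split with these sizes is given below; the slides do not say how tweets were assigned to each set, so the seeded random shuffle and the function name `split_corpus` are assumptions.

```python
import random

def split_corpus(tweets, n_test=148, n_dev=147, seed=0):
    """Shuffle with a fixed seed, then carve off test and dev sets."""
    shuffled = list(tweets)
    random.Random(seed).shuffle(shuffled)
    test_set = shuffled[:n_test]
    dev_set = shuffled[n_test:n_test + n_dev]
    train_set = shuffled[n_test + n_dev:]
    return train_set, dev_set, test_set

# Integers stand in for the 1537 tweets left after removing English-only ones.
train_set, dev_set, test_set = split_corpus(range(1537))
```

Fixing the seed keeps the three sets stable across runs, which matters when several taggers are compared on the same dev and test data.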
Experiment Setup
Train 3 POS taggers
Morfette (Chrupała et al., 2008)
uses lemma information
tackles data sparsity
ARK (Owoputi et al., 2013)
developed initially for English tweets
no simple option to include lemma
ran separate form only vs lemma only experiments
Stanford Tagger (Toutanova et al., 2003)
ran separate form only vs lemma only experiments
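For taggers with no simple way to use form and lemma together, the form-only and lemma-only runs amount to training on two different views of the same annotated data. A sketch, assuming each token is stored as a (form, lemma, tag) triple; the function name and the single example token are illustrative and not taken from the corpus.

```python
def two_column_view(tweet, use_lemma):
    """Return (word, tag) pairs, using the lemma or the surface form as the word."""
    return [(lemma if use_lemma else form, tag) for form, lemma, tag in tweet]

# Illustrative token: 'bhfuil' is a form of the verb 'bí'.
tweet = [("bhfuil", "bí", "V")]

form_view = two_column_view(tweet, use_lemma=False)
lemma_view = two_column_view(tweet, use_lemma=True)
```

Using lemmas collapses inflected and mutated variants of a word onto one entry, which is one way to reduce sparsity in a small training set.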
Results
Training Data                 Dev      Test
Baseline
  Rule-Based Tagger           85.07    83.51
Morfette
  BaseMorf                    86.77    88.67
  NormMorf                    87.94    88.74
  BaseMorf+Dict               87.50    89.27
  NormMorf+Dict               88.47    90.22
ARK
  BaseArkForm                 88.39    89.92
  ArkForm#@                   89.36    90.94
  ArkForm#URL@                89.32    91.02
  BaseArkLemma#URL            90.74    91.62
  ArkLemma#URL@               91.46    91.89
Stanford
  BestStanForm                82.36    84.08
  BestStanLemma               87.34    88.36
Bootstrapping Best Model
  ArkLemma#URL@+NCII          92.60    93.02
Table: Results of evaluation of POS-taggers on new Irish Twitter corpus
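The slides do not state the evaluation metric; the sketch below assumes the scores are token-level accuracy in percent, that is, the share of tokens whose predicted tag matches the gold tag. The function name and the toy tag sequences are hypothetical.

```python
def tagging_accuracy(gold, predicted):
    """Percentage of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold_tags, pred_tags in zip(gold, predicted):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("gold and predicted tokenisation differ")
        correct += sum(g == p for g, p in zip(gold_tags, pred_tags))
        total += len(gold_tags)
    return 100.0 * correct / total

# Toy example: two tweets, four tokens, one tagging error.
gold = [["N", "V", "^"], ["!"]]
predicted = [["N", "V", "N"], ["!"]]
score = tagging_accuracy(gold, predicted)
```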
Analysis
Our Highest Scores 91.46 (Dev) 91.89 (Test)
Comparison with English tweet POS tagging scores
(Gimpel et al., 2011) 88.67 (Dev) 89.37 (Test)
1827 tweets. 17 annotators.
tag dictionary (based on PTB)
distributional similarity features
phonetic normalisation (Metaphone)
(Owoputi et al. 2013) 93.2
tag dictionary
also used word clustering on 56 million tweets
Analysis
Our limited resources
No frequently-capitalised token features
No distributional similarity features
No phonetic normalisation
No unsupervised clustering
2 annotators
Analysis
And...
inter- and intra-sentential code-switching
most out-of-vocabulary tokens (OOVs) are English stop words (‘to’, ‘on’, ‘for’) and words with unmarked
diacritics (in English tweets, by contrast, the most frequent OOVs are text-speak)
Possible reason for high scores
tag used for English tokens also used for abbreviations, items
and unknowns
more care taken by Irish tweeters?
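An out-of-vocabulary analysis of this kind can be sketched by counting evaluation tokens that never occur in the training data. The lowercasing and the toy token lists are assumptions made for the example, not details from the slides.

```python
from collections import Counter

def oov_counts(train_tokens, eval_tokens):
    """Count evaluation tokens (lowercased) that are absent from the training vocabulary."""
    vocab = {t.lower() for t in train_tokens}
    return Counter(t.lower() for t in eval_tokens if t.lower() not in vocab)

# Toy data: English stop words surface as the most frequent OOV items.
counts = oov_counts(["i", "dTrá", "Lí"], ["to", "Lí", "to", "on"])
most_frequent = counts.most_common(1)
```

Sorting the counter by frequency gives the list of most common OOV types, which can then be inspected by hand for English stop words or missing diacritics.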
Conclusion
Our contribution to Irish language NLP
new POS tagset for tweets
gold corpus of 1537 POS-tagged Irish tweets
statistical POS tagging models
first computational analysis of Irish social media text
Future Work
sociolinguistic studies
domain-adapted parsing
further POS-tagging with new tag for English tokens
cross-lingual analysis
Conclusion
#GRMA
Go Raibh Maith Agaibh!
(Thank You!)
New Tags
Verbal Nouns
infinitive phrases (INF-PART + VN)
progressive aspectual phrases (PREP + VN)
Multiword Hashtags
future work on parsing
Particles
relative, surname, infinitive, numeric, comparative, vocative,
adverbial
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 

Minority Language Twitter: Part-of-Speech Tagging and Analysis of Irish Tweets

  • 1. Minority Language Twitter: Part-of-Speech Tagging and Analysis of Irish Tweets. Teresa Lynn (1, 2), Kevin Scannell (3), and Eimear Maguire (1). (1) ADAPT Centre, School of Computing, Dublin City University, Ireland; (2) Department of Computing, Macquarie University, Sydney, Australia; (3) Department of Mathematics and Computer Science, St. Louis University, USA. 31st July 2015.
  • 2. Outline: Introduction; Irish Language in Social Media; Irish Twitter POS tagging; Experiments; Analysis and Conclusion
  • 4. Minority Languages in the Digital Age
      Accessing information in the Digital Age:
      - print resources: newspapers, magazines, books
      - digital resources: internet, news sites, social media, blogs, educational software
      Printing press → extinction of many minority languages
      Digital Age → extinction of minority languages?
  • 7. The Irish Language: official and national language of Ireland; official EU language; UNESCO-listed endangered language; Celtic language (Indo-European); VSO word order and morphologically rich
  • 9. Irish Language use in Ireland. Figure: Irish spoken daily outside the education system (figures from the 2011 Census; image source: Wikipedia).
  • 10. Status of Irish Language Technology. Figure: text analysis, state of LT support for 30 EU languages. Source: META-NET, “The Irish Language in the Digital Age” (Judge et al., 2012).
  • 12. Jumping on the Social Media Bandwagon. Resurgence amongst younger generation: Facebook; Blogs/Forums; Twitter
  • 17. Irish Twitter. According to the Indigenous Tweets project (www.indigenoustweets.com):
      - over 1 million tweets since Twitter’s launch in 2006
      - over 8,000 Irish language tweeters
      - type of users: media, native speakers, language enthusiasts, government bodies
      - main topics: sports, Irish language promotion, TV, community/public events, news items
  • 21. Irish Tweets
      Figure: Example of noisy Irish tweet
        Freezing i dTra Li,Ciarrai chun cinn le cuilin.
        Freezing i dTrá Lí, (tá) Ciarraí chun cinn le cúilín.
        ‘Freezing in Tralee, Kerry (is) ahead by a point.’
      - Code-switching
      - Diacritics
      - Verb drop
      - Spacing
      - Phonetic spelling: gowil (go bhfuil) ‘that is’
      - Abbreviations: grma → go raibh maith agat ‘thank you’; t7ain → tseachtain ‘week’
  • 29. Goals: build a corpus of POS-tagged Irish tweets; train a statistical POS tagger for Irish tweets; assess how we can leverage existing resources; examine the impact of noisy user-generated (UG) text on existing resources
  • 30. New Irish Twitter POS tagset (inspired by Gimpel et al. (2011))
      Tag    Description                           (PAROLE tags)
      N      common noun                           (Noun, Pron Ref, Subst)
      ^      proper noun                           (Prop Noun)
      O      pronoun                               (Pron Pers, Pron Idf, Pron Q, Pron Dem)
      VN     verbal noun                           (Verbal Noun)
      V      verb                                  (Cop, Verb*)
      A      adjective                             (Adj, Verbal Adj, Prop Adj)
      R      adverb                                (Adv*)
      D      determiner                            (Art, Det)
      P      preposition, prep. pronoun            (Prep*, Pron Prep)
      T      particle                              (Part*)
      ,      punctuation                           (Punct)
      &      conjunction                           (Conj Coord, Conj Subord)
      $      numeral, quantifier                   (Num)
      !      interjection                          (Itj)
      G      foreign words, abbreviations, item    (Foreign, Abr, Item, Unknown)
      ~      discourse marker
      #      hashtag
      #MWE   multi-word hashtag
      @      at-mention
      E      emoticon
      U      URL/email address/XML                 (Web)
  • 31. Building a POS-tagged corpus
      - new POS tagset (inspired by Gimpel et al. (2011))
      - 1550 random tweets (from a set of 950,000 Irish tweets)
      - tokenised using twokenise (Owoputi et al., 2013)
      - post-processing: rejoined multi-word units (e.g. go dtí ‘to’); split tokens with contractions (e.g. b’fhéidir → b’ fhéidir ‘maybe’)
      - pre-tagged with a rule-based tagger (using PAROLE tags)
      - mapped to new Irish Twitter POS tagset
      - hand-corrected tags (and lemmas)
  • 40. Experiment Setup
      Data setup:
      - removal of 100%-English text tweets → 1537 tweets
      - test set: 148 tweets
      - dev set: 147 tweets
      - training: 1242 tweets
      Bootstrapping data:
      - gold standard 3198-sentence POS-tagged corpus
      - grammatical, well-structured
      - no social media-related tokens (e.g. hashtags, emoticons)
  • 43. Experiment Setup. Train 3 POS taggers:
      - Morfette (Chrupala et al., 2008): uses lemma information; tackles data sparsity
      - ARK (Owoputi et al., 2013): developed initially for English tweets; no simple option to include lemma; ran separate form-only vs lemma-only experiments
      - Stanford Tagger (Toutanova et al., 2003): ran separate form-only vs lemma-only experiments
  • 46. Results
      Training Data                Dev      Test
      Baseline
        Rule-Based Tagger          85.07    83.51
      Morfette
        BaseMorf                   86.77    88.67
        NormMorf                   87.94    88.74
        BaseMorf+Dict              87.50    89.27
        NormMorf+Dict              88.47    90.22
      ARK
        BaseArkForm                88.39    89.92
        ArkForm#@                  89.36    90.94
        ArkForm#URL@               89.32    91.02
        BaseArkLemma#URL           90.74    91.62
        ArkLemma#URL@              91.46    91.89
      Stanford
        BestStanForm               82.36    84.08
        BestStanLemma              87.34    88.36
      Bootstrapping Best Model
        ArkLemma#URL@+NCII         92.60    93.02
      Table: Results of evaluation of POS-taggers on new Irish Twitter corpus
  • 48. Analysis
      Our highest scores: 91.46 (Dev), 91.89 (Test)
      Comparison with English tweet POS tagging scores:
      - (Gimpel et al., 2011): 88.67 (Dev), 89.37 (Test); 1827 tweets; 17 annotators; tag dictionary (based on PTB); distributional similarity features; phonetic normalisation (Metaphone)
      - (Owoputi et al., 2013): 93.2; tag dictionary; also used word clustering on 56 million tweets
  • 49. Analysis. Our limited resources: no frequently-capitalised token features; no distributional similarity features; no phonetic normalisation; no unsupervised clustering; 2 annotators
  • 50. Analysis
      And...
      - inter- and intra-sentential code-switching
      - OOVs are English stop words (‘to’, ‘on’, ‘for’) and unmarked diacritics (compared to English tweets, where the most frequent OOVs are text-speak)
      Possible reason for high scores:
      - tag used for English tokens also used for abbreviations, items and unknowns
      - more care taken by Irish tweeters?
  • 51. Conclusion. Our contribution to Irish language NLP: new POS tagset for tweets; gold corpus of 1537 POS-tagged Irish tweets; statistical POS tagging models; first computational analysis of Irish social media text
  • 52. Future Work: sociolinguistic studies; domain-adapted parsing; further POS-tagging with new tag for English tokens; cross-lingual analysis
  • 53. Conclusion. #GRMA: Go Raibh Maith Agaibh! (Thank You!)
  • 54. New Tags
      - Verbal Nouns: infinitive phrases (INF-PART + VN); progressive aspectual phrases (PREP + VN)
      - Multiword Hashtags: future work on parsing
      - Particles: relative, surname, infinitive, numeric, comparative, vocative, adverbial
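
The corpus-building steps on the "Building a POS-tagged corpus" slide (post-processing the tokeniser output, then mapping PAROLE tags to the new Irish Twitter tagset) could be sketched roughly as below. This is a minimal Python illustration, not the authors' code. The function names, the one-entry multi-word list, the contraction pattern (which covers only the slide's b’ example), the space-joined form of rejoined units, and the treatment of starred PAROLE entries such as Verb* as prefixes are all assumptions made for this sketch. Only the tag correspondences themselves come from the tagset slide.

```python
import re

# Hypothetical one-entry list; the slide gives only "go dtí" as an example.
MULTIWORD_UNITS = {("go", "dtí")}

# Assumed pattern: covers only the slide's b' example (straight or curly apostrophe).
CONTRACTION = re.compile(r"^([bB]['’])(.+)$")

# Exact PAROLE -> Irish Twitter tag correspondences, as listed on the tagset slide.
EXACT = {
    "Noun": "N", "Pron Ref": "N", "Subst": "N",
    "Prop Noun": "^",
    "Pron Pers": "O", "Pron Idf": "O", "Pron Q": "O", "Pron Dem": "O",
    "Verbal Noun": "VN",
    "Cop": "V",
    "Adj": "A", "Verbal Adj": "A", "Prop Adj": "A",
    "Art": "D", "Det": "D",
    "Pron Prep": "P",
    "Punct": ",",
    "Conj Coord": "&", "Conj Subord": "&",
    "Num": "$",
    "Itj": "!",
    "Foreign": "G", "Abr": "G", "Item": "G", "Unknown": "G",
    "Web": "U",
}

# Starred slide entries (Verb*, Adv*, Prep*, Part*), read here as prefixes (assumption).
PREFIXES = [("Verb", "V"), ("Adv", "R"), ("Prep", "P"), ("Part", "T")]


def split_contractions(tokens):
    """Split a token such as b’fhéidir into the two tokens b’ and fhéidir."""
    out = []
    for tok in tokens:
        m = CONTRACTION.match(tok)
        if m:
            out.extend([m.group(1), m.group(2)])
        else:
            out.append(tok)
    return out


def rejoin_multiword_units(tokens):
    """Merge adjacent tokens that form a known multi-word unit into one token."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if pair in MULTIWORD_UNITS:
            out.append(" ".join(tokens[i:i + 2]))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def postprocess(tokens):
    """Apply both post-processing steps to a tokenised tweet."""
    return rejoin_multiword_units(split_contractions(tokens))


def map_tag(parole_tag):
    """Map a PAROLE tag string to the Irish Twitter tagset."""
    # Exact lookup first, so "Verbal Noun" is not caught by the "Verb" prefix.
    if parole_tag in EXACT:
        return EXACT[parole_tag]
    for prefix, tag in PREFIXES:
        if parole_tag.startswith(prefix):
            return tag
    raise ValueError(f"no mapping for PAROLE tag: {parole_tag!r}")
```

For instance, `postprocess(["b’fhéidir", "go", "dtí"])` splits the contraction and rejoins the multi-word unit, and `map_tag("Verbal Noun")` returns `"VN"`. Tags new to the Twitter tagset (~, #, #MWE, @, E) have no PAROLE counterpart on the slide, so they are not produced by this mapping and would come from the hand-correction step.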