2023/3/6
Characterizing English Variation across Social Media Communities with BERT
YongSang Yoo
TACL 2021
Introduction
• Much previous work characterizing language variation across Internet social groups has focused on the types of words these groups use.
• This work employs BERT to characterize variation in the senses of words as well, analyzing two months of English comments in 474 Reddit communities.
Related Work
• Online language contains an abundance of “nonstandard” words (Rotabi and Kleinberg, 2016).
• Online communities’ linguistic norms and differences are often defined by which words are used (Zhang et al., 2017).
• BERT’s strength at capturing word senses presents a new opportunity to measure semantic variation in online communities of practice (Devlin et al., 2019).
• Different senses tend to be segregated into different regions of BERT’s embedding space (Wiedemann et al., 2019).
Data
• Select the 500 most popular subreddits by number of comments, then remove some of them (the analysis covers 474 communities).
• Randomly sample 80,000 comments.
• Exclude anything too general and not community-specific.
• Remove 1,044 multi-word expressions from the analysis.
• 2,226 unique glossary words.
Word Sense Disambiguation
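Example: the Korean word 배 has several senses ("belly", "pear", "times").
1. My 배 (belly) is full, so I can't eat any more.
2. This year there is a bumper crop of 배 (pears).
3. I am several 배 (times) faster than you.
4. When my cousin buys land, my 배 (belly) aches.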
Word Sense Induction
1. My 배 (belly) is full ...
4. ... my 배 (belly) aches.
3. ... several 배 (times) faster ...
2. ... a bumper crop of 배 (pears).
Methods for Identifying Community-Specific Language : Type
• Focus on lexical choice, examining the word types unique to a community
• PMI
• TF-IDF
• TextRank
• Jensen-Shannon divergence (JSD)
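As a rough illustration of two of the type-level measures listed above, the sketch below computes PMI between a word and a community, and the Jensen-Shannon divergence between two communities' word distributions. The function names, the toy counts, and the choice of base-2 logarithms are assumptions made for this example; the slides do not specify how the measures were implemented.

```python
import math
from collections import Counter

def pmi(word, community, counts):
    """PMI(w; c) = log2(P(w | c) / P(w)), estimated from raw counts.

    counts maps a community name to a Counter of word frequencies.
    """
    in_comm = counts[community]
    total_all = sum(sum(c.values()) for c in counts.values())
    word_all = sum(c.get(word, 0) for c in counts.values())
    p_w_given_c = in_comm.get(word, 0) / sum(in_comm.values())
    if p_w_given_c == 0:
        # The word never occurs in this community.
        return float("-inf")
    p_w = word_all / total_all
    return math.log2(p_w_given_c / p_w)

def jsd(p_counts, q_counts):
    """Jensen-Shannon divergence (base 2) between two word distributions."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    total = 0.0
    for w in vocab:
        p = p_counts.get(w, 0) / p_total
        q = q_counts.get(w, 0) / q_total
        m = (p + q) / 2  # mixture distribution
        if p > 0:
            total += 0.5 * p * math.log2(p / m)
        if q > 0:
            total += 0.5 * q * math.log2(q / m)
    return total

# Toy counts, invented purely for illustration.
counts = {
    "r/overwatch": Counter({"ow": 8, "game": 10, "the": 82}),
    "r/books": Counter({"ow": 0, "novel": 18, "the": 82}),
}

print(pmi("ow", "r/overwatch", counts))   # positive: over-represented here
print(pmi("the", "r/overwatch", counts))  # zero: equally common everywhere
print(jsd(counts["r/overwatch"], counts["r/books"]))
```

With these toy counts, a community-specific word such as "ow" gets a positive PMI in r/overwatch, while a word used at the same rate everywhere scores zero.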
Methods for Identifying Community-Specific Language : Meaning
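Example: "OW" means something different in each community:
• Overwatch (r/overwatch)
• Off-White (r/sneakers)
• Opening Week (r/BoxOffice)
Two approaches to inducing senses:
• BERT embeddings
• Clusters of representatives containing word substitutes predicted by BERT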
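The embedding-based idea can be sketched as follows: cluster the context vectors of one word, then count which cluster (induced sense) each community uses. This is only a toy under stated assumptions. The 2-D vectors are hand-made stand-ins for real BERT embeddings, k-means is chosen for illustration because the slides do not name a clustering algorithm, and seeding the centroids with the first k vectors is a simplification that keeps the run deterministic.

```python
import math
from collections import Counter, defaultdict

# (community, toy 2-D vector) for each occurrence of "OW".
# Real inputs would be contextual BERT embeddings; these are invented.
occurrences = [
    ("r/overwatch", (0.9, 0.1)),
    ("r/sneakers", (0.1, 0.9)),
    ("r/BoxOffice", (0.5, -0.9)),
    ("r/overwatch", (0.8, 0.2)),
    ("r/sneakers", (0.2, 0.8)),
    ("r/BoxOffice", (0.6, -0.8)),
]

def kmeans(vectors, k, iters=20):
    """Plain k-means; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(v, centroids[j]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

vectors = [vec for _, vec in occurrences]
labels = kmeans(vectors, k=3)

# Count how often each community uses each induced sense cluster.
by_community = defaultdict(Counter)
for (community, _), label in zip(occurrences, labels):
    by_community[community][label] += 1

for community, senses in by_community.items():
    print(community, dict(senses))
```

In this toy setup each community ends up using a single cluster, which mirrors the "OW" example: one surface form, three community-specific senses.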
Methods for Identifying Community-Specific Language : Meaning (Cont’d)
Evaluation
Conclusion
• This work sets a foundation for further investigation of how BERT could help define unknown words or meanings in niche communities.
• Future work could develop annotated WSI datasets for online language, similar to the standard SemEval benchmarks.
Thank you
