Introduction
• Much previous work characterizing
language variation across Internet social
groups has focused on the types of words
used by these groups.
• This work employs BERT to characterize variation in
the senses of words as well, analyzing two
months of English comments in 474 Reddit
communities.
Related Work
• Online language contains an abundance of “nonstandard” words (Rotabi and Kleinberg, 2016)
• Online communities’ linguistic norms and differences are often defined by which words are used
(Zhang et al., 2017)
• BERT’s strength in capturing word senses presents a new opportunity to measure semantic
variation in online communities of practice (Devlin et al., 2019)
• Different senses tend to be segregated into different regions of BERT’s embedding space
(Wiedemann et al., 2019)
Data
• Select the 500 most popular subreddits by number of comments, then remove
some subreddits
• Randomly sample 80,000 comments
• Exclude glossary terms that are too general and not specific to a community
• Remove 1,044 multi-word expressions from the analysis
• 2,226 unique glossary words
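The selection and sampling steps above could be sketched roughly as follows. This is only an illustration: the function names `top_subreddits` and `sample_comments`, the fixed seed, and the toy inputs are assumptions, and the slide does not state whether the 80,000-comment sample is drawn per subreddit or which subreddits are removed.

```python
import random

def top_subreddits(comment_counts, k=500):
    """Return the k subreddits with the most comments (hypothetical helper)."""
    return sorted(comment_counts, key=comment_counts.get, reverse=True)[:k]

def sample_comments(comments, n=80_000, seed=0):
    """Draw a uniform random sample of at most n comments (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    if len(comments) <= n:
        return list(comments)
    return rng.sample(comments, n)

# Toy, made-up inputs; real inputs would be Reddit comment dumps.
counts = {"r/a": 3, "r/b": 10, "r/c": 1}
popular = top_subreddits(counts, k=2)
subset = sample_comments(list(range(100)), n=10)
```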
Methods for Identifying Community-Specific Language: Type
• Type-based methods focus on lexical choice, examining the word types unique to a community
• Pointwise mutual information (PMI)
• TF-IDF
• TextRank
• Jensen-Shannon divergence (JSD)
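As an illustration of the first type-based measure, here is a minimal sketch of plain PMI between a word and a community, computed from per-community word counts. The exact PMI variant used (for example, a normalized one) is not stated on the slide, and the function name `pmi` and the toy counts are assumptions made for this example.

```python
import math
from collections import Counter

def pmi(word, community, counts):
    """PMI(w; c) = log2( P(w | c) / P(w) ), from per-community word counts."""
    total_all = sum(sum(c.values()) for c in counts.values())
    word_all = sum(c[word] for c in counts.values())
    total_c = sum(counts[community].values())
    word_c = counts[community][word]
    if word_c == 0:
        return float("-inf")  # word never occurs in this community
    p_w_given_c = word_c / total_c
    p_w = word_all / total_all
    return math.log2(p_w_given_c / p_w)

# Toy, made-up counts (not data from the study).
counts = {
    "r/overwatch": Counter({"ow": 6, "game": 2}),
    "r/sneakers": Counter({"ow": 2, "shoe": 6}),
}
shoe_score = pmi("shoe", "r/sneakers", counts)
ow_score = pmi("ow", "r/overwatch", counts)
```

A high score means the word is much more frequent in the community than overall. Note that a word like "ow" can score moderately in several communities while meaning something different in each, which a type-based measure cannot detect.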
Methods for Identifying Community-Specific Language: Meaning
• Example: “OW” means Overwatch in r/overwatch, Off-White in r/sneakers, and
Opening Week in r/BoxOffice
• BERT embeddings
• Clustering representatives that contain word
substitutes predicted by BERT
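The intuition behind the embedding-based approach can be sketched with toy vectors: if occurrences of "OW" from different communities land in different regions of the embedding space, clustering them separates the senses. Everything below is an assumption made for illustration: the 2-D vectors are made up and stand in for real BERT token embeddings, and the small k-means routine is a generic clustering choice, not necessarily the one used in the study.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def kmeans(points, k, iters=10):
    """Tiny k-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = mean(members)
    return labels

# Made-up 2-D stand-ins for embeddings of "OW" in three communities.
occurrences = [
    ("r/overwatch", (0.9, 0.1)),
    ("r/overwatch", (1.0, 0.2)),
    ("r/sneakers", (0.1, 0.9)),
    ("r/sneakers", (0.2, 1.0)),
    ("r/BoxOffice", (-0.9, -0.8)),
    ("r/BoxOffice", (-1.0, -0.9)),
]
labels = kmeans([vec for _, vec in occurrences], k=3)
```

In this toy setup each community's occurrences fall into their own cluster, mirroring the three senses of "OW" listed above.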
Conclusion
• This work sets a foundation for further investigation of how BERT could
help define unknown words or meanings in niche communities
• Future work could develop annotated word sense induction (WSI) datasets for online
language, similar to the standard SemEval benchmarks