Language change and innovation are constant in online and offline communication, and have led to new words entering people's lexicons and even modern-day dictionaries, with recent additions such as `e-cig' and `vape'. However, the manual work required to identify these `innovations' is both time-consuming and subjective. In this work we demonstrate how such innovations in language can be identified across two different OSNs (Online Social Networks) through the operationalisation of known language acceptance models that incorporate relatively simple statistical tests. By grounding our work in language theory, we identified three statistical tests that can be applied: variation in frequency, form and meaning. Each shows different success rates across the two networks (a geo-bound Twitter sample and a sample of Reddit). These tests were also applied at different community levels within the two networks, allowing different innovations to be identified across different community structures, for instance regional variation across Twitter and variation across groupings of subreddits. Identified example innovations included `casualidad' and `cym'.
Towards Modelling Language Innovation Acceptance in Online Social Networks
1. Towards Modelling Language Innovation Acceptance in
Online Social Networks
24th November 2015
Daniel Kershaw – d.kershaw1@lancaster.ac.uk
2. Daniel Kershaw
Computer Science BSc – Lancaster University – 2009
Digital Innovation MRes – Highwire – Lancaster University – 2010
PhD Candidate – 2010 – Now
Supervisors
Dr. Matthew Rowe – School of Computing and Communication (SCC)
Dr. Patrick Stacey – Management Science (LUMS)
Research Area:
Social Computing
Big Data / Big Data Systems
Who Am I
3. “Language, never forget, is more like fashion than
science, and matters of usage, spelling, and
pronunciation tend to wander around like
hemlines”
- Bill Bryson, The Mother Tongue: English and How
It Got That Way
4. Language is in constant change
Online communication adds extra pressure through the merging of time and
space
– Awesomesauce
– Bants
– beer o’clock
– brain fart
– Brexit
– bruh
Language is Constantly Changing
5. State of the Art - Detection Of Innovation
Three studies
1. Looking for the terms “this means”/“is defined as”
2. Using known heuristics of blends to detect origins
3. Detecting changes in the semantic orientation of words
Cook, P., & Stevenson, S. (2007)
Cook, P., & Stevenson, S. (2010)
6. State of the Art - Diffusion
Identify words that exist in small time frame
Model diffusion using Monte Carlo simulations
Showed existence of wave and gravity diffusion
models.
Does not detect local community innovations
Eisenstein, J., O'Connor, B., Smith, N. A., & Xing, E. P. (2012, October)
7. State of the Art - Change in meaning
Showed a change in meaning of
words
Performed on Google N-grams
dataset
Removed the concept of community
Kulkarni, V., Al-Rfou, R., Perozzi, B., & Skiena, S.
8. State of the Art - User Language Change
Change in language as a tool to predict
users leaving social network
Initially a user's language conforms to the group
Before they leave, their language veers away
from the language of the group
Danescu-Niculescu-Mizil C, West R, Jurafsky D, Leskovec J, Potts C
9. 1. Word innovation acceptance models through computational means
2. Identification of local and global acceptance
3. Multiple network analysis
4. Large Corpus Analysis
Contribution
10. Metcalf’s FUDGE
Frequency of the word
Unobtrusiveness of the word
Diversity of users and situations
Generation of other forms and
meanings
Endurance of the concept
Grounded Models
Barnhart’s VFRGT
(V) Number of forms
(F) Frequency of word
(R) Number of sources
(G) Number of genres
(T) Time Span of Word
Linguists and lexicographers aim to understand language
Developed heuristics to aid the decision to include words in dictionaries
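The two heuristics above can be sketched as simple scoring functions. This is a minimal illustration, not the method used in this work: it assumes that each FUDGE factor is rated 0, 1 or 2 and the ratings are summed, and that the VFRGT factors are multiplied together. The function and parameter names are hypothetical.

```python
from math import prod

def fudge_score(frequency, unobtrusiveness, diversity, generation, endurance):
    """Sum of the five FUDGE ratings (assumption: each is rated 0, 1 or 2)."""
    ratings = (frequency, unobtrusiveness, diversity, generation, endurance)
    if any(r not in (0, 1, 2) for r in ratings):
        raise ValueError("each FUDGE factor is assumed to be rated 0, 1 or 2")
    return sum(ratings)

def vfrgt_score(forms, frequency, sources, genres, time_span):
    """Product of the five VFRGT factors (assumption: multiplicative combination)."""
    return prod((forms, frequency, sources, genres, time_span))
```

For example, a word rated 2, 1, 2, 1, 0 on the FUDGE factors would score 6 out of a possible 10 under these assumptions.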
11. The Data
                  Twitter       Reddit
Users             3,108,844     ≈25,000,000
Posts             73,528,954    ≈500,000,000
Communities       3,046         121,373
Words (n > 200)   373,217       2,712,629
Time Periods      283 days      880 days
Data gathered from the June 2015 Reddit Data Release https://goo.gl/j116ML
12. Data Groupings
Group by Time
Day of year
Geo Location for Twitter
UK -> North West -> LA -> LA1
Subreddit for Reddit
Reddit -> meta interest group ->
subreddit
13. Variation in Frequency
Assess changes in raw frequency and user frequency over time
Diversity in Form
Assess users' adoption of varying forms, e.g. the addition of -ing
Diversity in Meaning
Over time, can we see a convergence in the meaning of the word?
Measures
14. • BNC (British National Corpus) – Gold
standard of English
• Filter out:
– Hash tags
– URLs
– Punctuation
– Emoticons / Emoji
• Light Normalization:
– soooooooo -> soo
What is an Innovation
                          Twitter   Reddit
# innovations (N > 200)   62,141    373,217
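The filtering and light normalisation on this slide can be sketched as below. This is an illustrative sketch only: the regular expressions are our own assumptions about what counts as a hash tag, URL, punctuation or emoji, and `known_words` stands in for a reference word list such as the BNC vocabulary. It also assumes an innovation is any token absent from that list.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
NON_WORD_RE = re.compile(r"[^\w\s']|_")   # punctuation, emoticons, emoji
REPEAT_RE = re.compile(r"(.)\1{2,}")      # a character repeated 3+ times

def normalise(text):
    """Lower-case, filter and lightly normalise a post, returning its tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)          # remove URLs before punctuation
    text = HASHTAG_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1\1", text)   # soooooooo -> soo
    return text.split()

def candidate_innovations(tokens, known_words):
    """Keep tokens absent from the reference vocabulary (assumed definition)."""
    return [t for t in tokens if t not in known_words]
```

Calling `normalise("Soooooooo good!!! #blessed")` yields `["soo", "good"]`, and filtering against a vocabulary containing `good` leaves `soo` as a candidate.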
15. • Normalized word count
– Per time period
– Per community
• Normalized user word count
– Per time period
– Per community
Variation in Frequency
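A minimal sketch of the two frequency measures follows. It assumes each post arrives as a `(period, community, user, tokens)` tuple, that the word count is normalised by the total tokens in that period and community, and that the user count is normalised by the number of active users there. These normalisers and all names are assumptions for illustration.

```python
from collections import Counter, defaultdict

def normalised_counts(posts):
    """Return (word_freq, user_freq), each keyed by (period, community)."""
    word_counts = defaultdict(Counter)                 # key -> word -> count
    totals = Counter()                                 # key -> total tokens
    word_users = defaultdict(lambda: defaultdict(set)) # key -> word -> users
    users = defaultdict(set)                           # key -> active users
    for period, community, user, tokens in posts:
        key = (period, community)
        word_counts[key].update(tokens)
        totals[key] += len(tokens)
        users[key].add(user)
        for token in set(tokens):
            word_users[key][token].add(user)
    word_freq = {k: {w: c / totals[k] for w, c in counts.items()}
                 for k, counts in word_counts.items()}
    user_freq = {k: {w: len(u) / len(users[k]) for w, u in by_word.items()}
                 for k, by_word in word_users.items()}
    return word_freq, user_freq
```

The user-level measure is less sensitive to a single account repeating a word many times, which is why both are tracked.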
16. Assess the prefix and suffix addition of an
innovation
List of prefix and suffixes from the OED
– apple -> apples
– hero -> antihero
Diversity in Form
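The form test can be sketched as a lookup of affixed variants in the observed vocabulary. The short affix tuples below are placeholders for the OED lists mentioned on the slide, and the function name is hypothetical. Plain concatenation is assumed, so spelling changes at the morpheme boundary are not handled.

```python
PREFIXES = ("anti", "un", "re")        # placeholder for the OED prefix list
SUFFIXES = ("s", "es", "ing", "ed")    # placeholder for the OED suffix list

def derived_forms(word, vocabulary, prefixes=PREFIXES, suffixes=SUFFIXES):
    """Return the affixed variants of `word` that occur in `vocabulary`."""
    candidates = {p + word for p in prefixes} | {word + s for s in suffixes}
    return sorted(candidates & set(vocabulary))
```

The number of returned forms could then serve as a simple diversity-in-form score for each innovation.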
17. Diversity in Meaning
Looking for innovations that have not
been seen before
No solidified meaning within existing
systems e.g. WordNet
18. Learns the embedding of words within a
corpus using word2vec
Developed by Google in 2013
Uses documents to train a neural net
Diversity in Meaning - word2vec
19. Use the data groupings: time and location
Train a w2v model on each split of the data, e.g.
London, week 1
Query each model with each innovation
(top 100 synonyms), e.g. fleek
Compute the similarity between each region in
a time period for an innovation, e.g. week 1
fleek
Diversity in Meaning
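The comparison step can be sketched as follows, assuming the top-100 neighbour lists have already been retrieved from each regional model (for instance with gensim's `most_similar(word, topn=100)`). Jaccard overlap is used here as one possible similarity; the slides do not state which measure was used, so treat it and the function names as assumptions.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap between two neighbour lists."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def mean_pairwise_similarity(neighbours_by_region):
    """Average Jaccard overlap over all pairs of regions for one innovation."""
    pairs = list(combinations(sorted(neighbours_by_region), 2))
    if not pairs:
        return 0.0
    total = sum(jaccard(neighbours_by_region[r1], neighbours_by_region[r2])
                for r1, r2 in pairs)
    return total / len(pairs)
```

Tracking this value per time period gives a series whose rise would suggest that regions are converging on a shared meaning.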
20. Looking for statistically significant growth or
decay of an innovation
Presume language change happens in a
monotonic fashion
Compute Spearman's rank correlation for each time series
X value is days since start of data
Y value is normalized frequency of word
Value range -1 to 1
Sampling the Data
21. Sampling the Data
Classify a change as statistically significant when it falls above or
below the 95% confidence interval.
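The monotonic-trend test can be sketched in plain Python as Spearman's rank correlation between days since the start of the data and normalised frequency. Tied values receive their average rank. This sketch computes only the coefficient; how the 95% threshold is derived is not specified on the slides and is left out.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def spearman(days, freqs):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(days), average_ranks(freqs))
```

A value near 1 indicates steady growth of an innovation, a value near -1 steady decay. In practice a library routine such as `scipy.stats.spearmanr` also returns a p-value that could be used for the significance cut-off.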
39. • Susceptible to excessive usage of a word
• Solution could be:
– Smoothing of data
– Sampling to give equal representation of word
Diversity in Meaning
41. Word of the Year
Collins
binge-watch, verb
clean eating, noun
contactless, adjective
Corbynomics, noun
dadbod, noun
ghosting, noun
manspreading, noun
shaming, noun
swipe, verb
transgender, adjective
Oxford
😂
Ad blocker, noun
Brexit, noun
Dark Web, noun
On fleek, adjective phrase
Lumbersexual, noun
Refugee, noun
Sharing economy, noun
They (singular), pronoun
49. • Is language dependent on community structure?
– Modeling Social Reinforcement and Homophily
– Again modeling on multiple levels and across different networks
• Is diffusion affected by the form of network, e.g. geographical Twitter or
cross-posting on Reddit?
• Who are the most influential in language innovation and adoption
– Fitting of general threshold models to predict when people adopt a term
– How to perform this at scale
Where next
51. Eisenstein, J., O'Connor, B., Smith, N. A., & Xing, E. P. (2012, October). Mapping
the geographical diffusion of new words. arXiv.org.
Eisenstein, J., O'Connor, B., Smith, N. A., & Xing, E. P. (2014). Diffusion of Lexical
Change in Social Media. PLoS ONE, 9(11), e113114.
http://doi.org/10.1371/journal.pone.0113114
Goldberg, Y., & Levy, O. (2014, February 15). word2vec Explained: deriving
Mikolov et al.'s negative-sampling word-embedding method.
Metcalf, A. A. (2004). Predicting New Words. Houghton Mifflin Harcourt.
Barnhart, D. K. (2007). A Calculus for New Words. Dictionaries: Journal of the Dictionary Society of North America, 28(1), 132–138.
http://doi.org/10.1353/dic.2007.0009
References
52. Kulkarni, V., Al-Rfou, R., Perozzi, B., & Skiena, S. (2014, November 12).
Statistically Significant Detection of Linguistic Change. arXiv.org. PeerJ Inc.
http://doi.org/10.7717/peerj.68/table-1
Danescu-Niculescu-Mizil, C., West, R., Jurafsky, D., Leskovec, J., & Potts, C.
(2013). No country for old members: user lifecycle and linguistic change in
online communities (pp. 307–318). International World Wide Web Conferences
Steering Committee.
Cook, P., Han, B., & Baldwin, T. (n.d.). Statistical Methods for Identifying Local
Dialectal Terms from GPS-Tagged Documents.
Cook, P., & Stevenson, S. (2007). Automagically inferring the source words of
lexical blends. Presented at the Proceedings of the Tenth Conference of the ….
References
Editor's Notes
Blends - We use linguistic and cognitive aspects of this process to motivate a computational treatment of neologisms formed by blending.
Variation of mutual information measures
Variation was assessed by computing a variation of PMI in correlation with search engine query result counts; an issue is that the search engine is a black box.
Mention POS tagging, change point detection, applied to different networks, distribution of the usage of the word