
Tracking the Emergence of New Words across Time and Space


  1. Tracking the Emergence of New Words across Time and Space
     Jack Grieve, Aston University. Research conducted with Diansheng Guo & Alice Kasakoff (University of South Carolina) and Andrea Nini (Aston University). Funded as part of the Digging into Data Challenge.
  2. Approaches to Historical Linguistics
     There are several different approaches to the analysis of language change:
     - Reconstruction through comparison of known languages (comparative method)
     - Analysis of previous linguistic research (e.g. lexicographic research)
     - Analysis of historical texts (corpus-based)
     - Apparent time studies with interview data (sociolinguistics)
     - Computer simulations
  3. Lexical Change
     Research in historical linguistics and etymology has analysed how the usage of certain words has changed over relatively long periods of time (primarily based on historical corpora and lexicographic research), but overall there are large gaps in our knowledge of lexical change, including how newly emerging words enter a language and spread across its speakers.
  4. Words are Rare Events
     The main problem with studying lexical variation and change is that most words are incredibly rare, so very large corpora of natural language are required. This is why most research on lexical variation and change has focused on relatively high frequency words, primarily function words (e.g. pronouns, prepositions, auxiliary verbs).
  5. Word Frequency Distribution (Zipf 1935, 1945) [figure]
  7. Word Frequency Distribution (Zipf 1935, 1945)
     The majority of the 67,000 most frequent words in our corpus occur less than once per 25 million words.
  8. New Words are Incredibly Rare Events
     The analysis of new words requires even more data, because emerging words are by definition especially rare. In addition, to analyse the temporal and spatial spread of new words, large corpora must be compiled for a large number of points in time and locations.
  9. Big Data
     Suitable data has recently become available with the rise of social media and smartphones, which provide massive amounts of time-stamped and geocoded natural language data.
  10. Goals of Today’s Talk
      - Identify emerging words from 2014 based on a multi-billion word corpus of American tweets.
      - Chart their usage over time and identify common temporal patterns of lexical spread.
      - Map their geographical diffusion and identify common spatial patterns of lexical spread.
  11. The Corpus
      Since 2013, the team at USC have been compiling two multi-billion word geocoded corpora for the US and the UK using the Twitter API. Twitter is a particularly rich source of geocoded data and is also very popular, informal, and youthful, making it ideal for tracking the emergence of new words. Approximately 2% of tweets are geocoded.
  12. The Corpus
      The analysis today is based on an 8.9 billion word corpus of American tweets from October 2013 to November 2014, which totals approximately 980 million tweets from 7 million users. Every tweet is geocoded with the precise longitude and latitude of the user at the time of posting; these coordinates were then used to identify the county where each tweet was produced.
  13. -87.684555,42.074043 Just posted a photo @ Baha'i House of Worship
  21. Corpus Examples
      username,fips,time,tweet
      -,48439,Sun Jul 27 23:59:59 EDT 2014, don't follow the right ppl lol
      -,42007,Sun Jul 27 23:59:59 EDT 2014, yesss moody judy
      -,36005,Sun Jul 27 23:59:59 EDT 2014, Man i was just thinking shexx be lurking but won't hmu
      -,25021,Sun Jul 27 23:59:59 EDT 2014, no seeing u on tv is reel but not seeing u on twitter is real for me...so pls visit us here everyday.
      -,26163,Sun Jul 27 23:59:59 EDT 2014, Hate seeing my friends sad
      -,12093,Sun Jul 27 23:59:59 EDT 2014, this is the shirt i won that i got to sign btw!!:)
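A row in the layout shown on the Corpus Examples slide could be read along the following lines. This is only a sketch: the `TweetRecord` type and `parse_row` function are hypothetical names, and it assumes that only the tweet field can contain commas, which the slides do not state.

```python
from typing import NamedTuple


class TweetRecord(NamedTuple):
    """One corpus row (hypothetical container, not from the talk)."""
    username: str
    fips: str   # five-digit county FIPS code; the first two digits identify the state
    time: str   # kept as raw text, e.g. "Sun Jul 27 23:59:59 EDT 2014"
    tweet: str


def parse_row(line: str) -> TweetRecord:
    # Split on the first three commas only, so that any commas
    # inside the tweet text stay part of the tweet.
    username, fips, time, tweet = line.split(",", 3)
    return TweetRecord(username.strip(), fips.strip(), time.strip(), tweet.strip())


row = parse_row("-,12093,Sun Jul 27 23:59:59 EDT 2014, this is the shirt i won that i got to sign btw!!:)")
state_prefix = row.fips[:2]  # "12" for the county code 12093
```

Keeping the timestamp as text avoids platform-dependent parsing of the time zone abbreviation; a real pipeline would need to decide how to normalise it.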
  22. Graveyard/Cemetery [figure]
  24. Graveyard/Cemetery: Percent [figure]
  25. Graveyard/Cemetery: Smoothed (Getis-Ord Gi) [figure]
  26. Identifying Rising Words
      To find newly emerging words, we first measured the degree to which the usage of each word in the corpus had been rising over the 13 month period. To identify these rising words, we extracted the 67,000 words that occur at least 1,000 times in the corpus and, for each word, correlated its relative frequency per day with the day of the year using Spearman’s rank correlation coefficient.
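The rising-word measure described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names are hypothetical, ties are given average ranks (a standard choice that the slides do not specify), and the daily counts in the usage lines are invented.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def rising_score(daily_counts, daily_totals):
    """Correlate a word's daily relative frequency with the day index."""
    days = list(range(len(daily_counts)))
    relative = [count / total for count, total in zip(daily_counts, daily_totals)]
    return spearman_rho(days, relative)


# Invented counts for a word that rises steadily over five days.
steady_riser = rising_score([1, 2, 3, 4, 5], [1000] * 5)
steady_faller = rising_score([5, 4, 3, 2, 1], [1000] * 5)
```

Because the coefficient uses ranks only, a word scores close to 1 whenever its relative frequency increases consistently, regardless of how large the increase is.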
  27. ρ = .116 [figure]
  28. ρ = .044 [figure]
  29. ρ = .044; ρ = -.028 [figure]
  30. The Top 10 Rising Words on Twitter 2014
      Word        ρ      Definition
      fuckboy     0.947  Asshole, Jerk, Poser, Tool, etc.
      rn          0.938  Right Now (Top Riser 2013)
      hbd         0.928  Happy Birthday
      fw          0.927  Fuck with
      unbothered  0.926  Unconcerned & Disengaged
      ft          0.925  Face time
      gmfu        0.924  Get me fucked up
      sm          0.919  So Much
      squad       0.919  Squad
      asf         0.918  As fuck
  31. Identifying Emerging Words
      Although measuring correlations allows rising words to be identified, most are already far too common by 2014 to show patterns of regional spread. To identify emerging words, we cross-referenced the list of rising words against a list of rare words, defined as words with low overall frequencies in the fourth quarter of 2013 (excluding proper nouns).
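The cross-referencing step might look like the sketch below. The ρ values are taken from the rising-words table, but the fourth-quarter rates, both thresholds, and the function name are invented for illustration, since the slides do not give the cut-offs used. Proper-noun exclusion is not handled here.

```python
def emerging_words(rho_by_word, q4_2013_per_million, max_q4_rate, min_rho):
    """Keep words that rise strongly AND were rare in the baseline quarter.

    max_q4_rate and min_rho are hypothetical thresholds.
    """
    candidates = [
        word
        for word, rho in rho_by_word.items()
        if rho >= min_rho and q4_2013_per_million.get(word, 0.0) <= max_q4_rate
    ]
    # Strongest risers first.
    return sorted(candidates, key=lambda word: rho_by_word[word], reverse=True)


# Rho values from the rising-words slide.
rho_by_word = {"rn": 0.938, "hbd": 0.928, "unbothered": 0.926, "squad": 0.919}
# Made-up baseline rates (occurrences per million words in Q4 2013).
q4_rates = {"rn": 50.0, "hbd": 20.0, "unbothered": 0.1, "squad": 30.0}

emerging = emerging_words(rho_by_word, q4_rates, max_q4_rate=1.0, min_rho=0.9)
```

With these invented rates only `unbothered` passes both filters, which mirrors the way common risers such as `rn` drop out of the emerging list.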
  32. Top 10 Emerging Words on Twitter 2014
      Word        ρ      Definition
      unbothered  0.926  Unconcerned & Disengaged
      gmfu        0.924  Get Me Fucked Up
      joggers     0.908  Jogging pants
      fuckboys    0.902  Losers, wimps, posers, etc.
      rekt        0.900  Wrecked
      tfw         0.879  That feel when
      xans        0.878  Benzodiazepine pills
      baeless     0.875  To be without a bae
      boolin      0.857  Hanging out, esp. young men
      lordt       0.854  Lord, as exclamation
  33. Top 11-20 Emerging Words on Twitter 2014
      Word        ρ      Definition
      celfie      0.852  selfie
      slays       0.843  impresses, succeeds at, etc.
      famo        0.840  family and friends
      fuckboi     0.838  fuckboy
      (on) fleek  0.838  on point, esp. eyebrows
      faved       0.836  to favorite something
      gainz       0.828  earnings
      bruuh       0.817  bro
      amirite     0.816  am I right
      notifs      0.808  notifications, especially online
  34. http://www.google.co.uk/trends/explore#q=unbothered
  35. S-shaped Curves
      In the time charts for many of the rising and emerging words we see clear S-curves, or what look like the start of S-curves.
  36. S-shaped Curves
      Similar S-shaped patterns have been found repeatedly in sociolinguistic apparent time studies (see Labov, 2001), as well as in corpus-based research in historical linguistics (e.g. Nevalainen & Raumolin-Brunberg, 2003). They have also been obtained in research on the diffusion of innovations (see Rogers, 2003), where the pattern is referred to as an S-shaped Curve of Diffusion.
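One common way to formalise an S-shaped curve is the logistic function. The talk does not say that a logistic model was fitted, so the sketch below is only an illustration of the shape, with hypothetical parameter names and invented values.

```python
import math


def logistic(t, ceiling, rate, midpoint):
    """Logistic curve: slow start, rapid rise around the midpoint, then levelling off."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


# Invented parameters: usage levels off at 100, with the steepest rise at t = 0.
early_growth = logistic(-5, 100, 1.0, 0) - logistic(-6, 100, 1.0, 0)
middle_growth = logistic(1, 100, 1.0, 0) - logistic(0, 100, 1.0, 0)
late_growth = logistic(6, 100, 1.0, 0) - logistic(5, 100, 1.0, 0)
```

The growth per time step is small early on, largest around the midpoint, and small again near the ceiling, which is the pattern described as an S-curve on the slides.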
  37. https://www.uni-due.de/SHE/S-Curve.JPG
  38. Rogers (2003: 11)
  39. Summary: Time Patterns
      New words rise (and fall) very quickly in Modern English, with numerous new words entering the language and quickly rising in usage every year. The usage of emerging words over time tends to follow an S-shaped curve, echoing results found in sociolinguistic apparent time studies and diffusion of innovation research.
  40. Goals of Today’s Talk
      - Identify emerging words from 2014 based on a multi-billion word corpus of American tweets.
      - Chart their usage over time and identify common temporal patterns of lexical spread.
      - Map their geographical diffusion and identify common spatial patterns of lexical spread.
  41. Mapping the Spread of New Words
      An important technical problem is how to map the spread of a new word across a region. One approach is to map the relative frequency (e.g. occurrences per million words) of the word across a series of regional corpora (e.g. all the tweets from a particular county) over a series of time points.
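The relative-frequency approach can be sketched as below, treating each county and month as its own small corpus. The function name, the month granularity, the whitespace tokenisation, and the three records are all assumptions made for illustration; the FIPS codes are placeholders for two counties.

```python
from collections import defaultdict


def per_million_by_county_month(records, target):
    """records: iterable of (fips, month, text) tuples.

    Returns {(fips, month): occurrences of target per million words}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for fips, month, text in records:
        tokens = text.lower().split()  # naive tokenisation, for illustration only
        key = (fips, month)
        totals[key] += len(tokens)
        hits[key] += tokens.count(target)
    return {key: 1_000_000 * hits[key] / totals[key] for key in totals if totals[key] > 0}


# Invented records: two tweets from one county, one from another.
sample = [
    ("13121", "2014-01", "so unbothered today"),
    ("13121", "2014-01", "hate seeing my friends sad"),
    ("13051", "2014-01", "yesss moody judy"),
]
rates = per_million_by_county_month(sample, "unbothered")
```

Each value in `rates` is one cell of one map in a time series: one county at one time point. With real data the per-county totals are far larger, which is why the rate is expressed per million words.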
  42. Geographical Diffusion of Linguistic Forms
      Two major theories have been proposed to explain how new linguistic forms generally spread in language:
      - The Wave Model states that new forms spread out radially from their source.
      - The Gravity Model states that new forms spread out from one urban area to the next, based on distance and population size, only later filling in less populated areas in between.
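The difference between the two models can be made concrete by comparing the order in which each predicts places will adopt a form. The sketch below uses a simple inverse-square gravity score (population product divided by squared distance), which is one common formulation and an assumption here rather than the version used in the talk. The place names, populations, and distances are invented.

```python
def wave_order(distances_km):
    """Wave model: places adopt in order of increasing distance from the source."""
    return sorted(distances_km, key=lambda place: distances_km[place])


def gravity_order(source_population, populations, distances_km):
    """Gravity model (assumed inverse-square form): larger, closer places adopt first."""
    def score(place):
        return source_population * populations[place] / distances_km[place] ** 2

    return sorted(populations, key=score, reverse=True)


# Invented example: a small village near the source and a large city far away.
populations = {"near_village": 2_000, "far_city": 500_000}
distances_km = {"near_village": 20.0, "far_city": 200.0}

wave = wave_order(distances_km)
gravity = gravity_order(1_000_000, populations, distances_km)
```

Under the wave model the nearby village adopts first, while under this gravity score the distant city comes first because its population outweighs the extra distance.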
  43. Assessing the Wave and Gravity Models
      We can begin to assess the validity of the wave and gravity models for lexical spread by examining the spread of unbothered. This analysis can be facilitated by focusing on one state where the form eventually becomes relatively common, for example Georgia.
  44. Population Density of Georgia [map; cities labelled: Atlanta, Columbus, Macon, Augusta, Savannah]
  45-57. [Series of maps of Georgia, one per month from 01 November 2013 to 01 November 2014; cities labelled: Atlanta, Columbus, Macon, Augusta, Savannah]
  58. Assessing the Wave and Gravity Models
      The geographical spread of unbothered in Georgia appears to be more complex than predicted by the Wave or Gravity Model, although both appear to offer a partial explanation for this pattern of spread. The percentage of African Americans, however, also appears to be an important predictor.
  59. African Americans in Georgia [map; cities labelled: Atlanta, Columbus, Macon, Augusta, Savannah]
  60. 01 November 2014 [map of Georgia; same cities labelled]
  62. Mapping the Spread of New Words on One Map
      Presenting a time series of maps is an effective way to map lexical spread, but another technical issue is how to map emerging words on one map:
      - Relative frequency
      - Date of first (or second...) occurrence
      - Number of words until first (or second...) occurrence
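The second and third options above could be computed per county roughly as follows. This is a sketch under stated assumptions: the function name is hypothetical, dates are assumed to be ISO strings so that they sort chronologically, tokenisation is by whitespace, and "number of words until" is taken to include the occurrence itself. The records are invented.

```python
def nth_occurrence_by_county(records, target, nth=1):
    """records: iterable of (fips, iso_date, text) tuples.

    Returns {fips: (date of nth occurrence, words produced in that county
    up to and including that occurrence)}.
    """
    seen = {}
    words_so_far = {}
    result = {}
    for fips, date, text in sorted(records, key=lambda record: record[1]):
        if fips in result:
            continue  # this county has already reached its nth occurrence
        for token in text.lower().split():
            words_so_far[fips] = words_so_far.get(fips, 0) + 1
            if token == target:
                seen[fips] = seen.get(fips, 0) + 1
                if seen[fips] == nth:
                    result[fips] = (date, words_so_far[fips])
                    break
    return result


# Invented records for two placeholder counties "A" and "B".
sample = [
    ("A", "2014-01-01", "hello world"),
    ("A", "2014-01-02", "so unbothered rn"),
    ("B", "2014-01-03", "unbothered"),
]
first = nth_occurrence_by_county(sample, "unbothered")
```

Counting words rather than dates partly controls for how much each county tweets: a sparsely populated county may reach its first occurrence late in calendar time but after relatively few words.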
  65. Summary: Regional Patterns
      New words originate from across the US, including the Southeast (e.g. unbothered, baeless, boolin), the North (e.g. fuckboy, gainz), and the West (e.g. rekt), and tend to spread within these regions first. Otherwise, the spread of new words appears to be highly complex, affected by numerous factors, including proximity, population density, and demographic patterns.
  66. Traditional Approaches to Historical Linguistics
      The empirical analysis of language change is generally based on historical corpora, which tend to span centuries, or collections of linguistic interviews, which tend to span generations (i.e. based on apparent time). Both sources of data tend to provide a broad temporal scope but limited temporal resolution and limited amounts of data (<1 million words).
  67. The Uniformitarian Principle
      “Knowledge of processes that operated in the past can be inferred by observing ongoing processes in the present” (Christy, 1983: ix). This Uniformitarian Principle is cited in Labov (2001) to justify the use of apparent time interview data in place of historical corpora, but it also justifies the use of extremely large and dense contemporary corpora in place of both of these more common approaches.
  68. A Modern Approach to Historical Linguistics
      Working with modern language data mined from online sources allows unprecedentedly large, rich and dense natural language corpora to be compiled. Although historical scope is lost, this approach allows language change to be analysed in far greater detail than would otherwise be possible.
  69. Tracking the Emergence of New Words across Time and Space
      Jack Grieve, Centre for Forensic Linguistics, Aston University
      Email: j.grieve1@aston.ac.uk
      Website: https://sites.google.com/site/jackgrieveaston
      Twitter: @JWGrieve
