The Linguistics of Twitter - PyCon 2011 Presentation

'The Linguistics of Twitter' presentation from PyCon 2011, which I hope starts a dialogue about what we need in order to accurately measure the effects of social media.

Published in: Technology, Business
  • Potential Solutions Methodology via Peter Norvig Beautiful Data, Ch14
  • German translation of the Declaration of Independence 7/9/1776
  • But What Can We Use As A Guide?
  • Ebonics is not the correct terminology.
  • Center For Applied Linguistics: "Like other dialects of English, AAE is a regular, systematic language variety that contrasts with other dialects in terms of its grammar, pronunciation, and vocabulary."

    1. American English Regional Dialects
       Changing Speech Patterns
       Changing Online Measurement
       Michael D. Healy
       [email_address]
       http://michaeldhealy.com
       @MichaelDHealy
    2. Michael D. Healy
       • Econometrics
       • Linguistics
       • Not an Engineer
       • Measuring and Influencing Online and Offline Behavior
       • Why am I here?
       • This Seemed Like an Interesting Problem
    3. Plan of Action
       • Background
       • Where We Stand
           • Data Collection Interlude
       • Historical Context
       • Where We May Be Going
       • Potential Solutions
           • Sort Of
    4. Introduction: Hawaiian Pidgin Video
    5. Plan of Action
       • Background
       • Where We Stand
       • Historical Context
       • Where We May Be Going
       • Potential Solutions
    6. Background
       • Regional Differences In Word Choice
       MrEverything6's Tweet, Dallas, Texas Region
       coke - Coca-Cola or soft drink in general?
       Coca-Cola Probably Wants To Know
    7. Background
       • Regional Differences In Pronunciation
       • More Than Just Drawl
       pin
       Is that: Pin a tail on the donkey. -OR- Give me a 'pin' to write with.
    8. Plan of Action
       • Background
       • Where We Stand
       • Historical Context
       • Where We May Be Going
       • Potential Solutions
    9. Where We Stand
    10. Where We Stand
    11. Detailed Dialectical Map
        http://aschmann.net/AmEng/
    12. Where We Stand
        Wait! Isn't This All Just Poor English?
        They Don't Speak The King's English!
        1) America Doesn't Have A King
    13. Where We Stand
        Wait! Isn't This All Just Poor English?
        2) English Doesn't Have An Authority Like:
        • French: L'Académie française
        • Spanish: Asociación de Academias de la Lengua Española
        • Numerous Others: http://en.wikipedia.org/wiki/List_of_language_regulators
    14. Where We Stand
        Who Is Right? Everyone
        • Prescriptive Linguistics: Tell You What Is Right
        • Descriptive Linguistics: Describe How You Communicate
        Trying To Sell More Widgets? Probably Descriptive Is Best
    15. Where We Stand
        Selected American English Dialects:
        • New England
        • Northern
        • North Midland
        • South Midland
        • NYC
        • Western
        • AAVE
        • Hawaiian Pidgin
    16. Plan of Action
        • Background
        • Where We Stand
        • Historical Context
        • Where We May Be Going
        • Potential Solutions
    17. Historical Context
        Linguists Thought TV Would Make Us All Sound The Same
        Think Tom Brokaw
        Area of 'Standard American English':
        • Not Overly Large
        • Not Largely Populated
    18. Historical Context
        Been To Wisconsin? Seen Fargo?
        Biggest Change In Spoken English Since 1750
        Going On Right Now - After TV
        'Oh yeah? Yeah'
    19. Historical Context
        Portions Of America Experience Some or All of Northern Cities Vowel Shift
    20. Historical Context
        Sum This Up: People In The Northern Cities Region Are Producing A Very Different Sounding English From Other Dialects
    21. Historical Context
        America Has Been Multi-Lingual Since July 9, 1776
    22. Plan of Action
        • Background
        • Where We Stand
        • Historical Context
        • Where We May Be Going
        • Potential Solutions
    23. Where We May Be Going
    24. Where We May Be Going
        ~74% of Americans Live In A Megaregion
        Megaregions Tied To Existing Dialect Regions
    25. Where We May Be Going
        William Labov, PhD.
        Professor of Linguistics, University of Pennsylvania
        http://www.ling.upenn.edu/~wlabov/
        Pretty Much The Authority on American English Dialects
        'And instead of getting a pepper-and-salt effect, we find very clear and sharp divisions between the dialects of the United States, which are getting more different from each other as time goes on.'
    26. Plan of Action
        • Background
        • Where We Stand
        • Historical Context
        • Where We May Be Going
        • Potential Solutions
    27. Potential Solutions
        • One American Dialect Is Unique In Geography:
        • African-American Vernacular English (AAVE)
        • Not In A Geographically Contiguous Region
    28. Potential Solutions
        Center For Applied Linguistics.
        "Thats the way baseball go."
    29. Potential Solutions
        Correct the Spelling & Grammar

            import enchant
            from nltk.metrics import edit_distance

            class SpellingReplacer(object):
                def __init__(self, dict_name='en', max_dist=2):
                    self.spell_dict = enchant.Dict(dict_name)
                    # Keep the caller's threshold instead of hard-coding 2.
                    self.max_dist = max_dist

                def replace(self, word):
                    # Leave correctly spelled words untouched.
                    if self.spell_dict.check(word):
                        return word
                    suggestions = self.spell_dict.suggest(word)
                    # Accept the top suggestion only if it is close enough.
                    if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
                        return suggestions[0]
                    else:
                        return word
    30. Potential Solutions
        Example 1
        well im gonna go so i’ll talk to u lata 1
        Corrected Example 1
        Well mi Donna go so I'll talk to U late
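One way to see why the corrector picks the wrong words is to look at the edit distances involved. The sketch below is a plain-Python Levenshtein distance, used here as a stand-in for `nltk.metrics.edit_distance`; the function name `levenshtein` and the word pairs are chosen for illustration only. It suggests that the slang forms sit within the `max_dist=2` threshold of unrelated dictionary words.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

# Word pairs taken from the example slide (slang form, candidate word).
pairs = [("lata", "late"), ("lata", "later"), ("gonna", "Donna"), ("im", "mi")]
distances = {pair: levenshtein(*pair) for pair in pairs}

for (src, dst), dist in distances.items():
    print(src, "->", dst, "distance", dist)
```

Here "lata" is closer to "late" than to the intended "later", so a distance threshold alone cannot recover the meaning.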
    31. Potential Solutions
        Build Out a Dictionary of Words
        Regex Match and Replace

            proper_words = {
                'hater': ['enemy', 'jealous individual', 'not friend'],
                'coke': ['coke', 'soda', 'pop'],
            }

        Which Region?
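The "Which Region?" question could be handled by keying the dictionary on region as well as on the word. The following is a minimal sketch under that assumption; the table `REGIONAL_SENSES`, the function `normalize`, and the region labels are hypothetical and not part of the talk.

```python
import re

# Hypothetical sense table: word -> {region: replacement}.
# "default" is the fallback when the region is unknown.
REGIONAL_SENSES = {
    "coke": {"Dallas, Texas": "soft drink", "default": "Coca-Cola"},
}

# One pattern matching any known word as a whole word, ignoring case.
WORD_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, REGIONAL_SENSES)) + r")\b",
    re.IGNORECASE,
)

def normalize(text, region):
    """Replace each known word with the sense recorded for the given region."""
    def swap(match):
        senses = REGIONAL_SENSES[match.group(0).lower()]
        return senses.get(region, senses["default"])
    return WORD_RE.sub(swap, text)

dallas = normalize("want a coke", "Dallas, Texas")
elsewhere = normalize("want a coke", "Boston")
print(dallas)
print(elsewhere)
```

The region itself still has to come from somewhere, for example from geotagged tweets.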
    32. Potential Solutions
        Example 2
        well i gotta go, i’ll talk to you later aight bye 1
    33. Potential Solutions

            import re

            # (pattern, replacement) pairs applied in order.
            replacement_patterns = [
                (r'gotta', 'got to'),
                (r"i'll", 'I will'),
                ('aight', 'all right')
            ]

            class RegexReplacer(object):
                def __init__(self, patterns=replacement_patterns):
                    # Compile each pattern once up front.
                    self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

                def replace(self, text):
                    s = text
                    for (pattern, repl) in self.patterns:
                        (s, count) = re.subn(pattern, repl, s)
                    return s
    34. Potential Solutions
        Example 2
        well i gotta go, i’ll talk to you later aight bye 1
        well i got to go, I will talk to you later All right Bye 1 (!?)
    35. Potential Solutions
        Example 2
        well i got to go, I will talk to you later All right Bye 1 (!?)
        Here '1' has the concept of: I understand
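A regex replacer of this kind is sensitive to how the patterns are anchored. The sketch below is one possible variation, not the code from the talk: it adds word boundaries, ignores case, and accepts both the straight and the curly apostrophe. The names `REPLACEMENTS` and `expand` are assumptions made for the example.

```python
import re

# Word-boundary anchored patterns; \u2019 is the curly apostrophe seen in tweets.
REPLACEMENTS = [
    (r"\bgotta\b", "got to"),
    (r"\bi['\u2019]ll\b", "I will"),
    (r"\baight\b", "all right"),
]
COMPILED = [(re.compile(pattern, re.IGNORECASE), repl)
            for pattern, repl in REPLACEMENTS]

def expand(text):
    """Apply every replacement in order and return the rewritten text."""
    for pattern, repl in COMPILED:
        text = pattern.sub(repl, text)
    return text

expanded = expand("well i gotta go, i\u2019ll talk to you later aight bye")
print(expanded)

# Without the boundaries, 'aight' also matches inside longer words.
unanchored = re.sub("aight", "all right", "straight ahead")
print(unanchored)
```

The second print shows the failure mode that anchoring avoids: the unanchored pattern rewrites "straight" as well. The trailing '1' from the slide would still need its own handling.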
    36. Potential Solutions
        Solution? Bayesian Prediction Using a Custom Corpus
        First Step: Tag Existing Data

            import nltk.data

            # Punkt is NLTK's pre-trained sentence tokenizer.
            tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

            def tokenize(para):
                # Print the list of sentences found in the paragraph.
                print(tokenizer.tokenize(para))
    37. Potential Solutions
        Solution? Bayesian Prediction Using a Custom Corpus
        Oo shit she called I hit ignored..neva pick up on da first call..playa rule number 23 lol
        Tokenized as:
        'Oo shit she called I hit ignored..neva pick up on da first call..playa rule number 23 lol'
        So lots of custom work to be done...
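Part of that custom work could be a tokenizer that treats runs of dots as separators instead of expecting standard sentence punctuation. The following is a rough sketch using only the standard `re` module; the pattern `TOKEN_RE` and the function `tweet_tokens` are illustrative assumptions, and the input is a fragment of the tweet above.

```python
import re

# Words (optionally prefixed by # or @, optionally with an apostrophe part),
# runs of two or more dots, or any other single punctuation character.
TOKEN_RE = re.compile(r"[#@]?\w+(?:'\w+)?|\.{2,}|[^\w\s]")

def tweet_tokens(text):
    """Split informal text into word, dot-run, and punctuation tokens."""
    return TOKEN_RE.findall(text)

fragment = "ignored..neva pick up on da first call..playa rule number 23 lol"
tokens = tweet_tokens(fragment)
print(tokens)

# The dot runs can also be used as rough clause boundaries.
clauses = re.split(r"\.{2,}", fragment)
print(clauses)
```

This only separates the tokens; mapping forms such as "neva" or "da" to standard words is a separate step.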
    38. Potential Solutions
        _andBeautyKills: – after tonight, don’t leave your boy roun’ me, umma #true playa fareal.
        Local To SF:
        Neecy89: This african boy jus started askin me hella questions idk if he was tryin to be nice or tryna kill me lol
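The Bayesian prediction idea can be illustrated with a very small naive Bayes classifier that guesses a region label from the words in a message. This is a sketch under strong assumptions: the class `NaiveBayes` is written from scratch for the example, and the training sentences and the labels "SF" and "other" are made-up toy data, not a real corpus.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of examples
        self.vocab = set()

    def train(self, label, tokens):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy, made-up training examples for illustration only.
model = NaiveBayes()
model.train("SF", "that was hella good".split())
model.train("SF", "hella questions today".split())
model.train("other", "that was very good".split())
model.train("other", "so many questions today".split())

guess_hella = model.predict("hella tired".split())
guess_plain = model.predict("very good".split())
print(guess_hella, guess_plain)
```

A usable model would need a tagged corpus of real tweets per region, which is the "Tag Existing Data" step named earlier.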
    39. Potential Solutions
        Geographic Indexing
        SimpleGeo

            import simplegeo.shared, simplegeo.places
            from simplegeo.shared import Feature

            client = simplegeo.places.Client('your-oauth-token', 'your-oauth-secret')
            properties = {"province": "CA", "city": "San Francisco", "name": "SimpleGeo SF",
                          "country": "US", "phone": "+1 415 626 1375",
                          "address": "41 Decatur St", "postcode": "94103"}
            f = simplegeo.places.Feature((37.772392, -122.405752), properties=properties)
            client.add_feature(f)
            'SG_5uZpvipNjVaSbbDv5bvZaa_37.772392_-122.405752@1291847366'
    40. Potential Solutions
        Geographic Indexing
        SimpleGeo: Queries

            import simplegeo.places

            def start(lon, lat):
                # Credentials file: OAuth token on the first line, secret on the second.
                oauth, secret = open('/home/michael/.simplegeo', 'r').read().strip().split('\n')
                client = simplegeo.places.Client(oauth, secret)
                results = client.search(lon, lat)
                return results

            def search(lon, lat, tweet):
                results = start(lon, lat)
                # Report every nearby place whose name appears as a word in the tweet.
                for word in tweet.split():
                    for i in results:
                        data = i.to_dict()
                        if word == data['properties']['name']:
                            print(data['name'], word)
    41. Potential Solutions: SimpleGeo-Tools

            import simplegeo.places
            import simplegeo.context

            class SimpleGeoAuth(object):
                def __init__(self):
                    # Credentials file: OAuth token on the first line, secret on the second.
                    self.oauth, self.secret = open('/home/michael/.simplegeo', 'r').read().strip().split('\n')
                    self.places_client = simplegeo.places.Client(self.oauth, self.secret)
                    self.context_client = simplegeo.context.Client(self.oauth, self.secret)

                def SimpleGeoContextualQuery(self, lat, lon, text):
                    geo_results = self.places_client.search(lat, lon)
                    # Return the first nearby place whose name appears as a word in the text.
                    for word in text.split():
                        for geo_result in geo_results:
                            data = geo_result.to_dict()
                            if word == data['properties']['name']:
                                return data['name'], word

                def SimpleGeoContextQuery(self, lat, lon):
                    context_results = self.context_client.get_context(lat, lon)
                    return context_results
    42. Potential Solutions: Connect the APIs
    43. References
        Jacob Perkins: NLTK Master Ninja
        Python Text Processing with NLTK 2.0 Cookbook
        https://www.packtpub.com/python-text-processing-nltk-20-cookbook/book
        http://streamhacker.com/
        Eisenstein, J., O'Connor, B., Smith, N., and Xing, E. (2010). A Latent Variable Model for Geographic Lexical Variation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, October 2010.
        Cheng, Z., Caverlee, J., and Lee, K. (2010). You are where you tweet: a content-based approach to geo-locating twitter users. In CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management.
    44. References
        Repustate: Sentiment Analysis API
        http://repustate.com/
        Rapleaf Personalization API
        https://www.rapleaf.com/
        SimpleGeo GIS Solution API
        http://simplegeo.com/
    45. Michael D. Healy
        [email_address]
        http://michaeldhealy.com
        @MichaelDHealy
        SimpleGeo-Tools: https://github.com/michaeldhealy/SimpleGeo-Tools
