American English Regional Dialects: Changing Speech Patterns, Changing Online Measurement. Michael D. Healy [email_address] http://michaeldhealy.com @MichaelDHealy
Michael D. Healy
Plan of Action
Introduction: Hawaiian Pidgin Video
Background: MrEverything6's Tweet, Dallas, Texas Region. 'coke': Coca-Cola or soft drink in general? Coca-Cola Probably Wants To Know.
Background: 'pin'. Is that: Pin a tail on the donkey. -OR- Give me a 'pin' to write with.
Where We Stand
Detailed Dialect Map: http://aschmann.net/AmEng/
Where We Stand: Wait! Isn't This All Just Poor English? They Don't Speak The King's English! 1) America Doesn't Have A King.
Where We Stand: Wait! Isn't This All Just Poor English? 2) English Doesn't Have An Authority Like French: L'Académie française; Spanish: Asociación de Academias de la Lengua Española; Numerous Others: http://en.wikipedia.org/wiki/List_of_language_regulators
Where We Stand: Who Is Right? Everyone. Prescriptive Linguistics: Tells You What Is Right. Descriptive Linguistics: Describes How You Communicate. Trying To Sell More Widgets? Descriptive Is Probably Best.
Historical Context: Linguists Thought TV Would Make Us All Sound The Same. Think Tom Brokaw. The Area of 'Standard American English' Is Not Overly Large and Not Heavily Populated.
Historical Context: Been To Wisconsin? Seen Fargo? The Biggest Change In Spoken English Since 1750 Is Going On Right Now, After TV. 'Oh yeah? Yeah'
Historical Context: Portions Of America Are Experiencing Some or All of the Northern Cities Vowel Shift.
Historical Context: To Sum This Up: People In The Northern Cities Region Are Producing An English That Sounds Very Different From Other Dialects.
Historical Context: America Has Been Multilingual Since July 9, 1776.
Where We May Be Going
Where We May Be Going: ~74% of Americans Live In A Megaregion. Megaregions Are Tied To Existing Dialect Regions.
Where We May Be Going: William Labov, PhD, Professor of Linguistics, University of Pennsylvania, http://www.ling.upenn.edu/~wlabov/ Pretty Much The Authority on American English Dialects: 'And instead of getting a pepper-and-salt effect, we find very clear and sharp divisions between the dialects of the United States, which are getting more different from each other as time goes on.'
Potential Solutions
Potential Solutions: Center For Applied Linguistics. "Thats the way baseball go."
Potential Solutions: Correct the Spelling & Grammar
    import enchant
    from nltk.metrics import edit_distance

    class SpellingReplacer(object):
        def __init__(self, dict_name='en', max_dist=2):
            self.spell_dict = enchant.Dict(dict_name)
            # Honor the constructor argument instead of hard-coding 2.
            self.max_dist = max_dist

        def replace(self, word):
            # Words the dictionary already accepts pass through unchanged.
            if self.spell_dict.check(word):
                return word
            suggestions = self.spell_dict.suggest(word)
            # Accept the top suggestion only if it is close to the original.
            if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
                return suggestions[0]
            return word
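To see why an edit-distance cutoff can pick the wrong word for dialect spellings, here is a minimal sketch that needs neither enchant nor NLTK. The word list is a hypothetical stand-in for a real spelling dictionary, and `levenshtein` and `closest` are illustrative names, not part of either library.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical mini-dictionary; a real spell checker has far more entries.
WORDS = ["late", "later", "going", "you"]


def closest(word, words=WORDS, max_dist=2):
    """Return the nearest dictionary word within max_dist, else the word itself."""
    best = min(words, key=lambda w: levenshtein(word, w))
    return best if levenshtein(word, best) <= max_dist else word
```

With this toy list, 'lata' is one edit from 'late' but two from 'later', so the closest-match rule prefers 'late', which mirrors the miscorrection in Example 1 of this deck.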
Potential Solutions: Example 1: well im gonna go so i’ll talk to u lata 1. Corrected Example 1: Well mi Donna go so I'll talk to U late
Potential Solutions: Build Out a Dictionary of Words, Regex Match and Replace
    proper_words = {
        'hater': ['enemy', 'jealous individual', 'not friend'],
        'coke': ['coke', 'soda', 'pop'],
    }
Which Region?
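The "Which Region?" question suggests keying the dictionary on region as well as on the word. The sketch below is one possible shape for that lookup; the region labels and sense strings are hypothetical placeholders, not survey data, and `normalize` is an illustrative name.

```python
# Hypothetical region labels and senses, for illustration only.
REGIONAL_SENSES = {
    "coke": {"south": "soft drink (generic)", "default": "Coca-Cola"},
}


def normalize(word, region):
    """Map a word to its sense for a region, falling back to a default sense."""
    senses = REGIONAL_SENSES.get(word.lower())
    if senses is None:
        return word  # unknown words pass through unchanged
    return senses.get(region, senses["default"])
```

Keeping a "default" entry per word means a tweet with no usable location still gets a sense, at the cost of guessing.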
Potential Solutions: Example 2: well i gotta go, i’ll talk to you later aight bye 1
Potential Solutions:
    import re

    replacement_patterns = [
        (r'gotta', 'got to'),
        (r"i'll", 'I will'),
        (r'aight', 'all right'),
    ]

    class RegexReplacer(object):
        def __init__(self, patterns=replacement_patterns):
            # Compile each pattern once up front.
            self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

        def replace(self, text):
            s = text
            # Apply every replacement in order.
            for (pattern, repl) in self.patterns:
                (s, count) = re.subn(pattern, repl, s)
            return s
Potential Solutions: Example 2: well i gotta go, i’ll talk to you later aight bye 1 becomes: well i got to go, I will talk to you later All right Bye 1 (!?)
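One caveat with bare patterns such as 'aight' is that they also match inside longer words ('straight' contains 'aight'). A hedged variation, using only the standard `re` module, anchors each pattern on word boundaries and accepts both straight and curly apostrophes. The name `expand` is illustrative.

```python
import re

# Word-boundary anchors keep 'aight' from matching inside 'straight'.
PATTERNS = [
    (re.compile(r"\bgotta\b", re.IGNORECASE), "got to"),
    (re.compile(r"\bi['’]ll\b", re.IGNORECASE), "I will"),
    (re.compile(r"\baight\b", re.IGNORECASE), "all right"),
]


def expand(text):
    """Apply each contraction/slang replacement in order."""
    for pattern, repl in PATTERNS:
        text = pattern.sub(repl, text)
    return text
```

The trailing '1' is left alone here too; no pattern-level rewrite can tell whether it is a number or a sign-off.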
Potential Solutions: Example 2: well i got to go, I will talk to you later All right Bye 1 (!?) Here '1' has the concept of: I understand.
Potential Solutions: Solution? Bayesian Prediction Using a Custom Corpus. First Step: Tag Existing Data
    import nltk.data

    # Assumes the NLTK 'punkt' tokenizer data is installed locally.
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

    def tokenize(para):
        # Punkt splits a paragraph into sentences.
        print(tokenizer.tokenize(para))
Potential Solutions: Solution? Bayesian Prediction Using a Custom Corpus. Input: Oo shit she called I hit ignored..neva pick up on da first call..playa rule number 23 lol. Tokenized as: 'Oo shit she called I hit ignored..neva pick up on da first call..playa rule number 23 lol'. So lots of custom work to be done.
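As one example of the custom work involved, a small regex splitter can treat a run of two or more dots as a boundary, which the stock sentence tokenizer did not do for the tweet above. This is a rough sketch using only the standard library; `BOUNDARY` and `split_segments` are illustrative names, and real tweets would need many more rules.

```python
import re

# Boundary = a run of 2+ dots (plus trailing spaces), or whitespace after . ! ?
BOUNDARY = re.compile(r"\.{2,}\s*|(?<=[.!?])\s+")


def split_segments(text):
    """Split informal text into rough sentence-like segments."""
    return [seg.strip() for seg in BOUNDARY.split(text) if seg and seg.strip()]
```

Applied to the tweet on the slide, this yields three segments instead of one.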
Potential Solutions: _andBeautyKills: – after tonight, don’t leave your boy roun’ me, umma #true playa fareal. Local To SF: Neecy89: This african boy jus started askin me hella questions idk if he was tryin to be nice or tryna kill me lol
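Regional markers such as 'hella' are the kind of signal a Bayesian classifier trained on a tagged corpus could pick up. Below is a minimal multinomial naive Bayes sketch in plain Python with add-one smoothing. The training sentences and region labels are hypothetical, hand-written examples, and the class is not the talk's implementation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # log P(label) + sum of log P(word | label), add-one smoothed
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical hand-tagged examples; a real corpus would need far more data.
nb = NaiveBayes()
nb.train("that party was hella fun", "sf_bay")
nb.train("hella traffic on the bridge", "sf_bay")
nb.train("yall want a coke", "texas")
nb.train("fixin to grab a coke yall", "texas")
```

With only four training sentences the prediction is driven entirely by a couple of marker words, which is why tagging a much larger corpus is the first step.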
Potential Solutions: Geographic Indexing: SimpleGeo
    import simplegeo.shared, simplegeo.places
    from simplegeo.shared import Feature

    client = simplegeo.places.Client('your-oauth-token', 'your-oauth-secret')
    properties = {"province": "CA", "city": "San Francisco", "name": "SimpleGeo SF",
                  "country": "US", "phone": "+1 415 626 1375", "address": "41 Decatur St",
                  "postcode": "94103"}
    f = simplegeo.places.Feature((37.772392, -122.405752), properties=properties)
    client.add_feature(f)
    # Output shown on the slide:
    # 'SG_5uZpvipNjVaSbbDv5bvZaa_37.772392_-122.405752@1291847366'
Potential Solutions: Geographic Indexing: SimpleGeo: Queries
    import simplegeo.places

    def start(lon, lat):
        # The slide's split('') raises ValueError; whitespace-separated
        # credentials (token, then secret) are assumed here.
        oauth, secret = open('/home/michael/.simplegeo', 'r').read().strip().split()
        client = simplegeo.places.Client(oauth, secret)
        results = client.search(lon, lat)
        return results

    def search(lon, lat, tweet):
        results = start(lon, lat)
        # Compare every word in the tweet with every nearby place name.
        for word in tweet.split():
            for i in results:
                data = i.to_dict()
                if word == data['properties']['name']:
                    print(data['name'], word)
Potential Solutions: SimpleGeo-Tools
    import simplegeo.places
    import simplegeo.context

    class SimpleGeoAuth(object):
        def __init__(self):
            # Whitespace-separated credentials are assumed, as above.
            self.oauth, self.secret = open('/home/michael/.simplegeo', 'r').read().strip().split()
            self.places_client = simplegeo.places.Client(self.oauth, self.secret)
            self.context_client = simplegeo.context.Client(self.oauth, self.secret)

        def SimpleGeoContextualQuery(self, lat, lon, text):
            geo_results = self.places_client.search(lat, lon)
            # Return the first place whose name matches a word in the text.
            for word in text.split():
                for geo_result in geo_results:
                    data = geo_result.to_dict()
                    if word == data['properties']['name']:
                        return data['name'], word

        def SimpleGeoContextQuery(self, lat, lon):
            context_results = self.context_client.get_context(lat, lon)
            return context_results
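Comparing single words against place names cannot match a multi-word name such as 'SimpleGeo SF'. The sketch below shows one way to match whole phrases, case-insensitively, against records shaped like the `to_dict()` output used above. It makes no network calls and assumes nothing about the SimpleGeo client; the `places` list and the function name are hypothetical.

```python
def find_place_mentions(text, places):
    """Return names from `places` that appear as whole-word phrases in `text`."""
    tokens = text.lower().split()
    found = []
    for place in places:
        name = place["properties"]["name"]
        name_tokens = name.lower().split()
        n = len(name_tokens)
        # Slide a window of n tokens across the text and compare.
        if name_tokens and any(tokens[i:i + n] == name_tokens
                               for i in range(len(tokens) - n + 1)):
            found.append(name)
    return found


# Hypothetical records mimicking the {'properties': {'name': ...}} shape.
places = [
    {"properties": {"name": "SimpleGeo SF"}},
    {"properties": {"name": "Decatur"}},
]
```

Punctuation attached to a word (for example 'SF!') would still defeat this match, so a tokenizer pass before matching is a sensible next step.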
Potential Solutions: Connect the APIs
References:
Jacob Perkins (NLTK Master Ninja). Python Text Processing with NLTK 2.0 Cookbook. https://www.packtpub.com/python-text-processing-nltk-20-cookbook/book http://streamhacker.com/
Eisenstein, J., O'Connor, B., Smith, N., and Xing, E. (2010). A Latent Variable Model for Geographic Lexical Variation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, October 2010.
Cheng, Z., Caverlee, J., and Lee, K. (2010). You are where you tweet: a content-based approach to geo-locating Twitter users. CIKM '10: Proceedings of the 19th ACM International Conference on Information and Knowledge Management.
Repustate: Sentiment Analysis API. http://repustate.com/
Rapleaf: Personalization API. https://www.rapleaf.com/
SimpleGeo: GIS Solution API. http://simplegeo.com/
Michael D. Healy [email_address] http://michaeldhealy.com @MichaelDHealy. SimpleGeo-Tools: https://github.com/michaeldhealy/SimpleGeo-Tools


The Linguistics of Twitter - PyCon 2011 Presentation


Editor's Notes

  1. Potential Solutions: Methodology via Peter Norvig, Beautiful Data, Ch. 14
  2. German translation of the Declaration of Independence, 7/9/1776
  3. But What Can We Use As A Guide?
  4. Ebonics is not the correct terminology.
  5. Center For Applied Linguistics: "Like other dialects of English, AAE is a regular, systematic language variety that contrasts with other dialects in terms of its grammar, pronunciation, and vocabulary."