Natural Language Processing
Qi Zhang
Agenda
• Natural Language Processing Background
• Methods used in NLP
• Applications
• Sentiment Analysis
• Usage in TripAdvisor
• Challenges
What is Natural Language Processing?
Text -> NLP -> Structured Data -> Applications
• Machine Reading
Methods in NLP
• Automatic Summarization:
• There are basically two types of auctions.
• There are two types of auctions.
• Part-of-speech Tagging: classify and label words
• They refuse to permit us to obtain the refuse permit
• [('They', 'pronoun'), ('refuse', 'verb'), ('to', 'particle'), ('permit', 'verb'), …]
• Entity Extraction:
• People, organizations, locations, times, dates, prices, …
• Relation Extraction:
• Located in, employed by, part of, married to, ...
Applications
• Machine Translation: Google Translate
• An electric guitar and bass player stand off…...
• fish as Pacific salmon and striped bass
• Email Spam Filters: Gmail
• Naive Bayes classifier is used to identify spam/ham emails
• P(spam|word) = P(word|spam)*P(spam)/P(word)
• Question-Answering: Amazon’s Alexa, Google Home
• Amazon Lex: AI API used in Amazon’s Alexa
• Sentiment Analysis: Opinion Mining
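The Bayes rule on the spam-filter bullet can be worked through on a toy corpus. This is a minimal sketch, not how Gmail works: the four messages and the function name `p_spam_given_word` are made up for illustration, and probabilities are estimated from simple message counts.

```python
# Toy labeled corpus (made-up messages, for illustration only).
messages = [
    ("win a free prize now", "spam"),
    ("free money win big", "spam"),
    ("lunch at noon tomorrow", "ham"),
    ("are you free for lunch", "ham"),
]


def p_spam_given_word(word, corpus):
    """Apply P(spam|word) = P(word|spam) * P(spam) / P(word) using message counts."""
    n_total = len(corpus)
    spam = [text.split() for text, label in corpus if label == "spam"]
    n_with_word = sum(1 for text, _ in corpus if word in text.split())
    if n_with_word == 0 or not spam:
        return 0.0  # word never seen (or no spam at all): no evidence for spam
    p_spam = len(spam) / n_total
    p_word_given_spam = sum(1 for tokens in spam if word in tokens) / len(spam)
    p_word = n_with_word / n_total
    return p_word_given_spam * p_spam / p_word


print(p_spam_given_word("win", messages))    # appears only in spam
print(p_spam_given_word("free", messages))   # appears in spam and ham
print(p_spam_given_word("lunch", messages))  # appears only in ham
```

A real Naive Bayes filter multiplies such per-word evidence over all words in a message and smooths the counts so unseen words do not force a probability of zero.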
Sentiment Analysis
• What is it?
• Determine the emotional tone behind a series of words
• Uses
• Political Polling: 2012 Presidential Election
• Business Purpose: TripAdvisor
Sentiment Analysis
Problem: How to identify whether a tweet is positive or negative
• Lexical Analysis
• ML Based Approach
Lexical Analysis
Input Tweet -> Tokenizer
Score: 0
Tokenization
• Input: Friends, Romans, Countrymen, lend me your ears;
• Output: Friends Romans Countrymen lend me your ears
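The tokenization step can be sketched with a regular expression. This is one simple way to do it, assuming that tokens are runs of word characters and that punctuation is discarded; the function name `tokenize` is illustrative.

```python
import re


def tokenize(text):
    """Split text into word tokens, dropping punctuation and whitespace."""
    # \w+ matches runs of letters, digits, and underscores;
    # the optional group keeps apostrophes inside words such as "don't".
    return re.findall(r"\w+(?:'\w+)*", text)


print(tokenize("Friends, Romans, Countrymen, lend me your ears;"))
# ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']
```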
Lexical Analysis
List of Tokens -> Word Matching against Pre-tagged Dictionary -> Match?
• Match on a positive word: Increment Score (Score++)
• Match on a negative word: Decrement Score (Score--)
Example
• “Beautiful impressionist paintings and outstanding sculptures. For
me, the original buildings were the best bit! The renovations and
creation of an amazing museum are a work of art in themselves.
Loved the paintings although a bit disappointed with the low number
of Van Gogh.” 😄
• Score: 0.301644
Example
beautiful, impressionist, and, outstanding, …, best, …, amazing, …, love, …, disappoint, …
• Pre-Tagged Dictionary
• Positive: [beautiful, wonderful, best, outstanding, amazing, love, …]
• Negative: [disappoint, sad, unhappy, …]
• Score: 0.301644
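The dictionary-matching idea can be sketched as follows. The word lists come from the slide; everything else is an assumption: prefix matching stands in for stemming, and the score is normalized by the number of tokens. The slides do not say how 0.301644 was computed, so this sketch does not try to reproduce it.

```python
import re

# Pre-tagged dictionary, using the example words from the slide.
POSITIVE = {"beautiful", "wonderful", "best", "outstanding", "amazing", "love"}
NEGATIVE = {"disappoint", "sad", "unhappy"}


def matches(token, words):
    """Crude stand-in for stemming: 'loved' matches 'love', 'disappointed' matches 'disappoint'."""
    return any(token.startswith(word) for word in words)


def lexical_score(text):
    """Increment for each positive match, decrement for each negative match."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = 0
    for token in tokens:
        if matches(token, POSITIVE):
            score += 1
        elif matches(token, NEGATIVE):
            score -= 1
    # Assumption: normalize by token count so long and short texts are comparable.
    return score / len(tokens) if tokens else 0.0


review = ("Beautiful impressionist paintings and outstanding sculptures. "
          "Loved the paintings although a bit disappointed.")
print(lexical_score(review))  # positive: three positive matches, one negative
```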
Machine Learning Based Approach
Load & Pre-Process Data -> Extract Features -> Train Model -> Evaluate Model
ML Based Approach
• Load Data
• 25,000 labeled training tweets
• Another 25,000 validation tweets
• 50,000 test tweets
ML Based Approach
• Pre-Process Data:
• Remove punctuation: “I like this one!!!!!” -> “I like this one”
• Filter out stopwords: “this”, “the”
• Normalize each contiguous run of whitespace to a single space ‘ ’: “ goodd” -> “goodd”
• Convert to lowercase: “Upper” -> “upper”
• Stemming: “Learning” -> “learn”, “Done” -> “do”
• Tokenization
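The pre-processing steps above can be chained into one function. This is a sketch under stated assumptions: the stopword set holds only the two words named on the slide, `crude_stem` is a toy suffix stripper rather than a real stemmer, and lowercasing is done before stopword filtering (a different order from the slide) so that “This” is filtered as well.

```python
import re
import string

STOPWORDS = {"this", "the"}  # only the two stopwords named on the slide


def crude_stem(token):
    """Toy suffix stripper (an assumption, not a real stemmer such as Porter's)."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    return token


def preprocess(text):
    # Remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Normalize each run of whitespace to one space and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Convert to lowercase.
    text = text.lower()
    # Tokenize on single spaces.
    tokens = text.split(" ") if text else []
    # Filter out stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Stem what is left.
    return [crude_stem(t) for t in tokens]


print(preprocess("I like this one!!!!!"))   # ['i', 'like', 'one']
print(preprocess("Learning   the  Upper"))  # ['learn', 'upper']
```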
ML Based Approach
• Extract Features
• Use Word2Vec model to map each word into an n-dimensional vector
• Each element of the vector can be viewed as a feature
What Is Word2Vec Model
• Use:
• Map each word to a high-dimensional (> 100) vector
• Input: a large corpus of text
• Output: a vector space with one vector per word: w = (w1, w2, …, wn)
• Given a word, get the similar words
• Advantage:
• Preserves the semantic relationships between words
What Is Word2Vec Model
vec(“king”) – vec(“man”) + vec(“woman”) ≈ vec(“queen”)
[Figure: word vectors for man, woman, king, and queen]
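The vector arithmetic can be tried on hand-made vectors. The five 2-D vectors below are invented so that the analogy works out exactly (axes: royalty, maleness); real Word2Vec vectors are learned from text and only satisfy the relation approximately. The function names are illustrative.

```python
import math

# Hand-made 2-D vectors (axes: royalty, maleness); not learned from data.
vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.5, 0.5],
}


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman"))  # queen
```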
What Is Word2Vec Model
• Use: Map each word to a high-dimensional (> 100) vector
• Input: a large corpus of text
• Output: a vector space with one vector per word: w = (w1, w2, …, wn)
• Advantage:
• Preserves the semantic relationships between words
• Feature:
• Measures how close words or phrases are to each other
• The angle between the vectors of two words indicates how similar the words are
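The angle between two word vectors follows from their dot product and lengths. A minimal sketch, with an illustrative function name `angle_degrees` and made-up input vectors; a small angle means similar direction, and 90 degrees means unrelated.

```python
import math


def angle_degrees(a, b):
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos = dot / (norm_a * norm_b)
    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    cos = max(-1.0, min(1.0, cos))
    return math.degrees(math.acos(cos))


print(angle_degrees([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(angle_degrees([1.0, 2.0], [2.0, 4.0]))   # same direction
print(angle_degrees([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```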
How To Train A Word2Vec Model?
• Build the model using Gensim, an open-source Python toolkit
• model = Word2Vec(tweets, size=200, window=2, min_count=5, workers=4)
• Note: Gensim 4.0 and later name the size parameter vector_size
The quick brown fox jumps over the lazy dog.
How To Train A Word2Vec Model?
Source Text (center word in brackets) -> Training Samples
[The] quick brown fox jumps over the lazy dog -> (the, quick), (the, brown)
The [quick] brown fox jumps over the lazy dog -> (quick, the), (quick, brown), (quick, fox)
The quick [brown] fox jumps over the lazy dog -> (brown, the), (brown, quick), (brown, fox), (brown, jumps)
The quick brown [fox] jumps over the lazy dog -> (fox, quick), (fox, brown), (fox, jumps), (fox, over)
How To Train A Word2Vec Model?
Source Text (center word in brackets) -> Training Samples
[The] quick brown rabbit jumps out of the sink -> (the, quick), (the, brown)
The [quick] brown rabbit jumps out of the sink -> (quick, the), (quick, brown), (quick, rabbit)
The quick [brown] rabbit jumps out of the sink -> (brown, the), (brown, quick), (brown, rabbit), (brown, jumps)
The quick brown [rabbit] jumps out of the sink -> (rabbit, quick), (rabbit, brown), (rabbit, jumps), (rabbit, out)
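The training samples listed for both sentences can be generated mechanically. This sketch assumes whitespace tokenization and a symmetric window of 2, matching window=2 in the Word2Vec call; `skipgram_pairs` is an illustrative name, not a Gensim function.

```python
def skipgram_pairs(sentence, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    tokens = sentence.lower().split()
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


fox_pairs = skipgram_pairs("The quick brown fox jumps over the lazy dog")
rabbit_pairs = skipgram_pairs("The quick brown rabbit jumps out of the sink")

print([p for p in fox_pairs if p[0] == "fox"])
print([p for p in rabbit_pairs if p[0] == "rabbit"])
```

Both “fox” and “rabbit” are paired with “quick”, “brown”, and “jumps”, which is why the trained model places them close together.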
How To Train A Word2Vec Model?
For a given word such as “rabbit”, we get words that appear in similar contexts:
• Input:
• tweet_w2v.most_similar('rabbit')
• Output:
• [(u'fox', 0.7355118989944458), (u'jump', 0.7164269685745239), …]
How To Train A Word2Vec Model?
• Input:
• tweet_w2v.most_similar('good')
• Output:
• [(u'goood', 0.7355118989944458), (u'great', 0.7164269685745239), …]
Word2Vec Usage in TripAdvisor
User browsing sequence: Madrid, Lisbon, Barcelona, Boston
Sentence: “Madrid, Lisbon, Barcelona, Boston”
ML Based Approach
• Train the Model
• Represent each word using Word2Vec
• Combine these word vectors
• Train the classifier
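The three training steps can be sketched end to end on toy data. The slides do not say how the word vectors are combined or which classifier is used, so this sketch assumes averaging and a nearest-centroid classifier; the 2-D word vectors are invented, and labels follow the 1 = positive, 0 = negative convention.

```python
# Hypothetical 2-D word vectors; in practice these come from the trained Word2Vec model.
word_vectors = {
    "good": [0.9, 0.1], "great": [0.8, 0.2], "love": [0.9, 0.2],
    "bad": [0.1, 0.9], "awful": [0.2, 0.8], "hate": [0.1, 0.8],
}


def tweet_vector(tokens):
    """Combine word vectors by averaging (an assumption); unknown words are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0, 0.0]
    return [sum(dim) / len(known) for dim in zip(*known)]


def train_centroids(labeled_tweets):
    """Nearest-centroid training: store the mean tweet vector for each label."""
    sums, counts = {}, {}
    for tokens, label in labeled_tweets:
        vec = tweet_vector(tokens)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, tokens):
    """Return the label whose centroid is closest to the tweet vector."""
    vec = tweet_vector(tokens)
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(vec, centroids[label])),
    )


train = [(["good", "great"], 1), (["love"], 1), (["bad", "awful"], 0), (["hate"], 0)]
centroids = train_centroids(train)
print(predict(centroids, ["great", "love"]))  # 1
print(predict(centroids, ["awful", "hate"]))  # 0
```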
ML Based Approach
• Evaluate the Model
• Use the 50,000 test tweets to assess the model
• Accuracy: 0.78984528240986307
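Accuracy here is the fraction of test tweets whose predicted label equals the true label. A minimal sketch with made-up labels; it does not reproduce the figure reported above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels: 1 = positive, 0 = negative. Four of five predictions match.
print(accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))  # 0.8
```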
Challenges
• Some challenging examples
• “My flight’s been delayed. Brilliant!” ☹️ (Sarcasm)
• “I do not dislike cabin cruisers.” (Negation handling)
• Some promising work, but accuracy is still low
• Contextualized Sarcasm Detection on Twitter - David Bamman and Noah A. Smith
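One simple form of negation handling flips the polarity of every sentiment word that follows a negator, until the next punctuation mark. This is a sketch of that heuristic only; the word lists are made up, and it is not the method of the cited paper.

```python
import re

# Made-up mini lexicon for illustration.
POSITIVE = {"like", "love", "brilliant"}
NEGATIVE = {"dislike", "hate"}
NEGATORS = {"not", "no", "never"}
PUNCTUATION = set(".!?;,")


def score_with_negation(text):
    """Lexicon score where a negator flips polarity until the next punctuation mark."""
    score = 0
    negated = False
    # Keep punctuation as separate tokens so it can close the negation scope.
    for token in re.findall(r"[a-z']+|[.!?;,]", text.lower()):
        if token in PUNCTUATION:
            negated = False
        elif token in NEGATORS:
            negated = True
        else:
            polarity = (token in POSITIVE) - (token in NEGATIVE)
            score += -polarity if negated else polarity
    return score


print(score_with_negation("I do not dislike cabin cruisers."))     # 1
print(score_with_negation("I dislike cabin cruisers."))            # -1
print(score_with_negation("My flight's been delayed. Brilliant!")) # 1
```

The last call shows why sarcasm is harder: the heuristic scores “Brilliant!” as positive because nothing in the words themselves signals the opposite intent.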
• Online course:
• https://www.coursera.org/learn/natural-language-processing
• Open resource:
• https://nlp.stanford.edu/ : Stanford NLP group
• https://arxiv.org/
Thank you!



Editor's Notes

  1. This is an example of tokenization.
  2. Each tweet is labeled 1 when it is positive and 0 when it is negative. The validation tweets are used to tune the model and prevent overfitting. A neural network is used to train the hidden and output layers.
  3. For example, patterns such as “Man is to Woman as King is to Queen” can be generated through algebraic operations on the vector representations of these words: the vector for “Brother” - “Man” + “Woman” produces a result that is closest to the vector for “Sister” in the model. The vector offsets are nearly parallel to each other.
  4. Now that we have some knowledge of Word2Vec, let me continue with how to train a Word2Vec model. The common way is to use Gensim; calling Word2Vec builds a model for us. We feed the model a large corpus of sentences, which is used to build a vocabulary. size is the dimensionality of the feature vectors. min_count ignores all words with a total frequency lower than this value. workers is the number of worker threads used to train the model; because the text corpus is really large, I set it to 4. window is the maximum distance between the current and predicted word within a sentence. How does it work if we set the window size to 2 and the dimension to 200? Let me demonstrate with only one input sentence. Given a specific word in the middle of a sentence (the input word), look at the words nearby. The output probabilities relate to how likely it is to find each vocabulary word near the input word. For example, if you gave the trained network the input word “Soviet”, the output probabilities would be much higher for words like “Union” and “Russia” than for unrelated words like “watermelon” and “kangaroo”.
  5. TripAdvisor recommendations use a Word2Vec model. For example, a user’s browsing sequence is “Madrid, …”, which means this user searched for or browsed Madrid, and later Boston. We can make up a sentence from the user’s browsing sequence; the sentence we feed to the Word2Vec model is “Madrid, Lisbon, …”, just as we did for “The quick brown fox jumps over the lazy dog”. After it is fed many such sentences from different users, the model learns quite well how similar geos are in meaning. Then, after I book a vacation rental in Boston, it will also recommend other places in Spain.
  6. Sarcasm is hard even for people, because it depends on its context. The authors of the cited work argue that the relationship between author and audience is central to understanding sarcasm. Their approach looks at attributes of the author (author features), attributes of the intended recipient of a tweet (audience features), and attributes of the responses to potentially sarcastic tweets (response features). For negation, one approach uses grammatical relations among words to model a sentence and hence determine which words are affected by negation; another uses a static window and punctuation marks to determine the scope of negation. Using natural language processing to detect sarcasm on the internet still has a long way to go and may never be particularly reliable.
  7. Feel free to ask me any questions.