Natural Language Processing
(NLP) techniques for structuring
large volumes of human text data
Alessandra Sozzi, Kimberley Brett
Office for National Statistics
Overview
• Introduction to NLP and context of use within
ONS
• Property data: an example of NLP and
machine learning
• Sentiment analysis of text:
• Automating internal feedback
• Understanding daily public satisfaction
What is Natural Language Processing (NLP)?
• Using computer algorithms and code to
understand, and sometimes classify, large
volumes of unstructured human text.
• Can help to automate analysis previously
done by hand
• Useful in government as there are many free
text fields with rich information
Property websites: Zoopla
Project: Intelligence from housing data
• Supplement address register information to
provide insight for census field staff
• Pilot (Karen Gask): Used Zoopla API to
identify caravan properties
• Caravans: inconsistently recorded in other
data sources
• Natural Language Processing and Machine
learning approaches in Python
Training
• Binary features created from the property
description and property type
• Data split into 80% training, 20% testing
• Machine learning algorithms tested:
Logistic regression, Decision trees, Random forests,
Support Vector Machines
• Evaluation: F1 scores and cross validation
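The feature creation and 80/20 split above can be sketched as follows. This is a minimal illustration only: the keyword list, the property-type check, the helper names and the example listings are all assumptions, since the slides do not list the features actually used.

```python
import random

# Hypothetical keywords; the slides do not say which features were used.
KEYWORDS = ["caravan", "mobile home", "park home", "static"]

def binary_features(description, property_type):
    """Return 0/1 flags: one per keyword, plus one for the property type."""
    text = description.lower()
    flags = [1 if keyword in text else 0 for keyword in KEYWORDS]
    # Assumed property-type flag: 1 if the type mentions "mobile".
    flags.append(1 if "mobile" in property_type.lower() else 0)
    return flags

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out test_fraction of them for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

# Made-up listings: (description, property type, is_caravan label).
listings = [
    ("Static caravan on a quiet residential park", "Mobile/park home", 1),
    ("Two bedroom terraced house near the station", "Terraced house", 0),
    ("Spacious park home with garden", "Mobile/park home", 1),
    ("Modern flat with balcony", "Flat", 0),
    ("Detached bungalow in a village setting", "Bungalow", 0),
]

dataset = [(binary_features(d, t), label) for d, t, label in listings]
train, test = train_test_split(dataset)  # 4 training rows, 1 test row
```

The resulting 0/1 vectors are the kind of input that classifiers such as logistic regression or a support vector machine accept.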
Testing
• Support Vector Machines performed best in
training
• SVM evaluated on the test data, attaining an F1 score of ~0.917
• Of these:
34/51 in exact location on address register
11 in nearby location
6 not on address register – valuable additions
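For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts; the counts passed in are hypothetical and are not the pilot's results. It also turns the breakdown of the 51 properties reported above into percentages.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, from confusion-matrix counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, for illustration only (not taken from the pilot).
example = f1_score(tp=40, fp=5, fn=10)

# Breakdown reported on the slide.
breakdown = {
    "exact location": 34,
    "nearby location": 11,
    "not on address register": 6,
}
total = sum(breakdown.values())
shares = {place: round(100 * count / total, 1) for place, count in breakdown.items()}
```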
Pilot extended
• Acquired a larger Zoopla dataset; using similar
methods, with a focus on the SVM approach
• Census test areas:
Blackpool, Barnsley & Sheffield, Southwark, West
Dorset & South Somerset, Northern Powys
• Further investigation:
• Whether a caravan is a residential or holiday home
• Gated communities and retirement properties.
Issues
• Data not available for the whole of the UK, as not all
properties are advertised via Zoopla
• Not all listings have a description
• Census test areas: other LAs may be more or less
likely to have those property types
• Time needed to acquire the data, clean it, etc.
• Estate agents embellish descriptions
• Spelling: data may have been input in a rush
Sentiment analysis: Projects
• Project with Eurostat: sentiment analysis of
public forums
• Blogs, comments on news sites, social media
• Undertaken by ONS colleagues: Alessandra Sozzi and
Charles Morris
• Internal project:
• Sentiment analysis of feedback responses from
an internal talk
Sentiment analysis
• Type of Natural Language Processing
• Positive or negative sentiment
• Analyse different emotions
• Plutchik’s eight emotions
Anger
Trust
Surprise
Joy
Fear
Disgust
Anticipation
Sadness
Approaches
• Lexicon-based
• Corpus of words rated by sentiment expressed
• Text run through this corpus and given ratings
• Machine learning
• Builds on the lexicon-based approach by learning from
ratings in a labelled training set.
• Clerically reviewed gold standards
• Essential to evaluate performance
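A lexicon-based scorer can be sketched in a few lines. The mini-lexicon and the averaging rule below are assumptions made for illustration; real lexicons such as those listed on the next slide contain thousands of rated words.

```python
import re

# Hypothetical mini-lexicon: word -> sentiment rating.
LEXICON = {"good": 1.0, "great": 1.0, "helpful": 0.5, "bad": -1.0, "boring": -0.5}

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def lexicon_score(text):
    """Average rating of the lexicon words found in the text (0.0 if none)."""
    ratings = [LEXICON[token] for token in tokenize(text) if token in LEXICON]
    return sum(ratings) / len(ratings) if ratings else 0.0

positive = lexicon_score("A great and helpful talk")         # 0.75
neutral = lexicon_score("The room was on the second floor")  # 0.0, no lexicon words
```

A text with no lexicon words scores 0.0, which is why a gold standard is needed to check how often that neutral rating is actually correct.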
Different lexicons
• Many different lexicons, but the following
have been used in our analysis:
• NRC
• Very popular. Contains about 14,000 rated words. Scale
between -1 and 1.
• Bing
• Contains around 6,000 words. Scale between -1 and 1.
• AFINN
• Contains about 4,000 words. Scale from -5 to 5.
• Syuzhet
VADER
• Problem with other lexicons: they do not handle
negations (e.g. "not good") or boosters (e.g. "very good")
• VADER: Python-based lexicon and sentiment
analysis package. Contains only ~6,000 rated
words but does address negations and
boosters
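To show why negations and boosters matter, here is a deliberately simplified sketch. It is not VADER's actual rule set: the word lists, the multipliers and the one-to-two-word look-back window are all assumptions chosen for illustration.

```python
import re

# Hypothetical word lists and multipliers (not VADER's real values).
LEXICON = {"good": 1.0, "bad": -1.0, "happy": 1.0, "sad": -1.0}
NEGATIONS = {"not", "never", "no"}
BOOSTERS = {"very": 1.5, "really": 1.5, "slightly": 0.5}

def score(text):
    """Sum word ratings, scaling by a preceding booster and flipping on negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        rating = LEXICON[token]
        prev = tokens[i - 1] if i >= 1 else None
        prev2 = tokens[i - 2] if i >= 2 else None
        if prev in BOOSTERS:
            # "very good": scale the rating, then look one word further back.
            rating *= BOOSTERS[prev]
            negator = prev2
        else:
            negator = prev
        if negator in NEGATIONS:
            # "not good" / "not very good": flip the sign.
            rating = -rating
        total += rating
    return total
```

With these rules "good" scores 1.0, "not good" scores -1.0, "very good" scores 1.5 and "not very good" scores -1.5, whereas a plain word-lookup would rate all four as 1.0.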
Model overview
4 different lexicons +
VADER
Lexicon Comparison over Time
• Facebook comments on the Guardian's Facebook page over a period of
approx. one month (27th Feb – 31st March)
• Sentiment calculated using 4 different lexicons + VADER. Scores are
normalised to the range -1 to 1
• 24h MA: While a moving average is useful for removing noise, data at the edges
is lost and thus the sentiment tends to level off. Nevertheless, such smoothing
can be useful for getting a sense of the emotional trajectory.
Commonalities in the sentiment trajectory exist between the lexicons, which is good
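The normalisation and smoothing steps can be sketched as below. The slides do not say how scores were normalised or whether the moving average was trailing or centred, so dividing by the largest absolute score and using a simple full-window average are assumptions.

```python
def normalise(scores):
    """Rescale scores into [-1, 1] by dividing by the largest absolute value."""
    if not scores:
        return []
    peak = max(abs(score) for score in scores)
    return [score / peak if peak else 0.0 for score in scores]

def moving_average(values, window):
    """Mean over each full window; returns len(values) - window + 1 points."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# AFINN-style scores on a -5 to 5 scale, rescaled to -1 to 1.
rescaled = normalise([-5, 0, 2.5, 5])        # [-1.0, 0.0, 0.5, 1.0]

# A window of 3 over 5 points leaves 3 points: the edges are lost.
smoothed = moving_average([1, 2, 3, 4, 5], 3)  # [2.0, 3.0, 4.0]
```

With hourly values and a 24-hour window, 23 points are lost in the same way, which is the edge effect noted in the bullet above.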
VADER: positive vs. negative sentiment trajectories
• Big jump in the positive sentiment due** to MasterChef
• Big jump in the negative sentiment due** to the terrorist attack in Westminster
**Currently working to detect significant changes in sentiment and
identify which comments/posts contribute most to them.
Problems
• Long text
• Noisy comments: many comments contain just a name
• Context is relevant
• A keyword-based approach depends entirely on its set of
keywords: sentences without any keyword are treated as
carrying no sentiment at all.
• Keywords can have multiple or vague meanings, as
most words change their meaning according to
usage and context.
Sentiment in longer texts
Lexicon-based sentiment analysis is known to work better with short texts,
such as tweets, which are usually straight to the point.
Sentiment analysis of discussions, comments and blogs tends to be a much
harder task, since they generally involve multiple entities, multiple opinions,
comparisons, noise, sarcasm, etc. The longer the text, the more neutral the
sentiment tends to be.
Internal feedback responses
• Lexicon approach had only moderate success, as
domain-specific text does not always contain
sentiment keywords
• Machine learning:
1. Pre-processing
2. Feature extraction
3. Classification
4. Evaluation
• 15-20% improvement on Lexicon approach
NLTK
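The four machine learning steps above can be sketched end to end in plain Python. The slides do not say which features or classifier were used, so the bag-of-words features, the Naive Bayes classifier and the feedback sentences below are all assumptions, and this toy example says nothing about the 15-20% improvement reported.

```python
import math
import re
from collections import Counter, defaultdict

def preprocess(text):
    """Step 1: lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def extract_features(tokens):
    """Step 2: bag-of-words counts."""
    return Counter(tokens)

class NaiveBayes:
    """Step 3: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, feature_sets, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(feature_sets, labels):
            self.word_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total_docs = sum(self.label_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.label_counts.items():
            logp = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word, count in feats.items():
                if word in self.vocab:  # ignore words never seen in training
                    logp += count * math.log((self.word_counts[label][word] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

def accuracy(model, texts, labels):
    """Step 4: share of texts whose predicted label matches the gold label."""
    predictions = [model.predict(extract_features(preprocess(t))) for t in texts]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Made-up feedback responses with clerically assigned labels.
train_texts = [
    "really useful and clear talk",
    "clear and engaging speaker",
    "too long and hard to follow",
    "hard to hear and too rushed",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayes().fit(
    [extract_features(preprocess(t)) for t in train_texts], train_labels
)
held_out_accuracy = accuracy(
    model,
    ["clear and useful", "too rushed and hard to follow"],
    ["positive", "negative"],
)
```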
Where to now?
• Further exploration using scikit-learn
• Distributional semantics (word2vec, GloVe)
using the Python packages gensim / spaCy
• Deep learning: https://blog.openai.com/unsupervised-sentiment-neuron/
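Distributional approaches such as word2vec and GloVe represent each word as a vector, so that related words end up close together. The sketch below compares vectors with cosine similarity; the three-dimensional vectors are invented for illustration, whereas real embeddings are learned from text and have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical word vectors, not taken from any trained model.
vectors = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "terrible": [-0.7, 0.1, 0.2],
}

similar = cosine_similarity(vectors["good"], vectors["great"])
dissimilar = cosine_similarity(vectors["good"], vectors["terrible"])
```

Here `similar` is positive and `dissimilar` is negative, reflecting that the made-up vectors for "good" and "great" point in nearly the same direction.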
Further Information
• Big Data Team
www.ons.gov.uk/aboutus/whatwedo/programmesandprojects/theonsbigdataproject
• Big data team GitHub:
• https://github.com/ONSBigData
• Emails:
• ons.big.data.project@ons.gov.uk
• Alessandra.sozzi@ons.gsi.gov.uk
• kimberley.brett@ons.gov.gsi.uk
• With thanks to Theodore Manassis, Charles Morris and Karen Gask
