Natural Language Processing
(NLP) techniques for structuring
large volumes of human text data
Alessandra Sozzi, Kimberley Brett
Office for National Statistics
Overview
• Introduction to NLP and context of use within
ONS
• Property data: an example of NLP and
machine learning
• Sentiment analysis of text:
• Automating internal feedback
• Understanding daily public satisfaction
What is Natural Language Processing (NLP)?
• Using computer algorithms and code to
understand, and sometimes classify, large
volumes of unstructured human text.
• Can help to automate analysis previously
done by hand
• Useful in government as there are many free
text fields with rich information
Property websites: Zoopla
Project: Intelligence from housing data
• Supplement address register information to
provide insight for census field staff
• Pilot (Karen Gask): Used Zoopla API to
identify caravan properties
• Caravans: inconsistently recorded in other
data sources
• Natural Language Processing and Machine
learning approaches in Python
Training
• Binary features created from the property
description and property type
• Data split into 80% training, 20% testing
• Machine learning algorithms tested:
Logistic regression, decision trees, random forests,
support vector machines (SVM)
• Evaluation: F1 scores and cross validation
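The feature creation and 80/20 split described above can be sketched as follows. This is a minimal illustration rather than the pilot's code: the keyword list, the property-type value, the sample listings and the helper names are all hypothetical, and the split is hand-rolled here purely to keep the example self-contained.

```python
import random

# Hypothetical keywords; the pilot's actual feature set is not given in the slides.
KEYWORDS = ["caravan", "park home", "static", "holiday park"]

def binary_features(description, property_type):
    """Return a 0/1 vector: one flag per keyword, plus one for the property type."""
    text = description.lower()
    flags = [1 if kw in text else 0 for kw in KEYWORDS]
    # Hypothetical property-type value, used only for illustration.
    flags.append(1 if property_type.lower() == "mobile/park home" else 0)
    return flags

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split them into training and testing sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Made-up listings: (description, property type, is_caravan label)
listings = [
    ("Spacious static caravan on a quiet holiday park", "Mobile/park home", 1),
    ("Two bedroom terraced house near the station", "Terraced house", 0),
    ("Well presented park home with garden", "Mobile/park home", 1),
    ("Modern flat with balcony", "Flat", 0),
    ("Detached bungalow, no onward chain", "Bungalow", 0),
] * 2  # repeated to give 10 rows

rows = [(binary_features(d, t), y) for d, t, y in listings]
train, test = train_test_split(rows)
print(len(train), len(test))  # 8 2
```

Each classifier would then be fitted on `train` and compared on held-out data, with the fixed seed keeping the split reproducible between runs.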
Testing
• Support Vector Machines (SVM) performed best in
training
• SVM applied to the test set, attaining an F1 score of ~0.917
• Of these (51 in total):
34 in exact location on address register
11 in nearby location
6 not on address register – valuable additions
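For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts; the counts passed in are invented for illustration and are not the pilot's results. The second part simply converts the 34/11/6 breakdown reported on the slide into percentages.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 = harmonic mean of precision and recall; 0.0 if there are no true positives."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only (not the pilot's confusion matrix).
print(round(f1_score(true_positives=8, false_positives=1, false_negatives=1), 3))  # 0.889

# Shares of the 51 properties reported on the slide.
breakdown = {"exact location": 34, "nearby location": 11, "not on register": 6}
total = sum(breakdown.values())
shares = {k: round(100 * v / total, 1) for k, v in breakdown.items()}
print(total, shares)
```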
Pilot extended
• Acquired a larger Zoopla dataset; using similar
methods, with a focus on the SVM approach
• Census test areas:
Blackpool, Barnsley & Sheffield, Southwark, West
Dorset & South Somerset, Northern Powys
• Further investigation:
• Whether caravan is residential/ holiday home
• Gated communities and retirement properties.
Issues
• Data not available for the whole of the UK, as not all
properties are advertised via Zoopla
• Not all listings have a description
• Census test areas: other local authorities (LAs) may be more/less
likely to have those property types
• Time needed to acquire the data, clean it, etc.
• Estate agents embellish descriptions
• Spelling: data may have been input in a rush
Sentiment analysis: Projects
• Project with EuroStat: sentiment analysis of
public forums
• Blogs, comments on news sites, social media
• Undertaken by ONS colleagues Alessandra Sozzi and
Charles Morris
• Internal project:
• Sentiment analysis of feedback responses from
an internal talk
Sentiment analysis
• Type of Natural Language Processing
• Positive or negative sentiment
• Analyse different emotions
• Plutchik’s eight emotions
Anger
Trust
Surprise
Joy
Fear
Disgust
Anticipation
Sadness
Approaches
• Lexicon-based
• Corpus of words rated by sentiment expressed
• Text run through this corpus and given ratings
• Machine learning
• Builds on the lexicon-based approach by learning from
ratings in a labelled training set.
• Clerically reviewed gold standards
• Essential to evaluate performance
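A minimal sketch of the lexicon-based approach follows. The five-word lexicon, its ratings and the choice to average the matched ratings are assumptions made for illustration; real lexicons hold thousands of rated words and packages differ in how they aggregate.

```python
import re

# Tiny hypothetical lexicon; real lexicons hold thousands of rated words.
LEXICON = {"good": 1.0, "great": 1.0, "helpful": 0.5, "bad": -1.0, "boring": -0.5}

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def lexicon_score(text):
    """Average the ratings of the words found in the lexicon (0.0 if none match)."""
    ratings = [LEXICON[t] for t in tokenize(text) if t in LEXICON]
    return sum(ratings) / len(ratings) if ratings else 0.0

print(lexicon_score("A great talk, really helpful"))    # 0.75
print(lexicon_score("The slides were bad and boring"))  # -0.75
print(lexicon_score("It started at nine"))              # 0.0 (no rated words)
```

The last call shows a limitation of the approach: text containing no rated word is scored as neutral whatever it actually expresses.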
Different lexicons
• Many different lexicons, but the following
have been used in our analysis:
• NRC
• Very popular. Contains about 14,000 rated words. Scale
between -1 and 1.
• Bing
• Contains around 6,000 words. Scale between -1 and 1.
• AFINN
• Contains about 4,000 words. Scale between -5 and 5.
• Syuzhet
VADER
• Problem with other lexicons: they do not handle negations
(e.g. "not good") or boosters (e.g. "very good")
• VADER: Python-based lexicon and sentiment
analysis package. Contains only ~6,000 rated
words but does address negations and
boosters
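To show why negations and boosters matter, here is a deliberately simplified sketch. It is not VADER's algorithm: the mini-lexicon, the booster weights and the two-word look-back window are all hypothetical choices made for this example.

```python
import re

# Hypothetical mini-lexicon and rule weights, chosen for illustration only.
LEXICON = {"good": 1.0, "bad": -1.0, "happy": 1.0}
NEGATIONS = {"not", "never", "no"}
BOOSTERS = {"very": 1.5, "really": 1.5, "slightly": 0.5}

def rule_aware_score(text):
    """Sum word ratings, scaling by a preceding booster and flipping on a preceding negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        rating = LEXICON[token]
        window = tokens[max(0, i - 2):i]  # look back at up to two words
        for word in window:
            if word in BOOSTERS:
                rating *= BOOSTERS[word]
            if word in NEGATIONS:
                rating = -rating
        total += rating
    return total

print(rule_aware_score("good"))           # 1.0
print(rule_aware_score("very good"))      # 1.5
print(rule_aware_score("not good"))       # -1.0
print(rule_aware_score("not very good"))  # -1.5
```

A plain word-lookup lexicon would give all four phrases the same score, since "good" is the only rated word in each.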
Model overview
4 different lexicons + VADER
Lexicon Comparison over Time
• Facebook comments to the Guardian Facebook page over the period of
approx. one month (27th Feb – 31st March)
• Sentiment calculated using 4 different lexicons + VADER. Scores are
normalised from -1 to 1
• 24h MA (moving average): While a moving average is useful to remove noise, data on the edges
is lost and thus the sentiment tends to level off. Nevertheless, such smoothing
can be useful for getting a sense of the emotional trajectory.
Commonalities in the sentiment trajectory exist between the lexicons, which is good
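The normalisation and smoothing steps can be sketched as below. The slides do not say how scores were normalised, so dividing by the lexicon's maximum absolute rating is an assumption, as are the made-up hourly scores and the function names. A real 24h moving average over hourly data would use `window=24`; a window of 3 is used here to keep the example short.

```python
def normalise(scores, max_abs):
    """Rescale scores from [-max_abs, max_abs] to [-1, 1]."""
    return [s / max_abs for s in scores]

def moving_average(values, window):
    """Simple moving average; returns len(values) - window + 1 points, so edge points are lost."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window for i in range(len(values) - window + 1)]

# Made-up hourly scores on an AFINN-style -5..5 scale.
hourly = [5, -5, 0, 5, -5, 0]
scaled = normalise(hourly, max_abs=5)
smoothed = moving_average(scaled, window=3)
print(scaled)    # [1.0, -1.0, 0.0, 1.0, -1.0, 0.0]
print(smoothed)  # 4 values instead of 6, all 0.0: flatter than the raw scores
```

The output illustrates both effects noted on the slide: two edge points are lost, and the smoothed series levels off towards neutral.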
VADER: positive vs. negative sentiment trajectories
Big jump in the positive sentiment due** to MasterChef
Big jump in the negative sentiment due** to the terrorist attack in Westminster.
**Currently working to detect significant changes in sentiment and
identify which comments/posts contribute the most to them.
Problems
• Long text
• Noisy comments: many comments contain just a name
• Context is relevant
• A keyword-based approach relies entirely on its set of
keywords: a sentence containing no keyword is treated
as carrying no sentiment at all.
• Keywords can have multiple or vague meanings, as
most words change their meaning according to
usage and context.
Sentiment in longer texts
Lexicon-based sentiment analysis is known to work better with short text,
such as tweets, which are usually straight to the point.
Sentiment analysis of discussions, comments and blogs tends to be a much
harder task, since they generally involve multiple entities, multiple opinions,
comparisons, noise, sarcasm, etc. The longer the text, the more neutral the
sentiment tends to be.
Internal feedback responses
• Lexicon approach had only moderate success, as
domain-specific text does not always contain
sentiment keywords
• Machine learning:
1. Pre-processing
2. Feature extraction
3. Classification
4. Evaluation
• 15-20% improvement on Lexicon approach
NLTK
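The four machine learning steps above can be sketched end to end as follows. The slides do not say which classifier or features were used, so the bag-of-words features and the hand-written Naive Bayes classifier here are stand-ins chosen to keep the example self-contained; the stopword list and the feedback responses are invented.

```python
import math
import re
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "was", "it", "and", "to", "of", "i"}

def preprocess(text):
    """1. Pre-processing: lower-case, tokenise, drop stopwords."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]

def extract_features(text):
    """2. Feature extraction: bag-of-words counts."""
    return Counter(preprocess(text))

class NaiveBayes:
    """3. Classification: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(extract_features(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        features = extract_features(text)
        n_docs = sum(self.label_counts.values())
        best_label, best_logp = None, -math.inf
        for label in sorted(self.label_counts):
            total = sum(self.word_counts[label].values()) + len(self.vocab)
            logp = math.log(self.label_counts[label] / n_docs)
            for word, count in features.items():
                logp += count * math.log((self.word_counts[label][word] + 1) / total)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

def accuracy(model, texts, labels):
    """4. Evaluation: share of held-out responses classified correctly."""
    hits = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return hits / len(labels)

# Made-up feedback responses standing in for clerically labelled data.
train_texts = ["really useful and clear talk", "clear examples, useful content",
               "too long and confusing", "confusing slides, too rushed"]
train_labels = ["positive", "positive", "negative", "negative"]
test_texts = ["useful and clear", "long and confusing"]
test_labels = ["positive", "negative"]

model = NaiveBayes().fit(train_texts, train_labels)
print(accuracy(model, test_texts, test_labels))  # 1.0
```

Unlike a fixed lexicon, the classifier learns domain words such as "clear" or "rushed" from the labelled responses, which is one plausible reason for a gain on domain-specific text.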
Where to now?
• Further exploration using scikit-learn
• Distributional semantics (word2vec, GloVe)
using Python packages gensim / spaCy
• Deep learning: https://blog.openai.com/unsupervised-sentiment-neuron/
Further Information
• Big Data Team
www.ons.gov.uk/aboutus/whatwedo/programmesandprojects/theonsbigdataproject
• Big data team GitHub:
• https://github.com/ONSBigData
• Emails:
• ons.big.data.project@ons.gov.uk
• Alessandra.sozzi@ons.gsi.gov.uk
• kimberley.brett@ons.gov.gsi.uk
• With thanks to Theodore Manassis, Charles Morris and Karen Gask
