This document provides an overview of the components and architecture of a project to acquire unstructured data from various sources for sentiment analysis. The main objectives are to streamline data acquisition, create corpora for contextual opinions and sentiments, and detect trends based on reviews and comments. The proposed architecture uses Python, Django, Scrapy, and MySQL/MongoDB/HBase for data storage, with the R Project and Hadoop for text mining and massive storage. It describes how crawlers and APIs will be used to gather data from social media and other sources for preprocessing, analysis, and output of results.
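The acquisition-and-preprocessing flow described above can be sketched as follows. This is a minimal sketch under stated assumptions: `normalize_record` and `build_corpus` are hypothetical names, and a real pipeline would gather items with Scrapy spiders or APIs and persist them to MySQL/MongoDB/HBase rather than an in-memory list.

```python
from datetime import datetime, timezone

def normalize_record(source, text, fetched_at=None):
    """Normalize a raw review/comment into a corpus-ready record.

    Preprocessing here is deliberately minimal: strip whitespace and
    drop empty documents before they reach the analysis stage.
    """
    text = text.strip()
    if not text:
        return None
    return {
        "source": source,
        "text": text,
        "fetched_at": fetched_at or datetime.now(timezone.utc).isoformat(),
    }

def build_corpus(raw_items):
    """Turn raw (source, text) pairs into a de-duplicated corpus list."""
    seen, corpus = set(), []
    for source, text in raw_items:
        record = normalize_record(source, text)
        if record and record["text"] not in seen:
            seen.add(record["text"])
            corpus.append(record)
    return corpus
```

In a full deployment, each crawler would feed `build_corpus` incrementally, and the resulting records would be the input to the sentiment analysis stage.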
Can Deep Learning solve the Sentiment Analysis Problem? - Mark Cieliebak
Sentiment analysis appears to be one of the easier tasks in the realm of text analytics: given a text like a tweet or product review, decide whether it contains positive or negative opinion. This task is almost trivial for humans, but it turns out to be a true challenge for automated systems. In fact, state-of-the-art sentiment analysis tools are wrong on approx. 4 out of 10 documents.
Current sentiment analysis tools are rule-based, feature-based, or combinations of both. However, recent research uses deep learning on very large sets of documents.
In this talk, we will explain the intrinsic difficulties of automated sentiment analysis; present existing solution approaches and their performance; describe an architecture for a deep learning system; and explore whether deep learning can improve sentiment analysis accuracy.
Researchers have long known that the words of a text contain more information than appears on the surface. As such, texts have been studied for subtexts and other latent or hidden information. One approach has involved the machine-enabled analysis of human sentiment, usually mapped onto a positive-negative polarity. NVivo 11 Plus (a qualitative research tool released in late 2015) enables the automated sentiment analysis of texts (coded research, formal articles, text corpora, Tweetstream datasets, Facebook wall posts, websites, and other sources) based on four categories: very positive, moderately positive, moderately negative, and very negative. The tool compares the target text set against a sentiment dictionary and enables coding at different units of analysis: sentence, paragraph, or cell. Further, the sentiment capability extracts the coded text into respective text sets, which may be further analyzed using text frequency counts, text searches, automated theme and sub-theme extraction (topic modeling), and data visualizations.
Sentiment analysis over Twitter offers organisations and individuals a fast and effective way to monitor the public's feelings towards them and their competitors. To assess the performance of sentiment analysis methods over Twitter, a small set of evaluation datasets has been released in the last few years. In this paper we present an overview of eight publicly available and manually annotated evaluation datasets for Twitter sentiment analysis. Based on this review, we show that a common limitation of most of these datasets, when assessing sentiment analysis at target (entity) level, is the lack of distinctive sentiment annotations between the tweets and the entities contained in them. For example, the tweet "I love iPhone, but I hate iPad" can be annotated with a mixed sentiment label, but the entity iPhone within this tweet should be annotated with a positive sentiment label. Aiming to overcome this limitation, and to complement current evaluation datasets, we present STS-Gold, a new evaluation dataset where tweets and targets (entities) are annotated individually and therefore may present different sentiment labels. This paper also provides a comparative study of the various datasets along several dimensions, including total number of tweets, vocabulary size, and sparsity. We also investigate the pair-wise correlation among these dimensions as well as their correlations to the sentiment classification performance on different datasets.
Most existing approaches to Twitter sentiment analysis assume that sentiment is explicitly expressed through affective words. Nevertheless, sentiment is often implicitly expressed via latent semantic relations, patterns and dependencies among words in tweets. In this paper, we propose a novel approach that automatically captures patterns of words of similar contextual semantics and sentiment in tweets. Unlike previous work on sentiment pattern extraction, our proposed approach does not rely on external and fixed sets of syntactical templates/patterns, nor requires deep analyses of the syntactic structure of sentences in tweets.
We evaluate our approach with tweet- and entity-level sentiment analysis tasks by using the extracted semantic patterns as classification features in both tasks. We use 9 Twitter datasets in our evaluation and compare the performance of our patterns against 6 state-of-the-art baselines. Results show that our patterns consistently outperform all other baselines on all datasets by 2.19% at the tweet-level and 7.5% at the entity-level in average F-measure.
In recent times, research activities in the areas of opinion and sentiment analysis in natural language texts and other media have been gaining ground under the umbrella of subjectivity analysis. The reason may be the huge amount of text data available on the Social Web in the form of news, reviews, blogs, chats and even Twitter. Though sentiment analysis of natural language text is a multifaceted and multidisciplinary problem, in general the term "sentiment" is used in reference to the automatic analysis of evaluative text.
Seminar presentation made by me for the topic of 'Resources for Sentiment Analysis' at IIT Bombay. Includes a set of bonus slides for additional information which was not actually presented.
Lexicon-based approaches to Twitter sentiment analysis are gaining much popularity due to their simplicity, domain independence, and relatively good performance. These approaches rely on sentiment lexicons, where a collection of words is marked with fixed sentiment polarities. However, words' sentiment orientation (positive, neutral, negative) and/or sentiment strength can change depending on context and targeted entities. In this paper we present SentiCircle, a novel lexicon-based approach that takes into account the contextual and conceptual semantics of words when calculating their sentiment orientation and strength in Twitter. We evaluate our approach on three Twitter datasets using three different sentiment lexicons. Results show that our approach significantly outperforms two lexicon baselines. Results are competitive but inconclusive when comparing to the state-of-the-art SentiStrength, and vary from one dataset to another. SentiCircle outperforms SentiStrength in accuracy on average, but falls marginally behind in F-measure.
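The fixed-polarity lexicon baseline that SentiCircle improves upon can be sketched in a few lines. This is not the SentiCircle algorithm itself, only the simple lexicon scoring it is contrasted with; the toy `LEXICON` stands in for a real resource such as SentiWordNet or the MPQA lexicon.

```python
# Toy sentiment lexicon with fixed polarities; real systems load a
# resource such as SentiWordNet instead of hard-coding values.
LEXICON = {"love": 2.0, "great": 1.5, "good": 1.0, "bad": -1.0, "hate": -2.0}

def lexicon_score(tweet):
    """Sum fixed word polarities; the sign gives the predicted label.

    This ignores context entirely, which is exactly the limitation
    context-aware approaches like SentiCircle address.
    """
    tokens = tweet.lower().split()
    score = sum(LEXICON.get(t, 0.0) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Note that a mixed tweet like "I love and hate it" cancels to neutral here, illustrating why per-entity, context-sensitive scoring is needed.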
A text extraction workshop delivered by Cameron Buckner on Friday, October 18th, 2012 as part of the University of Houston Digital Humanities Initiative.
Target-Based Sentiment Analysis as a Sequence-Tagging Task - jcscholtes
In November 2019, Zoe Gerolemou successfully presented our paper on Target-Based Sentiment Analysis as a Sequence-Tagging Task at the Benelux Artificial Intelligence Conference (BNAIC 2019). In this research, we were able not only to detect sentiments with very high confidence, but also to determine WHO expressed these sentiments and about WHAT. Many great questions and several other very interesting presentations in the NLP session.
Best Practices for Large Scale Text Mining Processing - Ontotext
Q&A:
NOW facilitates semantic search by having annotations attached to search strings. How complex does that get, e.g. with wildcards between annotated strings?
NOW’s searchbox is quite basic at the moment, but still supports a few scenarios.
1. Pure concept/faceted search - search for all documents containing a concept or in which a set of concepts co-occur. Ranking is based on frequency of occurrence.
2. Concept/faceted + full text search - search for both concepts and a particular textual term or phrase.
3. Full text search
With search, pretty much anything can be done to customise it. For the NOW showcase we’ve kept it fairly simple, as usually every client has a slightly different case and wants to tune search in a slightly different direction.
The search in NOW is faceted which means that you search with concepts (facets) and you retrieve all documents which contain mentions of the searched concept. If you search by more than one facet the engine retrieves documents which contain mentions of both concepts but there is no restriction that they occur next to each other.
Is the tagging service expandable (say, with custom ontologies)? Also, is it something you offer as a service? It is unclear to me from the website.
The TAG service is used for demonstration purposes only. The models behind it are trained for annotating news articles. The pipeline is customizable for every concrete scenario and for different domains and entities of interest. You can access several of our pipelines as a service through the S4 platform, or you can have them hosted as an on-premise solution. In some cases our clients want domain adaptation, improvements in a particular area, or tagging with their internal dataset; in that case we again offer an on-premise deployment, and also a managed service hosted on our hardware.
Does your system accommodate cluster analysis using unsupervised keyword/phrase annotation for knowledge discovery?
Insofar as patterns of user behaviour also count as knowledge discovery, we employ them for suggesting related reads. Apart from this, we have experience tailoring custom clustering pipelines, which also rely on features such as keywords and named entities.
For topic extraction, how many topics can we extract? From a Twitter corpus, what can we infer?
For topic extraction we have determined that we obtain the best results when suggesting 3 categories. These are taken from the IPTC taxonomy, but only its uppermost levels, which comprise fewer than 20 categories.
The Twitter corpus example is from a project Ontotext participates in called Pheme. The goal of the project is to detect rumours and check their veracity, thus helping journalists in their hunt for attractive news.
Do you provide Processing Resources and JAPE rules for the GATE framework that can be used with GATE Embedded?
We contribute to the GATE framework, and everything that has been wrapped up as Processing Resources has been included in the corresponding GATE distributions.
Charlie Greenbacker, founder and co-organizer of the DC NLP meetup group, provides a "crash course" in Natural Language Processing techniques and applications.
myassignmenthelp is a premier service provider for NLP-related assignments and projects. The given PPT describes the processes involved in NLP programming, so whenever you need help with any work related to natural language processing, feel free to get in touch with us.
Topic Modelling: Tutorial on Usage and Applications - Ayush Jain
This is a tutorial on topic modelling techniques that informs the reader about the basic ingredients of all topic models and allows them to develop a new model at the end.
Xbots provides chatbot and conversational AI solutions for businesses, personalizing the customer experience. Businesses have an opportunity to capitalize on the chatbot opportunity and build a presence where their customers are: messengers.
Visit us: www.xbots.ai
Contact: info@xbots.ai
Using artificial intelligence, and more specifically bots, in social media to communicate with customers, and what its impact will be for humans in customer service.
Adopt clustering and sentiment analysis to find out which topics Jeb Bush cared about most, based on his email contents during 1999-2006, and compare them with voters' opinions from Twitter.
Creating a Next-Generation Big Data Architecture - Perficient, Inc.
If you’ve spent time investigating Big Data, you quickly realize that the issues surrounding it are often complex to analyze and solve. The sheer volume, velocity and variety change the way we think about data, including how enterprises approach data architecture.
Significant reduction in costs for processing, managing, and storing data, combined with the need for business agility and analytics, requires CIOs and enterprise architects to rethink their enterprise data architecture and develop a next-generation approach to solve the complexities of Big Data.
Creating the data architecture while integrating Big Data into the heart of the enterprise data architecture is a challenge. This webinar covered:
-Why Big Data capabilities must be strategically integrated into an enterprise’s data architecture
-How a next-generation architecture can be conceptualized
-The key components to a robust next generation architecture
-How to incrementally transition to a next generation data architecture
Project Report for Twitter Sentiment Analysis done using Apache Flume, with the data analysed using Hive.
I intend to address the following questions:
How can raw tweets be used to find an audience's perception of or sentiment about a person?
How can Hadoop be used to solve this problem?
How can Apache Hive be used to organize the final data in a tabular format and query it?
How can a data visualization tool be used to display the findings?
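The Hive step in this pipeline is essentially a roll-up of per-tweet labels into per-label counts (a `SELECT sentiment, COUNT(*) ... GROUP BY sentiment` over the tweet table). A hedged Python sketch of that aggregation, with a deliberately crude keyword classifier standing in for the real analysis stage:

```python
from collections import Counter

def classify(tweet, positive=("good", "great"), negative=("bad", "awful")):
    """Crude keyword classifier; a stand-in for the real sentiment step."""
    tokens = tweet.lower().split()
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

def sentiment_counts(tweets):
    """Per-label totals: the same roll-up the Hive GROUP BY produces."""
    return Counter(classify(t) for t in tweets)
```

The resulting counts are exactly the tabular data a visualization tool would chart.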
This slide deck gives an overview of chatbots and the trend toward "messenger as a platform" or "messenger as the new UI".
As Facebook unveiled at the previous f8 that it had opened its chatbot capability to the public, the chatbot (with AI) movement is gaining traction. In line with this, the deck considers what will happen and what the impact on the existing market will be.
Ahead of f8, there have been many rumours about Facebook's moves, so I have summarized the current activity around messengers.
The focus is on the big trends, such as bots with AI and "messenger as a platform"; details will follow later.
Employee Management System UML Diagrams: Use Case Diagram, Activity Diagram, S... - Mohammad Karim Shahbaz
The system as designed is called the Employee Management System (EMS). The Employee Management System is documented using UML diagrams and is very easy to understand. It is designed to manage the recruitment and new-employee registration process and to manage each employee's data. An Attendance Management System and a Salary Management System are also embedded. UML diagrams (Use Case Diagram, Activity Diagram, State Chart Diagram or State Machine, Sequence Diagram, Class Diagram, Deployment Diagram, Component Diagram) and text are used for this documentation. NU, BCS
NOTE: this is total documentation, You can also find this Documentation Related Presentation (.ppt) here:
http://www.slideshare.net/mohammadkarim3785/employee-management-system-uml
SentiTweet is a sentiment analysis tool for identifying the sentiment of tweets as positive, negative or neutral. SentiTweet comes to the rescue to find the sentiment of a single tweet or a set of tweets. It also enables you to find the sentiment of the entire tweet or of specific phrases within it.
Make a query on a topic of interest and see the sentiment for the day as a pie chart, or for the week as a line chart, for tweets gathered from twitter.com.
Supervised Learning Based Approach to Aspect Based Sentiment Analysis - Tharindu Kumara
Aspect Based Sentiment Analysis (ABSA) systems receive as input a set of texts (e.g., product reviews) discussing a particular entity (e.g., a new model of a laptop). The systems attempt to identify the main (e.g., the most frequently discussed) aspects (features) of the entity (e.g., battery, screen) and to estimate the average sentiment of the texts per aspect (e.g., how positive or negative the opinions are on average for each aspect).
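The two ABSA sub-tasks described here (aspect identification and per-aspect sentiment averaging) can be sketched as follows. This is a toy illustration, not the supervised approach of the talk: `ASPECTS` and `POLARITY` are hypothetical hand-made resources, whereas a real system would learn both from labelled data.

```python
from collections import defaultdict

# Hypothetical resources; a supervised ABSA system learns these instead.
ASPECTS = {"battery", "screen"}
POLARITY = {"great": 1, "good": 1, "poor": -1, "bad": -1, "terrible": -2}

def aspect_sentiment(reviews):
    """Average a crude whole-review polarity score per aspect, over all
    reviews mentioning that aspect."""
    scores = defaultdict(list)
    for review in reviews:
        tokens = review.lower().replace(",", " ").split()
        polarity = sum(POLARITY.get(t, 0) for t in tokens)
        for aspect in ASPECTS & set(tokens):
            scores[aspect].append(polarity)
    return {a: sum(v) / len(v) for a, v in scores.items()}
```

Because the sketch scores the whole review rather than the span around each aspect mention, a review praising the battery but criticizing the screen would blur both aspects together; resolving that is precisely what sequence-level ABSA models are for.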
Dice.com Bay Area Search: Beyond Learning to Rank Talk - Simon Hughes
This talk describes how to implement conceptual search (semantic search) within a modern search engine using the word2vec algorithm to learn concepts. We also cover how to auto-tune the search engine parameters using black box optimization techniques, and the problems of feedback loops encountered when building machine learning systems that modify the user behavior used to train the system.
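The core of conceptual search as described above is representing queries and documents as averaged word vectors and comparing them by cosine similarity. A minimal sketch, with tiny hand-made vectors standing in for embeddings actually learned by word2vec:

```python
import math

# Toy 2-d embeddings standing in for vectors learned by word2vec.
VECTORS = {
    "developer": [0.9, 0.1], "engineer": [0.85, 0.15],
    "java": [0.2, 0.9], "python": [0.25, 0.85],
}

def embed(text):
    """Average the word vectors of known tokens; None if none are known."""
    vecs = [VECTORS[t] for t in text.lower().split() if t in VECTORS]
    if not vecs:
        return None
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)
```

In this toy space, "developer" scores much closer to "engineer" than to "java", which is the behaviour that lets a conceptual search engine match related terms a keyword index would miss.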
Abaca: Technically Assisted Sensitivity Review of Digital Records
Presentation of our proof-of-concept classifier for assisting sensitivity review of digital records.
Abacá: Technically Assisted Sensitivity Review of Digital Records - ProjectAbaca
The Abacá team spent a very enjoyable and productive half day in Aberystwyth, Wales, with the company of The Welsh Government, National Library of Wales and a member of the Royal Commission on the Ancient and Historical Monuments of Wales. Here are our workshop slides.
This tutorial gives an overview of how search engines and machine learning techniques can be tightly coupled to address the need for building scalable recommender or other prediction-based systems. Typically, most such systems architect retrieval and prediction in two phases. In Phase I, a search engine returns the top-k results based on constraints expressed as a query. In Phase II, the top-k results are re-ranked in another system according to an optimization function that uses a supervised trained model. However, this approach presents several issues, such as the possibility of returning sub-optimal results due to the top-k limit during query, as well as the presence of inefficiencies in the system due to the decoupling of retrieval and ranking.
To address this issue the authors created ML-Scoring, an open source framework that tightly integrates machine learning models into Elasticsearch, a popular search engine. ML-Scoring replaces the default information retrieval ranking function with a custom supervised model that is trained through Spark, Weka, or R that is loaded as a plugin in Elasticsearch. This tutorial will not only review basic methods in information retrieval and machine learning, but it will also walk through practical examples from loading a dataset into Elasticsearch to training a model in Spark, Weka, or R, to creating the ML-Scoring plugin for Elasticsearch. No prior experience is required in any system listed (Elasticsearch, Spark, Weka, R), though some programming experience is recommended.
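The decoupled two-phase baseline that ML-Scoring replaces can be sketched as follows. This is a hedged illustration of the retrieve-then-rerank pattern only, not ML-Scoring's in-engine plugin; the term-overlap score stands in for the search engine's default IR score, and `model_score` stands in for a model trained in Spark, Weka, or R.

```python
def retrieve_top_k(query_terms, docs, k=3):
    """Phase I: rank by simple term overlap (a stand-in for the search
    engine's default IR scoring) and keep only the top k matches."""
    scored = [(sum(t in d["text"].lower() for t in query_terms), d)
              for d in docs]
    scored.sort(key=lambda pair: -pair[0])
    return [d for score, d in scored[:k] if score > 0]

def rerank(candidates, model_score):
    """Phase II: re-order the retrieved candidates with a trained
    model's score, applied outside the search engine."""
    return sorted(candidates, key=model_score, reverse=True)
```

The weakness the tutorial points out is visible here: any document cut off by `k` in Phase I can never be recovered in Phase II, no matter how highly the model would have scored it.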
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...S. Diana Hu
Search engines have focused on solving the document retrieval problem, so their scoring functions do not handle naturally non-traditional IR data types, such as numerical or categorical. Therefore, on domains beyond traditional search, scores representing strengths of associations or matches may vary widely. As such, the original model doesn’t suffice, so relevance ranking is performed as a two-phase approach with 1) regular search 2) external model to re-rank the filtered items. Metrics such as click-through and conversion rates are associated with the users’ response to items served. The predicted selection rates that arise in real-time can be critical for optimal matching. For example, in recommender systems, predicted performance of a recommended item in a given context, also called response prediction, is often used in determining a set of recommendations to serve in relation to a given serving opportunity. Similar techniques are used in the advertising domain. To address this issue the authors have created ML-Scoring, an open source framework that tightly integrates machine learning models into a popular search engine (SOLR/Elasticsearch), replacing the default IR-based ranking function. A custom model is trained through either Weka or Spark and it is loaded as a plugin used at query time to compute custom scores.
Slides from a short presentation on xAPI Vocabulary and how it can be applied in Learning Analytics, as given at the LAK 2016 JISC Learning Analytics Hackathon.
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
In this talk I describe two approaches for improving the recall and precision of an enterprise search engine using machine learning techniques. The main focus is improving relevancy with ML while using your existing search stack, be that Lucene, Solr, Elasticsearch, Endeca or something else.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Show drafts
volume_up
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
2. Objectives
•Streamline and facilitate the process of unstructured data acquisition
•Create and manage corpora for contextual opinions and sentiments
•Detect trends based on contextual reviews, comments, discussions…
•Run and train models for sentiment or opinion analysis
•Provide figures, results and graphs as outputs
3. Software components
•Python
–Programming language
•Django : Web application container
•Scrapy : Web Crawler
•Libraries : Twitter,
•MySQL / MongoDB / Hbase
–For the time being, no absolute choice is made, but the final solution could be a mix of different databases depending on the nature of the use.
•R Project
–R Project will be used whenever specific text-mining libraries are missing in Python or it becomes easier to use R instead of Python. In that case, the R scripts will be encapsulated in Python programs.
•Hadoop
–For massive storage we will use Hadoop. The architecture is not yet depicted.
–It is used for raw data storage.
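One lightweight way to encapsulate an R script inside a Python program is to invoke `Rscript` as a subprocess. This is only a sketch of the approach described above; the script name `score.R` and its arguments are illustrative, not part of the project.

```python
import shutil
import subprocess

def build_rscript_command(script_path, args=()):
    """Build the command line used to invoke an R script from Python."""
    return ["Rscript", script_path, *args]

def run_r_script(script_path, args=()):
    """Run an R text-mining script and return its stdout, if R is installed."""
    if shutil.which("Rscript") is None:
        raise RuntimeError("Rscript not found on PATH")
    cmd = build_rscript_command(script_path, args)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

A tighter binding (e.g. rpy2) is also possible; the subprocess route keeps the two runtimes decoupled, which matters when R is only used for a few missing text-mining libraries.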
5. Architecture components
1. Data sources : Access will be managed via APIs or crawls. Sources are all those related to social media -> blogs, forums, advisors, social web… In general, all media where sentiment / opinion is expressed.
2. Web interface to interact with the system -> to manage inputs, configurations, outputs…
3. There will be a mix between Scrapy (the crawler) and Python scripts for using APIs. Basically, the engine will be used to gather all data sources and store them for further processing (pre-processing and analysis).
4. There will be a mix between Scrapy (the crawler) and Python scripts for using APIs. Basically, the engine will be used to gather all data sources and store them for further processing.
5. The target database solution is not yet selected. The objective is to store all the relative content, whether it is raw data, configuration items or output results.
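Since the target database is not yet selected, raw gathered documents can be parked on disk in a simple source/date layout until the storage choice is made. The layout below is an assumption for illustration, not the project's chosen schema.

```python
import json
import time
from pathlib import Path

def store_raw(base_dir, source, payload):
    """Store one raw record under <base_dir>/<source>/<YYYY-MM-DD>/
    so crawler and API outputs land in one place for later pre-processing."""
    day = time.strftime("%Y-%m-%d")
    target = Path(base_dir) / source / day
    target.mkdir(parents=True, exist_ok=True)
    path = target / f"{int(time.time() * 1000)}.json"
    path.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
    return path
```

Both the Scrapy pipeline and the API scripts could call the same function, so switching later to MySQL, MongoDB, HBase or Hadoop only means replacing this one storage layer.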
6. Characteristics of Sentiment Analysis
Sentiment = Holder + Polarity + Target + Auxiliary
–Holder: who expresses the sentiment
–Target: what/whom the sentiment is expressed to
–Polarity: the nature of the sentiment (e.g., positive or negative)
“The games in iPhone 4s are pretty funny!”
–Target : the games (a feature/aspect of the iPhone 4s)
–Polarity : positive
–Holder : the user/reviewer
Auxiliary :
•Strength : Differentiate the intensity
•Confidence : Measure the reliability of the sentiment
•Summary : Explain the reason inducing the sentiment
•Time
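The Sentiment = Holder + Polarity + Target + Auxiliary decomposition maps naturally onto a small record type. This is a minimal sketch; the field names simply mirror the slide, and the example values come from the iPhone 4s sentence above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sentiment:
    """Sentiment = Holder + Polarity + Target + Auxiliary."""
    holder: str                          # who expresses the sentiment
    target: str                          # what/whom the sentiment is expressed to
    polarity: str                        # e.g. "positive" or "negative"
    strength: Optional[float] = None     # auxiliary: intensity
    confidence: Optional[float] = None   # auxiliary: reliability
    summary: Optional[str] = None        # auxiliary: reason inducing the sentiment
    time: Optional[str] = None           # auxiliary: when it was expressed

s = Sentiment(holder="the user/reviewer",
              target="the games in iPhone 4s",
              polarity="positive")
```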
7. Basic Tasks
•Holder detection – Find who expresses the sentiment
•Target recognition – Find whom/what the sentiment is expressed towards
•Sentiment (Polarity) classification – Positive, negative, neutral
•Opinion summarization
•Opinion spam detection
8. Subjectivity versus Sentiment
•Sentiment analysis is also known as opinion mining.
•It attempts to identify the opinion/sentiment that a person may hold towards an object.
•It is a finer-grained analysis compared to subjectivity analysis.
9. Lexicon Based Sentiment Classification
Basic idea
•Use the dominant polarity of the opinion words in the sentence to determine its polarity :
•If positive/negative opinion prevails, the opinion sentence is regarded as positive/negative
•Lexicon + Counting
•Lexicon + Grammar Rule + Inference Method
Example Lexicon :
http://www.wjh.harvard.edu/~inquirer
http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
http://sentiwordnet.isti.cnr.it/
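The "Lexicon + Counting" idea can be sketched in a few lines. The tiny word sets below are toy stand-ins for a real opinion lexicon such as the ones linked above; punctuation handling and negation are deliberately ignored.

```python
# Toy opinion lexicons (illustrative only; real systems load a full lexicon file).
POSITIVE = {"good", "great", "funny", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "boring"}

def classify(sentence):
    """Lexicon + counting: the dominant polarity of opinion words wins."""
    words = sentence.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```

The "Lexicon + Grammar Rule + Inference Method" variant would refine this by handling negation ("not good") and intensifiers before counting.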
10. Sentiment Analysis Tasks
Document level
•Task: sentiment classification of reviews
•Classes: positive, negative, and neutral
•Assumption: each document (or review) focuses on a single object (not true in many discussion posts) and contains opinion from a single opinion holder.
Sentence level
•Task 1: identifying subjective/opinionated sentences
•Classes: objective and subjective (opinionated)
•Task 2: sentiment classification of sentences
•Classes: positive, negative and neutral.
•Assumption: a sentence contains only one opinion; not true in many cases.
•Then we can also consider clauses or phrases.
Feature level
•Task 1: Identify and extract object features that have been commented on by an opinion holder (e.g., a reviewer).
•Task 2: Determine whether the opinions on the features are positive, negative or neutral.
•Task 3: Group feature synonyms.
•Produce a feature-based opinion summary of multiple reviews.
11. Some tools
Lexicon-based tools
•Use sentiment and subjectivity lexicons
•Rule-based classifier
•A sentence is subjective if it has at least two words in the lexicon
•A sentence is objective otherwise
Corpus-based tools
•Use corpora annotated for subjectivity and/or sentiment
•Train machine learning algorithms:
•Naïve Bayes
•Decision trees
•SVM
•…
•Learn to automatically annotate new text
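The rule-based classifier described above ("subjective if it has at least two words in the lexicon") is simple enough to sketch directly. The lexicon here is a toy stand-in for a real subjectivity lexicon.

```python
# Toy subjectivity lexicon (illustrative only).
SUBJECTIVITY_LEXICON = {"funny", "great", "awful", "love", "hate", "pretty"}

def is_subjective(sentence):
    """Rule-based classifier: subjective iff at least two lexicon words occur,
    objective otherwise."""
    words = sentence.lower().split()
    hits = sum(w in SUBJECTIVITY_LEXICON for w in words)
    return hits >= 2
```

Corpus-based tools replace this hand-written rule with a classifier (Naïve Bayes, decision trees, SVM, …) trained on annotated sentences.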
13. Sentiment Analysis: Holder detection
Identifying Sources of Opinions with Conditional Random Fields and Extraction Patterns
International officers believe that the EU will prevail.
International officers said US officials want the EU to prevail.
•View source identification as an information extraction task and tackle the problem using sequence tagging and pattern matching techniques simultaneously
•Linear-chain CRF model to identify opinion sources
•Patterns incorporated as features
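In the CRF formulation, each token is described by a feature dictionary, and extraction patterns (such as "an opinion source is often followed by a speech verb") become additional features. The feature set below is an illustrative sketch, not the one from the cited paper.

```python
def token_features(tokens, i):
    """Per-token features of the kind fed to a linear-chain CRF
    for opinion-source tagging (illustrative feature set)."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        # extraction-pattern feature: speech/belief verbs often follow sources
        "next_is_speech_verb": (tokens[i + 1].lower() in {"said", "believe", "want"}
                                if i < len(tokens) - 1 else False),
    }

tokens = "International officers said US officials want the EU to prevail .".split()
```

Here "officers" gets a strong source cue from its right context ("said"), which is exactly the signal the pattern features encode.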
15. Sentiment Analysis: Twitter
1. Tweet normalization – A simple rule-based model – “gooood” to “good”, “luve” to “love”
2. POS tagging – OpenNLP POS tagger
3. Word stemming – A word stem mapping table (about 20,000 entries)
4. Syntactic parsing – A Maximum Spanning Tree dependency parser
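The rule-based normalization step can be sketched as: collapse runs of three or more repeated letters, then look the result up in a mapping table. The two-entry table below is a toy stand-in for the ~20,000-entry table mentioned above.

```python
import re

STEM_MAP = {"luve": "love"}  # toy stand-in for the ~20,000-entry mapping table

def normalize(token):
    """Rule-based tweet normalization: collapse 3+ repeated letters,
    then map known variants via the stem table."""
    collapsed = re.sub(r"(.)\1{2,}", r"\1\1", token.lower())  # "gooood" -> "good"
    if collapsed in STEM_MAP:
        return STEM_MAP[collapsed]
    # also try the fully collapsed form, e.g. "luuuve" -> "luve" -> "love"
    single = re.sub(r"(.)\1+", r"\1", collapsed)
    return STEM_MAP.get(single, collapsed)
```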
16. Crawling scenario : Definition
[Diagram: Scenario x with Instance 1, Instance 2, … Instance n; each instance holds selected URLs and configuration parameters (name, key words, …)]
•Scenario : 1 -> n : Category.
•Theme: n -> n : Scenario
•Scenario : 1 -> n : instance
•The scenario defines the type of crawl we want to run. It is tied to the notion of instance, which is considered as a specific configuration of a scenario.
URL management module
Configuration parameter management module
We will need to look at the GUI being developed for Nutch and take inspiration from it for managing parameters and URLs.
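A scenario and its instances can be represented as plain configuration data. Every field name below simply mirrors the slide (category, themes, selected URLs, key words); the concrete values are hypothetical.

```python
# Hypothetical crawl-scenario configuration; field names mirror the slide.
scenario = {
    "name": "product-reviews",           # illustrative scenario name
    "category": "e-commerce",            # Scenario : 1 -> n : Category
    "themes": ["smartphones"],           # Theme : n -> n : Scenario
    "instances": [                       # Scenario : 1 -> n : instance
        {
            "name": "instance-1",
            "selected_urls": ["https://example.com/reviews"],
            "key_words": ["iphone", "battery"],
        },
    ],
}

def urls_for_scenario(s):
    """Collect every selected URL across all instances of a scenario."""
    return [u for inst in s["instances"] for u in inst["selected_urls"]]
```

A URL management module and a configuration parameter management module would then read and edit structures like this one.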