Cho, Seung Woo, et al. "Investigating Temporal and Spatial Trends of Brand Images Using Twitter Opinion Mining." Information Science and Applications (ICISA), 2014 International Conference on. IEEE, 2014.
3. Introduction
Twitter Crawling
Data Pre-processing
Korean Morphology Analysis
Twitter Opinion Mining
Sentiment Dictionary
Evaluating performance of candidate classifiers
Sentiment Classification
Visualize Associative Relationship of Terms
Relationship with Brand Index
4. Twitter Crawling
Twitter API
Streaming API: get 1% of all Twitter data in real time
REST API (Search API): get Twitter data matching a keyword
Collection period: 2013.9.9 (Mon) 9:35 pm ~ present
About 10,000 ~ 15,000 tweets per day
Total 1,220,000 tweets (as of 2013.11.2, Sat)
5. Data Pre-Processing
Keep only tweets that contain more than 3 Korean characters and were posted within a 500 km radius of Seoul, Korea.
Remove foreign-language text and special characters.
Remove tweets which only contain location information.
Remove retweets.
ويتكلم نهائيا السمع فقد متعب ابو الملك ان خبر اكد المستوى رفيع وامير موثوق صدر
مفهوم وغير مترابط غير كالم((تخريف::)) Sat Oct 12 00:06:37 KST 2013
I'm at Club ELLUI - @ellui_seoul (서울특별시) w/ 2 others http://t.co/zhcrncosKH::Sat Oct 12 00:02:06 KST 2013
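The filtering rules on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the Seoul coordinates, the "RT @" retweet marker, the "I'm at " check-in marker, and the helper names `keep_tweet` and `clean_text` are all assumptions made for the example.

```python
import math
import re

SEOUL = (37.5665, 126.9780)              # assumed (lat, lon) of central Seoul
HANGUL = re.compile(r"[\uac00-\ud7a3]")  # precomposed Hangul syllables


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def keep_tweet(text, coords):
    """Apply the slide's rules: no retweets, no location-only tweets,
    more than 3 Korean characters, within 500 km of Seoul."""
    if text.startswith("RT @"):       # assumed retweet marker
        return False
    if text.startswith("I'm at "):    # assumed check-in (location-only) marker
        return False
    if len(HANGUL.findall(text)) <= 3:
        return False
    return haversine_km(coords, SEOUL) <= 500.0


def clean_text(text):
    """Drop URLs, then everything except Hangul syllables, digits and spaces."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^\uac00-\ud7a3 0-9]", " ", text)
    return " ".join(text.split())
```

With these assumptions, the check-in tweet shown above is rejected by the "I'm at " rule, and the Arabic tweet is rejected because it contains no Korean characters.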
6. Korean Morpheme Analyzer
꼬꼬마 Korean Morpheme Analyzer
한나눔 Korean Morpheme Analyzer
Komoran Korean Morpheme Analyzer
Lucene Korean Analyzer
은전한닢 Korean Morpheme Analyzer
Performance of the analyzer
Foreign language and slang tagging
Sentiment-related word tagging (slang, verb, emoticon)
It has a good dictionary
No need to worry about word spacing
But it cannot recognize many emoticons, or metaphor, sarcasm, and irony.
7. Korean Morpheme Analyzer
> 배가 아파서 병원에 갔다.
배 NN,F,배,*,*,*,*,*
가 JKS,F,가,*,*,*,*,*
아파서 VA+EC,F,아파서,Inflect,VA,EC,아프/VA+ㅏ서/EC,*
병원 NN,T,병원,*,*,*,*,*
에 JKB,F,에,*,*,*,*,*
갔 VV+EP,T,갔,Inflect,VV,EP,가/VV+ㅏㅆ/EP,*
다 EF,F,다,*,*,*,*,*
. SF,*,*,*,*,*,*,*
EOS
Noun
Verb
Adjective
Adverb
Root
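As a rough illustration of how the analyzer output above could be turned into content words, the sketch below parses the text shown on this slide. It does not call the analyzer itself; the tag-to-label mapping (`CONTENT_TAGS`) and the function names are assumptions for the example, and the lemma of an inflected form is taken from the decomposition field (for example 아프/VA+ㅏ서/EC).

```python
SAMPLE = """\
배 NN,F,배,*,*,*,*,*
가 JKS,F,가,*,*,*,*,*
아파서 VA+EC,F,아파서,Inflect,VA,EC,아프/VA+ㅏ서/EC,*
병원 NN,T,병원,*,*,*,*,*
에 JKB,F,에,*,*,*,*,*
갔 VV+EP,T,갔,Inflect,VV,EP,가/VV+ㅏㅆ/EP,*
다 EF,F,다,*,*,*,*,*
. SF,*,*,*,*,*,*,*
EOS"""

# Assumed mapping from the first two letters of a tag to the slide's labels.
CONTENT_TAGS = {"NN": "Noun", "VV": "Verb", "VA": "Adjective",
                "MA": "Adverb", "XR": "Root"}


def parse(output):
    """Return (surface, tag, lemma) for every token line in the output."""
    tokens = []
    for line in output.splitlines():
        line = line.strip()
        if not line or line == "EOS":
            continue
        surface, features = line.split(None, 1)
        fields = features.split(",")
        lemma = surface
        # Inflected forms carry their decomposition in the 7th field.
        if len(fields) > 6 and fields[3] == "Inflect":
            lemma = fields[6].split("+")[0].split("/")[0]
        tokens.append((surface, fields[0], lemma))
    return tokens


def content_words(tokens):
    """Keep only tokens whose leading tag maps to a content-word label."""
    result = []
    for _surface, tag, lemma in tokens:
        label = CONTENT_TAGS.get(tag.split("+")[0][:2])
        if label:
            result.append((lemma, label))
    return result


print(content_words(parse(SAMPLE)))
```

For the sample sentence this keeps 배 and 병원 as nouns, 아프 as an adjective, and 가 as a verb, and drops particles, endings, and punctuation.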
8. Building Sentiment Dictionary
1. Manually labeled Twitter data
• 6 days of Twitter data (2013.9.9, 9.16, 9.23, 9.30, 10.7, 10.14)
• Labeled positive and negative sets of Noun, Adjective, Verb, Root (total 8 sets)
• Labeled by 4 people
2. Naver Movie review data with rating
• 20,000 reviews from 2 movies
• 545 positive set, 545 negative set, 545 neutral set
[Figure: review counts by rating (1 to 10) for Movie 1 and Movie 2; legend: Positive, Negative]
9. Sentiment Classification
SVM Classifier
1. Training set - 150 positive set, 150 negative set (Twitter data)
2. Test set – 545 positive set, 545 negative set (Movie review data)
Accuracy = 70.64220183486239% (770/1090) (classification)
Mean squared error = 1.1743119266055047 (regression)
Squared correlation coefficient = 0.18400994471523438 (regression)
Naïve Bayes Classifier
SO-PMI Classifier
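The SVM figures reported above can be cross-checked with simple arithmetic. The sketch below assumes the two classes were encoded as +1 and -1 (an assumption, not stated on the slide); under that encoding every misclassified review contributes a squared error of 4, which reproduces the reported mean squared error.

```python
def accuracy_percent(correct, total):
    """Classification accuracy as a percentage."""
    return 100.0 * correct / total


def mse_for_pm1_labels(correct, total):
    """Mean squared error when labels are +1/-1 (assumed encoding):
    each wrong prediction is off by 2, so it contributes 2**2 = 4."""
    wrong = total - correct
    return 4.0 * wrong / total


correct, total = 770, 1090  # counts reported on the slide
print(accuracy_percent(correct, total))
print(mse_for_pm1_labels(correct, total))
```

The "regression" lines in the output are therefore a restatement of the same 320 errors rather than a separate experiment.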
10. Building Sentiment Dictionary
Unlabeled & labeled data set
-> Ternary classifier: Naïve Bayes, SO-PMI, SVM
-> Positive set, Negative set, Neutral set (one triple per classifier: SO-PMI, SVM, Naïve Bayes)
11. Sentiment of Brand Index
Brand (keyword): Samsung Galaxy S2
Related nouns (attribute): Battery, LCD, Price, …
Terms correlated with each attribute: Adjective, Verb, Noun, Adverb, … (e.g. good, nice)
Positive words (pword): nice, pretty, lovely, …
Negative words (nword): bad, terrible, …
Determining objectivity: PMI(word, pword) + PMI(word, nword)
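Rise and expansion of SNS (social network services) -> emergence of personal big data
Data mining using big data comes to the fore
Twitterology: the emergence of a new field of study
- A coined term meaning "the study of Twitter"
- Formed from Twitter, a social network service (SNS), plus the suffix -logy, meaning a field of study
- Twitter's real-time information feeds research in sociology, economics, medicine, linguistics, and other fields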
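Implemented the Streaming API (real time) and the REST API (420 calls per 15 minutes; one request every 15 minutes returns 420 items) using the Twitter4J library
Only 1% of all data can be received (Seung Woo presents)
September 9, 9:35 pm ~ still ongoing
On average 10,000 ~ 15,000 tweets per day
As of 2013.11.2, 1,220,000 tweets accumulated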
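Tweets with 3 or fewer Korean characters are not collected (special characters, English, and Japanese are all removed)
Remove location information, imap, and similar information
Collect data within a 500 km radius of Seoul (tweets from all over the world come in; this keeps only tweets from Korea)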
은전한닢 morpheme analyzer
Integrated with Java on Linux
1. Training set
- Positive: results of a DB search for '좋' ("good"), 150 of these
- Negative: results of a DB search for '싫' ("dislike"), 150 of these
2. Test set
- Positive: 545 movie reviews
- Negative: 545 movie reviews
Result when movie reviews that matched nothing in the dictionary are also included:
optimization finished, #iter = 73
nu = 0.16326140616206591
obj = -32.23746306073249, rho = 0.11723225832508417
nSV = 61, nBSV = 38
Total nSV = 61
Accuracy = 70.64220183486239% (770/1090) (classification)
Mean squared error = 1.1743119266055047 (regression)
Squared correlation coefficient = 0.18400994471523438 (regression)
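PMI(word1, word2) = log2( p(word1 & word2) / (p(word1) p(word2)) )
p(word1 & word2) is the probability that word1 and word2 co-occur.
The ratio p(word1 & word2) / (p(word1) p(word2)) measures the degree of statistical dependence between the words.
The log of the ratio corresponds to a form of correlation.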
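To make the PMI notes concrete, here is a minimal sketch that estimates pointwise mutual information, log2 of p(word1 & word2) divided by p(word1) p(word2), from document-level co-occurrence counts. The toy corpus, the smoothing constant `eps`, and all function names are assumptions for illustration. `slide_score` follows the expression on the brand-index slide, PMI(word, pword) + PMI(word, nword); `so_pmi` is the usual difference form of SO-PMI and is included only for comparison, since the slides do not spell it out.

```python
import math
from collections import Counter
from itertools import combinations


def build_counts(docs):
    """Count in how many documents each word and each word pair occurs."""
    word_df, pair_df = Counter(), Counter()
    for doc in docs:
        words = set(doc)
        word_df.update(words)
        pair_df.update(frozenset(p) for p in combinations(sorted(words), 2))
    return word_df, pair_df, len(docs)


def pmi(w1, w2, word_df, pair_df, n, eps=0.01):
    """log2 of p(w1 & w2) / (p(w1) * p(w2)); eps smooths zero co-occurrence."""
    if word_df[w1] == 0 or word_df[w2] == 0:
        return 0.0  # unseen word: no evidence either way (assumed convention)
    p12 = (pair_df[frozenset((w1, w2))] + eps) / n
    return math.log2(p12 / ((word_df[w1] / n) * (word_df[w2] / n)))


def slide_score(word, pword, nword, counts):
    """PMI(word, pword) + PMI(word, nword), as written on the slide."""
    return pmi(word, pword, *counts) + pmi(word, nword, *counts)


def so_pmi(word, pword, nword, counts):
    """Difference form of SO-PMI (assumed here, not shown on the slides)."""
    return pmi(word, pword, *counts) - pmi(word, nword, *counts)


# Hypothetical toy corpus: each inner list is one tokenized tweet.
docs = [["battery", "good"], ["battery", "good"],
        ["battery", "bad"], ["price", "bad"]]
counts = build_counts(docs)
print(so_pmi("battery", "good", "bad", counts))  # positive: leans "good"
print(so_pmi("price", "good", "bad", counts))    # negative: leans "bad"
```

In this toy corpus "battery" co-occurs with "good" more often than chance, so its difference score is positive, while "price" only co-occurs with "bad" and scores negative.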