Cyworld Jeju 2009 Conference(10 Aug2009)No2(2)


  1. Automatic discovery of emotion-based communication on social networking sites of South Korean politicians: Cyworld comments
     Steven Sams
     WCU WeboMatrix, Yeungnam University,
     214-1, Dae-dong, Gyeongsan-si, Gyeongsangbuk-do, South Korea, 712-749
     steven.sams@brunel.ac.uk
     Wojciech Gryc
     Computing Laboratory, Oxford University, Wolfson Building,
     Parks Road, Oxford OX1 3QD, UK
     wojciech.gryc@trinity.ox.ac.uk
     Han Woo Park (corresponding author)
     Dept of Media & Communication, Yeungnam University,
     214-1, Dae-dong, Gyeongsan-si, Gyeongsangbuk-do, South Korea, 712-749
     hanpark@ynu.ac.kr
     A paper submitted for possible presentation to the 2nd International Conference on u- and e- Service, Science and Technology (UNESST), December 10~12, 2009, International Convention Center Jeju, Jeju Island, Korea. http://www.sersc.org/UNESST2009/
  2. Abstract
     This article examines users' textual comments on the social networking sites of politicians.
     South Korean politicians were identified, and all comments left for them by other users within a specified timeframe were gathered.
     The techniques employed in this research were particularly useful in automatically classifying the sentiments expressed by citizens.
     Key words
     Cyworld
     South Korea
     political communication
     social networking sites
     sentiment analysis
     e-research
  3. Contents
  4. 1 Introduction
     Social media offer a new way of communicating and give politicians an efficient means of communicating with citizens.
     Therefore, we aim to develop software that can capture the massive volume of political communication data generated by social networking sites and citizens' social media sites, specifically Cyworld.
  5. [Annotated screenshot of a Cyworld Mini-Homepage]
     Labels: title of the Mini-Homepage; visitor counts; main menu; favorite menu; condition of the host; status of the Mini-Homepage (① how active, ② how famous, ③ how friendly); Mini room (editable by the host); basic information of the host
     Link: Cyworld Mini-Homepage (Geun-Hye Park)
  6. 2 Related Studies
     1. Automatic tracking of online political communication (a representative exercise has been conducted by Canada's Infoscape Lab).
     2. Sentiment analysis of social networking sites.
     These tools have been applied only to English, and gender differences in political user-generated feedback have not been fully explored.
     This presents an opportunity to examine South Korean politicians' social networking sites and observe citizens' behavior.
  7. 3 Data collection
     The URLs of opposition and ruling-party South Korean politicians who owned social networking sites on Cyworld, as of 21st May 2009, were located.
     The comments posted on their Cyworld pages between 1st April 2008 (the month of the National Assembly member election) and 14th June 2009 were automatically collected.
  8. 3 Data collection
     Only the gender of each poster and the comment text were collected, in response to criticism concerning the lack of anonymity and the absence of clear willingness to participate.
     The data relating to these users was removed from the study. Of the 90 political profiles, 81 were successfully scraped; the other nine had strict privacy settings or contained no posts.
  9. Table 1. Summary of comments posted on ten political profile pages between April 2008 and June 2009.
     One politician was selected at random from the eighty-one successfully scraped political profiles, and the comments posted by male and female users were taken as the dataset.
  10. 3 Data collection
      From this data, two hundred random comments were taken and each was assigned to one of three possible categories.
      All posts were labelled by one individual, so reliability metrics are not available (this is discussed in Further work).
      The number of posts in each category is included in Table 1.
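The random draw of two hundred comments described on this slide could look roughly like the sketch below. It is an illustration only: the function name, the fixed seed, and the placeholder comment list are assumptions, not details taken from the study.

```python
import random

def sample_comments(comments, k=200, seed=0):
    """Draw k comments uniformly at random, without replacement."""
    # The fixed seed is an assumption, used here only to make the draw repeatable.
    rng = random.Random(seed)
    return rng.sample(comments, k)

# Placeholder data standing in for the scraped comments.
comments = [f"comment {i}" for i in range(1000)]
sample = sample_comments(comments)
```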
  11. 4 Machine learning method
      Naïve Bayes multinomial models and support vector machines were used to build the classifiers, combined with voting and stacking frameworks.
      (A "bag of words" approach was used.)
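A "bag of words" representation reduces each comment to word counts and discards word order. The sketch below is a minimal illustration; the whitespace tokenizer and the sample comments are assumptions, since the slides do not say how the Korean comments were tokenized.

```python
from collections import Counter

def bag_of_words(comment):
    """Map a comment to its word counts, ignoring word order."""
    # Lowercasing and whitespace splitting are assumptions; Korean text
    # would normally need a dedicated tokenizer.
    return Counter(comment.lower().split())

comments = ["great speech great effort", "bad policy"]  # made-up examples
vectors = [bag_of_words(c) for c in comments]
```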
  12. 4 Machine learning method
      The first approach used was a naïve Bayes multinomial model.
      A support vector machine using a polynomial kernel was also used in the classification task.
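To make the two models on this slide concrete, here is a small pure-Python sketch of a multinomial naïve Bayes classifier and of a polynomial kernel function. It is not the authors' implementation: the Laplace smoothing constant `alpha`, the kernel parameters `degree` and `coef0`, and the toy training comments are all assumptions.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit log priors and Laplace-smoothed log word likelihoods per class."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(docs))
        log_like = {w: math.log((word_counts[label][w] + alpha) / total)
                    for w in vocab}
        model[label] = (log_prior, log_like)
    return model

def predict_nb(model, doc):
    """Return the class with the highest posterior log-probability."""
    def score(label):
        log_prior, log_like = model[label]
        # Words outside the training vocabulary are skipped.
        return log_prior + sum(log_like[w] for w in doc if w in log_like)
    return max(model, key=score)

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree on sparse count dicts."""
    dot = sum(v * y.get(w, 0) for w, v in x.items())
    return (dot + coef0) ** degree

# Made-up, pre-tokenized training comments.
docs = [["great", "leader"], ["great", "speech"],
        ["bad", "policy"], ["bad", "leader"]]
labels = ["positive", "positive", "negative", "negative"]
model = train_multinomial_nb(docs, labels)
```

A support vector machine would use `polynomial_kernel` in place of a plain dot product when comparing two bag-of-words vectors; training the SVM itself is omitted here.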
  13. 4 Machine learning method
      The NB model and support vector machine were also combined using a voting system. Each model received one vote, with a third vote being automatically assigned to the class with the largest population.
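The voting rule described on this slide can be sketched as follows. The function name and the tie-breaking behaviour are assumptions: the slide does not say what happens when the three votes go to three different classes, so this sketch falls back to the largest class.

```python
from collections import Counter

def vote(nb_label, svm_label, largest_class):
    """Majority vote of NB, SVM, and a fixed vote for the largest class."""
    counts = Counter([nb_label, svm_label, largest_class])
    winner, n_votes = counts.most_common(1)[0]
    # A 1-1-1 split is possible with three classes; resolving it in favour
    # of the largest class is an assumption, not stated in the source.
    return winner if n_votes >= 2 else largest_class
```

Under this rule the two models override the fixed vote only when they agree with each other, which is one plausible reason a voting classifier of this kind would lean toward the largest class.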
  14. 4 Machine learning method
      A final approach was the use of a stacking framework. In this case, the outputs of the models are used as input variables to a logistic regression model.
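A stacking framework of this kind can be illustrated with a small logistic regression trained on the base models' outputs. The sketch below simplifies the three-class task to a binary one (positive or not), and the base-model scores, learning rate, and number of epochs are all made-up assumptions rather than values from the study.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, targets, lr=0.5, epochs=2000):
    """Batch gradient descent for binary logistic regression."""
    weights = [0.0] * (len(rows[0]) + 1)  # the last weight is the bias
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for row, target in zip(rows, targets):
            x = list(row) + [1.0]
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - target
            for i, xi in enumerate(x):
                grads[i] += err * xi
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grads)]
    return weights

def predict_stacked(weights, row):
    """Probability of the positive class given the base-model outputs."""
    x = list(row) + [1.0]
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

# Hypothetical (NB score, SVM score) pairs that a comment is positive.
base_outputs = [(0.9, 0.8), (0.7, 0.9), (0.2, 0.4),
                (0.1, 0.3), (0.8, 0.6), (0.3, 0.2)]
is_positive = [1, 1, 0, 0, 1, 0]
weights = fit_logistic(base_outputs, is_positive)
```

The logistic regression learns how much to trust each base model, instead of giving them equal votes.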
  15. 5 Feature analysis
      The features used as inputs to the machine learning algorithms consisted of counts of all words that appear two or more times in either the male or the female set of posts. This resulted in a feature list of 484 words. Table 2 shows the words that appear to be the most significant features, as determined by a χ² test for significance of features.
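The feature selection rule and a χ² score for a single word can be sketched as below. The function names are hypothetical, and the 2x2 layout used for the χ² statistic (word present or absent against post inside or outside a class) is an assumption, since the slide does not specify how the test was set up.

```python
from collections import Counter

def select_features(male_posts, female_posts, min_count=2):
    """Keep words seen at least min_count times in either set of posts."""
    male = Counter(w for post in male_posts for w in post)
    female = Counter(w for post in female_posts for w in post)
    return sorted(w for w in set(male) | set(female)
                  if male[w] >= min_count or female[w] >= min_count)

def chi_square(a, b, c, d):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]].

    a: posts in the class that contain the word
    b: posts outside the class that contain the word
    c: posts in the class without the word
    d: posts outside the class without the word
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0
```

A word that appears in every post of one class and in none of the others gets a high score, while a word spread evenly across classes scores zero.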
  16. 6 Categorization Results
      The four algorithms were trained on the data set, and the results are all very similar. (Note: the "Largest Class" row represents the accuracy of the model if all comments were labelled as positive.)
      All four algorithms outperform this baseline, and do so at a statistically significant level (p = 0.05).
      Interestingly, all four algorithms have fairly similar accuracies.
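The "Largest Class" baseline mentioned in the note can be computed as in this short sketch. The function name is hypothetical and the labels in the usage example are made up.

```python
from collections import Counter

def largest_class_accuracy(labels):
    """Accuracy of predicting the most frequent label for every comment."""
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)

# Made-up labels: 6 positive, 3 negative, 1 irrelevant.
example = ["positive"] * 6 + ["negative"] * 3 + ["irrelevant"]
baseline = largest_class_accuracy(example)
```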
  17. 6 Categorization Results
      However, they have different strengths.
      While the accuracies of the voting classifier (Table 4) and the stacked classifier (Table 5) are similar, the voting classifier appears to be biased in favour of positive classifications, while the stacked classifier tends to label more posts as negative.
  18. Table 4. Confusion matrix for the voting classifier (axes: Predicted, Actual).
      Table 5. Confusion matrix for the stacked classifier (axes: Predicted, Actual).
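A confusion matrix of the kind shown in Tables 4 and 5 can be built as in the sketch below. The row and column orientation (rows for actual classes, columns for predicted classes) and the class names are assumptions made for the illustration.

```python
from collections import Counter

CLASSES = ["positive", "negative", "irrelevant"]

def confusion_matrix(actual, predicted, classes=CLASSES):
    """Rows are actual classes, columns are predicted classes."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in classes] for a in classes]

# Made-up labels for four comments.
actual = ["positive", "positive", "negative", "irrelevant"]
predicted = ["positive", "negative", "negative", "positive"]
matrix = confusion_matrix(actual, predicted)
```

A classifier biased toward one class shows up as a column whose off-diagonal entries are large.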
  19. 6 Categorization Results
      There are three curves, one per class, because each curve represents the model's ability to label a comment as inside or outside a specific category.
      The class with the highest estimated probability is the one assigned to the comment.
  20. 6 Categorization Results
      This means the classification algorithm creates three sub-classifiers (favourable or not, unfavourable or not, irrelevant or not).
      The ROC curves show how accurate the estimated probabilities are. Each point on a ROC curve shows how many false positives and true positives occur for a specific probability threshold.
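The per-class ROC points and the final class assignment described above can be sketched as follows. The function names, scores, and thresholds are made up for the illustration; the points are expressed as rates (false positive rate, true positive rate), which is the usual ROC convention.

```python
def roc_points(scores, is_member, thresholds):
    """(false positive rate, true positive rate) for each threshold."""
    pos = sum(is_member)
    neg = len(is_member) - pos
    points = []
    for t in thresholds:
        # A comment is predicted to be in the class when its score >= t.
        tp = sum(1 for s, m in zip(scores, is_member) if s >= t and m)
        fp = sum(1 for s, m in zip(scores, is_member) if s >= t and not m)
        points.append((fp / neg, tp / pos))
    return points

def assign_class(probabilities):
    """Pick the class whose sub-classifier gives the highest probability."""
    return max(probabilities, key=probabilities.get)

# Made-up scores from one sub-classifier, e.g. "favourable or not".
scores = [0.9, 0.6, 0.7, 0.2]
is_member = [True, True, False, False]
points = roc_points(scores, is_member, [0.5, 0.65, 0.95])
```

Running `roc_points` once per sub-classifier gives the three curves, one for each class.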
  21. 6 Categorization Results
      Figure 1 shows the ROC curves for the three classes. It shows that the classifiers tend to do well on all three classes.
      Overall, it is useful to see that the algorithms are able to discern between irrelevant, positive and negative posts. While accuracies can still be improved, the results are very encouraging.
  22. Fig. 1 Receiver Operating Characteristic (ROC) curves for the stacked classifier, for each class in the dataset.
  23. 7 Further work
      It would be interesting to apply categorization methods using natural language processing (NLP) techniques in the study to see how knowledge of the grammatical structure of a post could help with the labelling process. Furthermore, information on Cyworld participants, such as location and political affiliation, was not used in categorizing posts.
  24. 7 Further work
      The authors are also planning to expand the study to multiple labellers to help understand how difficult and reliable human labelling actually is.
      However, the results above illustrate that labelling posts by sentiment is not an intractable problem and that useful machine learning approaches exist.
  25. 8 Conclusion
      The development of these tools provides an efficient means to study emotion-based political communication and addresses previous criticism of the lack of anonymity. One of the many advantages of this technique is that it enables sentiment analysis of user-generated feedback on political social networks.
  26. 8 Conclusion
      These tools could be applied to social network analysis in nations where political social networks are established, or extended beyond the analysis of political subjects to explore the habits of the online general public.
      Acknowledgments. The corresponding author acknowledges that this research is supported by the WCU project (Investigating an internet-based politics using e-research tools) granted by the South Korean Government.
