Semantic Web Based Sentiment Engine

An imperfect look at possible applications of a Web Based Sentiment Engine. MECB 2012.

Sentiment analysis involves classifying opinions from text as "positive", "negative" or "neutral". Its purpose and benefit is to assist in extracting valuable information and insight from copious amounts of unstructured data. The proposed system will be able to determine online sentiment on current affairs for the purpose of analysis and prediction. For the sentiment analysis a clustering-based approach is recommended, which is a recent advancement in this area. Various APIs will assist in extracting other data such as location and time. Evaluation using the Pang et al. movie review data sets is recommended to validate basic functionality, followed by real-life data from the 2008 US presidential race to evaluate the full functionality of the system. Multiple industries are identified as potential users of this system, from marketing companies to hotels, which adds to its commercialisation potential.

Published in: Technology, Education

  1. 1. CA652A Semantic Web Based Sentiment Engine A system to determine online sentiment on current affairs for the purpose of analysis and prediction 11210889 52595354 CA652A
  2. 2. ABSTRACT
     Sentiment analysis involves classifying opinions from text as "positive", "negative" or "neutral". Its purpose and benefit is to assist in extracting valuable information and insight from copious amounts of unstructured data. The proposed system will be able to determine online sentiment on current affairs for the purpose of analysis and prediction. For the sentiment analysis a clustering-based approach is recommended, which is a recent advancement in this area. Various APIs will assist in extracting other data such as location and time. Evaluation using the Pang et al. movie review data sets is recommended to validate basic functionality, followed by real-life data from the 2008 US presidential race to evaluate the full functionality of the system. Multiple industries are identified as potential users of this system, from marketing companies to hotels, which adds to its commercialisation potential.
  3. 3. A report submitted to Dublin City University, School of Computing for module CA652: Information Access, 2011/2012.
     We hereby certify that the work presented and the material contained herein is my/our own except where explicitly stated references to other material are made.
     Student Numbers: 52595354, 11210889
  4. 4. TABLE OF CONTENTS
     Abstract
     Introduction
     Concept Overview
       Constraints and Limitations
     Functional Description
       Sentiment Search Functions
       Techniques
         Time Parameter Based Search
         Geographical Extraction Based
         Social Sentiment Extraction Based Data
         Graphical Data Generation Tools
       Pros & Cons of Proposed System
     Evaluation Plan
       Stage One Testing - Validation
       Stage Two Testing – Functionality Testing
       Stage Three Testing – Real Life Data
     Commercialisation Potential
     Conclusion and Further Research Opportunities
     References
  5. 5. TABLE OF FIGURES
     Figure 1 - Sentiment Analysis Framework
     Figure 2 - Cluster Method Accuracy/Efficiency
     Figure 3 - Graphical Representation of Content
     Figure 4 - Basic Validation Testing Results
     Figure 5 - Two Topic Validation Testing
     Figure 6 - Sample Test Output (Obama)
     Figure 7 - Sample Test Data (McCain)
  6. 6. INTRODUCTION
     The ‘media’ as we now conceptualise it has changed dramatically. With the internet, people have an opportunity to ‘weigh in’ on events by providing their opinions and feedback in real time through blogs, forums, social networks and commenting systems on news websites. There is a growing interest in measuring sentiment that can be attributed to the dramatic increase in the volume of digitized information.

     “An increasing number of studies in political communication focus on the “sentiment” or “tone” of news content, political speeches, or advertisements” (Young, L, & Soroka, S 2012)

     This report discusses the concept of developing a Semantic Web based sentiment engine that will be able to analyse public sentiment on current issues, from politics to reality TV shows. By analysing and tracking popular opinion through social media channels and leveraging research in the area of sentiment analysis, accurate predictions could be made possible on events from presidential elections to the X-Factor competition.

     CONCEPT OVERVIEW
     This proposed system is not a standard sentiment engine that returns static data; it offers increased functionality to assist with data interpretation by allowing end users to customise their search, filter the returned data under multiple parameters and view graphical representations of the results.

     CONSTRAINTS AND LIMITATIONS
     The limitations of this concept are not due to technological constraints but are simply down to the volatility of public opinion, which is something that cannot be remedied or corrected by technology.

     Another limitation is the scope of the opinion being captured. Users of social media and participants in online forums are statistically of a younger age group. The lack of inclusion of the opinion of older age groups could greatly affect the accuracy
  7. 7. of the data as it would not be entirely representative. This imbalance would particularly affect politics, with older groups statistically more likely to vote.

     FUNCTIONAL DESCRIPTION
     SENTIMENT SEARCH FUNCTIONS
       • Users can enter multiple search terms for the purpose of data comparison. Other features would be utilised to improve the analysis returns.
       • Multiple Search Parameters
           o Time Frame Defined Search - Data retrieved can be limited to a specific time frame.
           o Geographical Location Based Search – Search data retrieved can be filtered by location of users.
           o Narrow Search Scope – Select websites to exclude, or restrict the search to a small number of websites.
       • Graphical representations of the data are generated.

     TECHNIQUES
     Sentiment Analysis Techniques
     There is much research in the area of sentiment analysis, the primary objective being to find a technique where there is no trade-off between speed and accuracy. Several new and emerging techniques have been researched as part of identifying the best fit for this system.
       • Proximity-Based Approach (Hasan, S, & Adjeroh, D 2011)
           o This proposed method uses proximity-based features to determine sentiment: proximity distribution, mutual information between proximity types, and proximity patterns.
  8. 8.   • Based on Annotation (Shukla, A 2011)
           o This proposed method counts all the annotations present and calculates sentiment scores for all annotations, including comments, to determine sentiment.
       • Sentence-level Lexical Based Semantic Orientation (Khan, A et al, 2011)
           o This proposed method uses SentiWordNet to calculate the semantic ‘score’ of sentences it has classified as subjective from reviews and blog comments.
       • Machine Learning approach to contextual information (Yang, C et al, 2008)
           o This proposed method differentiates itself from others by taking context into account when determining the sentiment category. Its primary focus and test data sets have been blog posts. Figure 1 shows the framework employed.

     FIGURE 1 - SENTIMENT ANALYSIS FRAMEWORK

       • Clustering-Based Sentiment Analysis Approach (Li, G, & Liu, F 2012)

     The method deemed most appropriate for this proposed system is based on an article published in the Journal Of Information Science in April 2012, which outlined the Clustering-Based Sentiment Analysis approach. It proposed that by applying a “TF-IDF weighting method, a voting mechanism and importing term scores, an acceptable and stable clustering result can be obtained” (Li, G, & Liu, F 2012). The evaluation results
  9. 9. were the most impressive of all techniques reviewed as part of this research. It appears to have performed well in terms of both accuracy and efficiency with no need for human participation, as can be seen from Figure 2.

     FIGURE 2 - CLUSTER METHOD ACCURACY/EFFICIENCY

     Apart from its accuracy and efficiency, this technique was deemed the most suitable as it can be applied universally to any data set. The other techniques researched were developed for particular data types, such as customer reviews or blogs, and their evaluation appraisals appear to suggest they do not perform as well outside of these data types.

     TIME PARAMETER BASED SEARCH
     This sentiment engine would make use of the adaptable Librato API libraries to allow sentiment returns to be time sensitive. This would allow a user to evaluate how sentiment is changing over time or what sentiment was during specific time periods.

     GEOGRAPHICAL EXTRACTION BASED
     Adding a geographical element would be a unique feature allowing for mapping of sentiment results. Preferred location content will be pulled from the Twitter API as it gives access to Twitter profile location. Comment systems used by news websites, such as the Irish Times website, request a location prior to posting the comment. The Facebook API allows access to the location of a user if the privacy setting is turned on. OAuth would be used to allow the users of the sentiment engine to explore the opinions of their friends and networked associates and how they would fit on the sentiment scales. Other free-to-use location APIs may also be needed.
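The clustering-based approach described above combines TF-IDF weighting with a voting mechanism. As a rough illustration only, and not Li and Liu's actual implementation, the sketch below weights whitespace-split terms by TF-IDF, runs a simple two-cluster k-means several times from different seeded starting points, and lets the runs vote on each document's cluster. The function names (`tfidf_vectors`, `two_means`, `vote_clusters`), the smoothed IDF formula and the number of runs are all assumptions made for this example. The step that imports term scores to decide which cluster is the positive one is not shown.

```python
import math
import random


def tfidf_vectors(docs):
    """Turn each document into an L2-normalised {term: TF-IDF weight} dict."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    doc_freq = {}
    for tokens in tokenised:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    vectors = []
    for tokens in tokenised:
        vec = {}
        for term in tokens:
            vec[term] = vec.get(term, 0.0) + 1.0 / len(tokens)  # term frequency
        for term in vec:
            # Smoothed inverse document frequency (an assumed variant).
            vec[term] *= math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def two_means(vectors, seed, iterations=10):
    """One k-means run with k=2, started from two randomly chosen documents."""
    rng = random.Random(seed)
    centroids = [dict(v) for v in rng.sample(vectors, 2)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [0 if dot(v, centroids[0]) >= dot(v, centroids[1]) else 1
                  for v in vectors]
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if not members:
                continue  # keep the previous centroid if a cluster empties
            total = {}
            for v in members:
                for term, w in v.items():
                    total[term] = total.get(term, 0.0) + w
            centroids[k] = {t: w / len(members) for t, w in total.items()}
    return labels


def vote_clusters(docs, runs=5):
    """Majority vote over several seeded runs; needs at least two documents."""
    vectors = tfidf_vectors(docs)
    reference = two_means(vectors, seed=0)
    votes = [0] * len(docs)
    for seed in range(runs):
        labels = two_means(vectors, seed=seed)
        # Cluster ids are arbitrary per run, so align them with the reference.
        agreement = sum(a == b for a, b in zip(labels, reference))
        if agreement * 2 < len(docs):
            labels = [1 - lab for lab in labels]
        for i, lab in enumerate(labels):
            votes[i] += lab
    return [1 if v * 2 > runs else 0 for v in votes]
```

Using an odd number of runs avoids tied votes. The returned labels only say which documents group together; naming a group positive or negative would need the term scores mentioned in the quoted description.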
  10. 10. SOCIAL SENTIMENT EXTRACTION BASED DATA
     The content used to create a matrix of information to evaluate sentiment within, via FLP, would likely include but not be limited to: Twitter; Disqus; Livefyre; IntenseDebate; Drupal comments; WordPress comments; other blog posts; scraped open Facebook and fan page comments; the Facebook comment system; text comments; G+ posts; Slideshare.net; Pinterest pins; Google News articles; comments on various bookmarking sites like fark.com and reddit; and other language-relevant wire news services.

     GRAPHICAL DATA GENERATION TOOLS
     Graphical representations of the data are generated. The results could be rendered as web-based Flash objects or in a way that is compliant with the evolving HTML5 standards. Given the animosity between Apple and Adobe over Flash, they would need to be iOS 5 compliant for results to be useful on mobile devices and tablets. These reports would be exportable to Crystal Reports.

     FIGURE 3 - GRAPHICAL REPRESENTATION OF CONTENT
     [bar chart: Positive, Neutral and Negative counts for Candidate A and Candidate B]

     PROS & CONS OF PROPOSED SYSTEM
     The primary argument for why sentiment engines via the Semantic Web and linked data are useful is based upon the new information and insight that can be gleaned from them. The ability to know relative and positional sentiment can be useful in many analytical or informational arbitrage situations.
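To make the link between classified content and a chart like Figure 3 more concrete, the following sketch tallies already-classified posts into per-topic counts of positive, neutral and negative items, optionally restricted to a time window as in the time frame defined search. The `ClassifiedPost` record and the `chart_series` function are hypothetical names invented for this example; the report does not specify a data model.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Optional


@dataclass
class ClassifiedPost:
    """Hypothetical record for one item the engine has already classified."""
    topic: str            # e.g. a search term such as "Candidate A"
    sentiment: str        # "positive", "neutral" or "negative"
    posted_at: datetime   # when the item was published


def chart_series(posts: Iterable[ClassifiedPost],
                 start: Optional[datetime] = None,
                 end: Optional[datetime] = None) -> Dict[str, Counter]:
    """Count sentiments per topic, keeping posts with start <= posted_at < end."""
    series: Dict[str, Counter] = {}
    for post in posts:
        if start is not None and post.posted_at < start:
            continue
        if end is not None and post.posted_at >= end:
            continue
        series.setdefault(post.topic, Counter())[post.sentiment] += 1
    return series
```

Each topic's `Counter` maps directly onto one group of bars, so a charting layer would only need to iterate over the returned dictionary.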
  11. 11. In terms of the cons, the primary concern would be data quality. Problems with data quality are a huge issue and can skew any resulting analysis. The extent of the data quality problem has often been discovered by information activists working in the open data movement.

     Secondly, privacy concerns, and staying within the spirit and letter of the relevant data privacy laws of the regulatory regime you operate under, may at times be an issue. This can be tricky given the interconnected nature of the web.

     Lastly, inaccuracies in the data, and it being organised in “short sets” rather than deeper data, may create false sentiments. Is there enough data being looked at to create a realistic positive or negative sentiment? Some analysis may need additional parsing to tease out, for example, initial heated emotional responses from the rational morning-after response.

     EVALUATION PLAN
     STAGE ONE TESTING - VALIDATION
     The evaluation plan would begin with simple software validation. The first test case would consist of validating the fundamental functionality of the system: its ability to differentiate between sentiments. The data set to be used is the movie review data from the Pang et al experiments [1]. Movie review data is widely regarded as the most challenging data for sentiment engines to analyse. This can be attributed to the fact that a positive review may contain descriptions of gory or violent scenes, and equally a negative review could contain descriptions of light-hearted, pleasant scenes. For additional testing, other data sets could be used for each iteration of this dynamic testing stage.

     [1] Pang B, Lee L, Vaithyanathan S. Thumbs up? Sentiment classification using machine learning techniques. In: Conference on empirical methods in natural language processing (EMNLP). Philadelphia, Pennsylvania, USA, 2002, p. 79.
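One way the stage one validation could be scored is to compare the engine's output with the labels that ship with a labelled review set. The sketch below is a minimal example under that assumption. The function name `validation_report` and the label strings are hypothetical, and the engine's predictions are assumed to be available as a plain list.

```python
from collections import Counter
from typing import List, Tuple


def validation_report(gold: List[str], predicted: List[str]) -> Tuple[float, Counter]:
    """Return overall accuracy and a confusion count keyed by (gold, predicted)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    confusion = Counter(zip(gold, predicted))
    correct = sum(n for (g, p), n in confusion.items() if g == p)
    accuracy = correct / len(gold) if gold else 0.0
    return accuracy, confusion
```

The confusion counts show which way the engine errs, for instance how often a positive review that describes violent scenes is marked negative, which is the difficulty noted above for movie review data.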
  12. 12. FIGURE 4 - BASIC VALIDATION TESTING RESULTS
     [pie chart: share of Neutral, Positive and Negative results]

     STAGE TWO TESTING – FUNCTIONALITY TESTING
     The second stage of testing would be the validation of the multiple input functionality, to ensure that data can be retrieved for two or more search terms and also that they can be accurately differentiated. The test case for this would be built on the first stage of testing with added content regarding a second movie etc.

     FIGURE 5 - TWO TOPIC VALIDATION TESTING
     [two pie charts, “Schindler’s List” and “The Usual Suspects”, each showing the share of Neutral, Positive and Negative results]

     STAGE THREE TESTING – REAL LIFE DATA
     The final stage of the evaluation plan would be to perform testing using previous high profile events as the test cases, such as the US Presidential Election of 2008 and
  13. 13. the X-Factor competition from previous years. This validation is more complex as it will span the entire internet, not just the staging website.

     The testing would be performed over different time intervals: days, weeks, months, and the entire duration of the event. In the case of political elections these time periods could be chosen to coincide with official opinion polls, for example Gallup and Rasmussen stateside or Red C for Irish based events.

     The geographical based sentiment analysis function would be tested to gauge the accuracy of the location results. In the case of the US Presidential Election, the final voting percentages for each candidate per state would give an accurate basis for comparison.

     SAMPLE EVALUATION TEST CASE
     The test case takes the ten states where each candidate won by the largest percentage majority and graphs the percentage of votes each candidate received, together with the percentage of positive, negative and neutral data regarding that candidate. What one would expect in a fully evaluated system would be a close correlation between positive data and the percentage of votes, and also a correlation between the negative or neutral data and the other candidate’s percentage of votes, as per the sample charts for Obama and McCain respectively.

     FIGURE 6 - SAMPLE TEST OUTPUT (OBAMA)
     [chart titled “Obama’s Data”; series: Obama’s Percentage of Votes, McCain’s Percentage of Votes, Positive %, Negative %, Neutral %]
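The "close correlation" expected in the sample test case could be quantified with a standard Pearson correlation coefficient between, say, a candidate's positive sentiment percentage and vote percentage across the selected states. The sketch below is one possible way to do that; the function name `pearson` and the choice of Pearson over other measures are assumptions, since the report does not name a statistic.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 for positive sentiment against a candidate's own vote share would support the expectation stated above, while a value near 0 would suggest the location or sentiment results are not tracking the vote.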
  14. 14. FIGURE 7 - SAMPLE TEST DATA (MCCAIN)
     [chart titled “McCain’s Data”; series: McCain’s Percentage of Votes, Obama’s Percentage of Votes, Positive %, Negative %, Neutral %]

     COMMERCIALISATION POTENTIAL
     In an era where both businesses and individuals are attempting to move further and further towards data-driven decisions, sentiment engine products have a range of commercial potential.

     Some companies have already begun commercialising Semantic Web applications. For example, IBM licensed its WebFountain Internet analytical engine to Factiva and Thomson Reuters in 2003 for those interested in corporate reputational data.

     There is also a market among those who cannot afford Enterprise Resource Planning (ERP) add-ons like SAP Business Objects, SAS, or LexisNexis Analytics, and for whom the currently available crop of free semantic sentiment engine tools is insufficient, too niche, or unscalable (Basu, 2010). Semantic Web products are becoming important in internal and external Business Informatics.

     However, information arbitrage is not merely for professional market traders. This system would likely be software as a service (SaaS) on the web. It could be sold on a freemium basis, or on a monthly subscription or yearly license, depending on the implementation.
  15. 15. Primary clients would depend on the sentiments needing to be parsed and the proprietary and public data sets being used within the sentiment engine. Examples include: corporate media; the content publishing industry; PR firms; polling and market research firms; trading platforms; political parties; elections; government agencies; security services; and bookmakers deciding odds on novelty bets such as reality TV shows and politics.

     CONCLUSION AND FURTHER RESEARCH OPPORTUNITIES
     Where does the Semantic Web lead to exactly? We don’t really know, but opening up the segregated data silos and making sense of deeper, dark ‘big data’ in pursuit of the benefits of a deeper rooted “hyperdata” would be a nice path. The road will be long, but it may improve our day to day lives immensely.

     "Many applications and services claim to be "semantic" in one manner or another, but that does not mean they are "Semantic Web." Semantic applications include any applications that can make sense of meaning, particularly in language such as unstructured text, or structured data in some cases. By this definition, all search engines today are somewhat "semantic" but few would qualify as "Semantic Web" apps." (Spivack, 2007)

     Getting from the early steps of Web 3.0 to this deeper data web will be a long process. It will provide countless benefits, many of which we may not even perceive today. However, sentiment engines are merely one way to get the public and the developer community interested in and excited about all the other benefits that this open data future could hold. For that reason sentiment engines will remain an important component in the near-term future, as “big data” holds much of the future promise of bringing about the “web of things” and making sense and use of it.
  16. 16. REFERENCES
     Abbasi, A, Chen, H, & Salem, A 2008, Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums, ACM Transactions On Information Systems, 26, 3, pp. 1-34, Computers & Applied Sciences Complete, viewed 4 May 2012.

     Basu, Saikat 2010. 10 Web Tools To Try Out Sentiment Search & Feel the Pulse. MakeUseOf [Online] 30 April. http://www.makeuseof.com/tag/10-web-tools-sentiment-search-feel-pulse/ [Accessed 1 May 2012]

     Bergman, Mike 2010. I Have Yet to Metadata I Didn’t Like. AI3 [Online] 16 August. http://www.mkbergman.com/902/i-have-yet-to-metadata-i-didnt-like/ [Accessed 1 May 2012]

     Bollen, J, Mao, H, & Zeng, X-J 2011. Twitter mood predicts the stock market. Journal of Computational Science, 2(1), pp. 1-8. Available from: http://arxiv.org/abs/1010.3003

     Cai, K, Spangler, S, Ying, C, & Li, Z 2010, Leveraging sentiment analysis for topic detection, Web Intelligence & Agent Systems, 8, 3, pp. 291-302, Academic Search Complete, viewed 20 April 2012.

     Dalton, Jeff 2007. Caffè Java Open Source NLP and Text Mining tools. Jeff’s Search Engine Caffè [Online] 16 March. http://www.searchenginecaffe.com/2007/03/java-open-source-text-mining-and.html [Accessed 1 May 2012]

     Hamouda, A, Marei, M, & Rohaim, M 2011, Building Machine Learning Based Senti-word Lexicon for Sentiment Analysis, Journal Of Advances In Information Technology, 2, 4, pp. 199-203, Library, Information Science & Technology Abstracts with Full Text, viewed 1 May 2012.

     Hasan, S, & Adjeroh, D 2011, Detecting Human Sentiment from Text using a Proximity-Based Approach, Journal Of Digital Information Management, 9, 5, pp. 206-212, Library, Information Science & Technology Abstracts with Full Text, viewed 7 May 2012.

     Kang, H, Yoo, S, & Han, D 2012, Senti-lexicon and improved Naïve Bayes algorithms for sentiment analysis of restaurant reviews, Expert Systems With Applications, 39, 5, pp. 6000-6010, Academic Search Complete, viewed 10 April 2012.

     Lévy, Pierre CRC, FRSC 2007. Elements of Semantic Engineering. I3 workshop / WWW Consortium Conference / Banff 2007. Available from: http://www.ieml.org/text/semantic_space.pdf

     Li, G, & Liu, F 2012, Application of a clustering method on sentiment analysis, Journal Of Information Science, 38, 2, pp. 127-139, Business Source Complete, viewed 21 April 2012.

     Pang B, Lee L, Vaithyanathan S. Thumbs up? Sentiment classification using machine learning techniques. In: Conference on empirical methods in natural language processing (EMNLP). Philadelphia, Pennsylvania, USA, 2002, p. 79.

     Shukla, A 2011, Sentiment Analysis of Document Based on Annotation, International Journal Of Web & Semantic Technology, 2, 4, pp. 91-103, Computers & Applied Sciences Complete, viewed 6 May 2012.

     Spivack, Nova 2007. The Semantic Web, Collective Intelligence and Hyperdata. novaspivack.typepad.com [Online] 18 September. http://novaspivack.typepad.com/nova_spivacks_weblog/2007/09/hyperdata.html [Accessed 1 May 2012]

     Vishwanath, J, & Aishwarya, S 2011, User Suggestions Extraction from customer Reviews: A Sentiment Analysis approach, International Journal On Computer Science & Engineering, 3, 3, pp. 1203-1206, Academic Search Complete, viewed 1 May 2012.

     Yang, C, Lin, K, & Chen, H 2008, Sentiment Analysis in Weblog Using Contextual Information: A Machine Learning Approach, International Journal Of Computer Processing Of Languages, 21, 4, pp. 331-345, Academic Search Complete, viewed 27 April 2012.

     Young, L, & Soroka, S 2012, Affective News: The Automated Coding of Sentiment in Political Texts, Political Communication, 29, 2, pp. 205-231, Academic Search Complete, viewed 10 May 2012.
