
Introduction to Mining Social Media Data

Tutorial at the Alberto Mendelzon Workshop, Colombia May 2018


  1. Introduction to Mining Social Media Data. Alberto Mendelzon Workshop, 21st May 2018. Miriam Fernandez, Knowledge Media Institute, Open University, UK. @miriam_fs @miriamfs. Credit to all these fantastic people!
  2. Who are we?
  3. Before we start…
     • 1. This is an after-lunch session – hope you took the necessary precautions!
     • 2. It is an introductory tutorial – if you were expecting something very complex, this is not for you; go out and enjoy the sun :)
     • 3. I hate talking alone for long periods of time – please ask or discuss anything you want at any point!
     • 4. Hands-on exercises available – fantastic tutorial @TheWebConf by some of my colleagues! :) https://github.com/evhart/smasac-tutorial/blob/master/README.md (Jupyter notebooks)
  4. Understanding Social Media
  5. Most Used Social Media Platforms. Source: https://techcrunch.com/2017/06/27/facebook-2-billion-users/
  6. Not the Only Ones. Smaller and less famous (open and closed) communities addressing particular geographic regions, specific user groups or niche interests thrive on the Web!
  7. A World-wide Phenomenon. Number of social network users worldwide, in billions. Source: https://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/
  8. Number of social network users in selected countries, in millions. Source: https://www.statista.com/statistics/278341/number-of-social-network-users-in-selected-countries/
  9. Full of Challenges
  10. Mining Social Media Data, for What? Trivalent: http://trivalent-project.eu/ · COMRADES: https://www.comrades-project.eu/ · DecarboNet: https://www.decarbonet.eu/ · Sense4us: http://www.sense4us.eu/ · ROBUST: http://www.robust-project.eu/ · OUSocial: http://oro.open.ac.uk/40883/1/ousocial2-demo.pdf. Some of the next slides from: https://www.slideshare.net/halani
  11. Studying social phenomena at scale!
  12. Social Semantic Statistical Analysis
  13. Businesses
     • Many businesses provide online communities to:
       – Increase customer loyalty
       – Raise brand awareness
       – Spread word-of-mouth
       – Facilitate idea generation
     • Online communities incur significant investment in terms of:
       – Money spent on hosting and bandwidth
       – Time and effort for maintenance
     • Community managers monitor community ‘health’ to:
       – Ensure longevity
       – Enable value generation
     • However, the notion of ‘health’ is hard to pin down
     http://www.robust-project.eu/
  14. Businesses. Monitoring the evolution of community activities and levels of contribution in the SAP Community Network (SCN)
  15. Reputation Fish Tank. https://www.youtube.com/watch?time_continue=57&v=KXRzdrDDt_8
  16. Active OU communities on Facebook
  17. Education (DEMO)
     • How active and engaged is the course group?
     • How is sentiment towards the course evolving?
     • Are the leaders of the group providing positive/negative comments?
     • What topics are emerging?
     • Is the group flourishing or diminishing?
     • Do students get the answers and support they need or not?
  18. OUAnalyse. Social media data vs. VLE data to increase retention. https://analyse.kmi.open.ac.uk/
  19. (figure)
  20. (figure)
  21. Automatic Categorisation of Social Media Accounts
     • Objective:
       – Provide automatic identification of the main actors talking about policy in social media
       – Allow policy researchers to concentrate on the opinions of citizens vs. commercial organizations
     • Approach: Twitter data → data collection → feature engineering → user classification (Person / Company / NGO / MP / News & Media)
  22. Policing. Olson’s psychological theory of luring communication (LCT); grooming data.
     • Classification results:
       – Trust development: 79% P, 82% R, 81% F1
       – Grooming stage: 88% P, 89% R, 88% F1
       – Physical approach: 87% P, 89% R, 88% F1
  23. Energy
  24. Disaster Management. 177 million tweets were posted in a single day during the 2011 Japan earthquake. The Boston Marathon bombing broke on Twitter; it was on the news 3 hours later!
  25. Ushahidi
  26. Crisis-related Event Detection Tasks. Crisis-related event detection is often divided into three main tasks [Olteanu et al. 2015], of increasing granularity:
     • Task 1. Crisis vs. non-crisis related messages – differentiate posts that are related to a crisis situation from those that are not
     • Task 2. Type of crisis – identify the type of crisis the message is related to (shooting, explosion, building collapse, fires, floods, meteorite fall, etc.)
     • Task 3. Type of information – identify the type of information the message conveys (affected individuals, infrastructure and utilities, donations and volunteering, caution and advice, etc.)
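Task 1 is typically cast as binary text classification. A minimal, stdlib-only sketch using a toy multinomial Naive Bayes is shown below; the labelled examples and vocabulary are invented for illustration and are not from the tutorial's datasets:

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train_nb(examples):
    """examples: list of (text, label). Returns doc counts, word counts, vocab."""
    counts = {"crisis": Counter(), "other": Counter()}
    docs = Counter()
    for text, label in examples:
        docs[label] += 1
        counts[label].update(tokenize(text))
    vocab = {w for c in counts.values() for w in c}
    return docs, counts, vocab

def classify(text, docs, counts, vocab):
    """Pick the label maximising log P(label) + sum log P(word|label)."""
    total = sum(docs.values())
    best, best_lp = None, -math.inf
    for label in docs:
        lp = math.log(docs[label] / total)
        denom = sum(counts[label].values()) + len(vocab)
        for w in tokenize(text):
            lp += math.log((counts[label][w] + 1) / denom)  # Laplace smoothing
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Invented toy training set
examples = [
    ("explosion reported downtown please avoid the area", "crisis"),
    ("flood water rising evacuation under way", "crisis"),
    ("building collapse rescue teams on site", "crisis"),
    ("great lunch with friends today", "other"),
    ("new phone arrived loving the camera", "other"),
    ("watching the game tonight", "other"),
]
model = train_nb(examples)
print(classify("flood evacuation in progress", *model))  # → crisis
print(classify("lunch with the team today", *model))     # → other
```

In practice Tasks 2 and 3 reuse the same machinery with multi-class labels (crisis types, information types).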
  27. Disaster Management. https://evhart.github.io/crees/
  28. Be aware of the problems! Fernandez, M. and Alani, H. "Online Misinformation: Challenges and Future Directions." Companion of The Web Conference 2018. http://oro.open.ac.uk/53734/ · https://kmitd.github.io/recoding-black-mirror/ · http://www.aolteanu.com/SocialDataLimitsTutorial/
  29. A strong need for ethics!
  30. Re-coding Black Mirror
  31. Bias on the Web, at all levels! http://www.aolteanu.com/SocialDataLimitsTutorial/
  32. Some considerations when collecting data
     • Automatic access to social media data can be restricted in different ways:
       – Public / non-public data: most social media websites do not allow access to posted information unless reading access is explicitly given by the information creator.
       – Query restrictions: data access can be limited by API restrictions (e.g., rate limiting, query allowance).
       – Data sampling: high-velocity data is sometimes sampled by social media companies; as a result, only a portion of the relevant information can be retrieved.
       – Query filtering: data is often retrieved using query parameters (e.g., keywords, geolocation). Missing / biased information.
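The query-restriction point above usually translates into collection code that paginates through an API while respecting its call allowance. A minimal sketch, assuming a Twitter-style windowed rate limit (e.g. 15 calls per 15 minutes); `fetch_page` is a hypothetical caller-supplied function, stubbed here so the example runs:

```python
import time

def collect_with_rate_limit(fetch_page, max_pages=10, window_s=900, max_calls=15):
    """Page through an API while respecting a limit of `max_calls`
    requests per `window_s`-second window.
    `fetch_page(cursor) -> (items, next_cursor)` is a caller-supplied stub."""
    items, cursor = [], None
    calls, window_start = 0, time.monotonic()
    for _ in range(max_pages):
        if calls >= max_calls:                       # allowance exhausted:
            sleep_for = window_s - (time.monotonic() - window_start)
            if sleep_for > 0:
                time.sleep(sleep_for)                # wait out the window
            calls, window_start = 0, time.monotonic()
        page, cursor = fetch_page(cursor)
        calls += 1
        items.extend(page)
        if cursor is None:                           # no more pages
            break
    return items

# Toy stand-in for a real API client: three pages of fake post IDs.
_pages = {None: ([1, 2], "a"), "a": ([3, 4], "b"), "b": ([5], None)}
def fake_fetch(cursor):
    return _pages[cursor]

print(collect_with_rate_limit(fake_fetch))  # → [1, 2, 3, 4, 5]
```

Note that even perfectly rate-limited collection still inherits the sampling and filtering biases described above.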
  33. Some considerations when analysing data
     – User types may vary (e.g., news organisations, journalists, companies, government, NGOs)
     – Populations may be biased (e.g., skewed distributions of age / gender / political views)
     – The type of information shared may vary (e.g., during a disaster you may have messages about affected individuals, caution and advice, donations or volunteering, messages of support)
     – The type of content shared may vary (e.g., text, images, videos, links)
     – The target audience may vary (e.g., general public, other organisations, followers, friends/family)
     – The social media platform used to communicate the message may vary, or more than one may be in use (e.g., Facebook, Twitter)
  34. https://shorensteincenter.org/information-disorder-framework-for-research-and-policymaking/
  35. Types of Misinformation and Disinformation. 7 types of mis- and dis-information (credit: Claire Wardle, First Draft)
  36. Affecting decision-making processes in many domains
  37. Dimensions of Combating Online Misinformation
     • Misinformation content detection – are misinformation content and sources automatically identified? Are streams of information automatically monitored? Is relevant corrective information identified as well?
     • Misinformation dynamics – are patterns of misinformation flow identified and predicted? Is demographic and behavioural information considered to understand and predict misinformation dynamics?
     • Content validation – is misinformation validated and fact-checked? Are users involved in the content validation process?
     • Misinformation management – are citizens’ perceptions and behaviour with regard to processing and sharing misinformation studied and monitored? Are intervention strategies put in place to handle the effects of misinformation?
  38. Misinformation Content Detection. Signals used to decide whether a post is misinformation:
     • Information source – e.g., lists of misleading sites (http://www.opensources.co/)
     • Content – text / images / videos; platform-specific features (hashtags, mentions)
     • Context
     • Network & propagation patterns
  39. Misinformation Dynamics
     • Misinformation spreads faster and more widely across the network
     • Misinformation can be attributed to / spread by bots & crowdturfing
     • Users who use more social words and affection are more susceptible to interacting with bots
     • Extroverts are more prone to share misinformation
     • Users tend to select and share content based on homogeneity (echo chambers), an effect exacerbated by ranking and personalisation algorithms; social bubbles arise from homophily, polarisation, low content diversity and strong social reinforcement
     • In social media environments, where users face high information load and finite attention, low-quality information is likely to go viral
     • Different types of misinformation spread differently: scientific news has a higher level of diffusion but decays faster; conspiracy theories spread more slowly, over longer time periods
     • Even when denied, rumour cascades continue to propagate
  40. Content Validation
     • Manual fact checkers: Full Fact (UK); Snopes and Root Claim (US); FactCheckNI (Northern Ireland); Pagella Politica (Italy)
     • Computational fact checkers (e.g., Truth Teller) automatically extract claims and validate them against a variety of information sources: knowledge bases, databases of facts manually assessed by experts, crowdsourcing for annotation and/or verification
     • Whether a claim is accepted by an individual is strongly influenced by the individual’s belief system (confirmation bias / motivated reasoning)
  41. Misinformation Management
     • Simply presenting people with corrective information is likely to fail to change their salient beliefs and opinions, and may even reinforce them
     • Strategies for combating misinformation with facts:
       – Provide an explanation rather than a simple refutation
       – Expose the user to related but disconfirming stories
       – Reveal the demographic similarity of the opposing group
       – Expose users to “small doses” of misinformation
       – Early detection of malicious accounts
       – Use of ranking and selection strategies based on corrective information
  42. Comparison of Relevant Platforms
  43. Limitations
     • Misinformation content detection
       – Does not provide a rationale or explanation for its decisions
       – Disengages users by regarding them as passive consumers rather than as active co-creators and detectors of misinformation
     • Misinformation dynamics
       – Does not consider the typology and topology of the different networks
       – Does not take into account how the misinformation-handling behaviour of users influences the spread of misinformation
     • Content validation
       – Not able to cope with the high volume of misinformation generated online
       – Often disconnected from where users tend to read, debate and share misinformation
     • Misinformation management
       – Tends to focus on the technical rather than the human aspects of the problem (i.e., the motivations and behaviours of users when generating and spreading misinformation)
  44. Research Directions
     • User involvement – participation of all stakeholders (end users, social scientists, computer scientists, educators, etc.) in the co-design of functions, user interfaces and delivery methods
     • Misinformation dynamics – study how platform-specific and network-specific features influence the dynamics of misinformation
     • Content validation – embed fact checkers into the environments where users tend to read, debate and share misinformation (plugins)
     • Misinformation management – understanding user behaviour towards misinformation, what opinions users form about it, and how these opinions evolve over time is key to successfully managing the impact of misinformation; technology can be used to test the effectiveness of various misinformation management policies and techniques, as well as to deploy them at scale
  45. Modeling Social Media Data. SIOC: http://sioc-project.org/ · Fernandez, M., Scharl, A., Bontcheva, K. and Alani, H. User Profile Modelling in Online Communities. SWCS’14, Third International Workshop on Semantic Web Collaborative Spaces, ISWC 2014. http://oro.open.ac.uk/41395/
  46. Data Integration
     • Social networking sites are like data silos – many isolated communities of users, each with their data
     • The same user can participate in different social networks – Miriam.fs / miriamfs / mfs
     • The same topic can be discussed in different social networks
     • Need ways to connect them:
       – To develop portable analysis models
       – To allow users to access their data uniformly across SNSs
       – To allow automatic data portability from one SNS to another
     Source: J. Breslin, The Social Semantic Web: An Introduction. http://www.slideshare.net/Cloud/the-social-semantic-web-an-introduction
  47. Users / Content / Collaborative Environment. Modelling the user: demographic characteristics (birthday, location, sex), preferences, needs, behaviour, personality, social network, content, collaborative environment, and the domain of discussion. Relevant vocabularies include FOAF, SIOC, SUM, MESH, OUBO, SemSNA, OPO, PAO, Schema.org and Microformats.
  48. Using SIOC to Model Twitter Data. A tweet is a sioct:MicroblogPost (identified by its URL) with sioc:content (tweet text), dcterms:created (creation time), sioc:topic pointing to sioct:Tag instances (extracted hashtags, via sioc:name), sioc:links_to (extracted links), geo:location pointing to a geo:Point (geo:lat / geo:long) and sioc:about pointing to a gn:Feature. Posts are linked by sioc:reply_of / sioc:has_reply and sioc:forwarded_by, and contained in a sioct:Microblog via sioc:has_container / sioc:container_of. The author is a sioc:UserAccount (sioc:has_creator / sioc:creator_of) with sioc:name (screen name), dcterms:title (user name), sioc:note (account description), sioc:avatar (avatar URL) and dcterms:created (account creation time); accounts are linked to each other via sioc:follows, sioc:mentions and sioc:addressed_to, to Twitter lists (sioc:Container, via sioc:subscriber_of / sioc:has_subscriber and sioc:has_owner / sioc:owner_of), and to the sioc:Site (Twitter homepage) via sioc:has_space / sioc:space_of.
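This mapping can be sketched in a few lines of code. The example below builds SIOC-style triples as plain (subject, predicate, object) tuples so it stays stdlib-only; a real pipeline would use an RDF library such as rdflib. The tweet dict and its values are invented for illustration:

```python
SIOC = "http://rdfs.org/sioc/ns#"
SIOCT = "http://rdfs.org/sioc/types#"
DCT = "http://purl.org/dc/terms/"

def tweet_to_triples(tweet):
    """Map a tweet dict to (subject, predicate, object) triples,
    following the SIOC modelling described above (a partial sketch)."""
    post = tweet["url"]
    user = tweet["user_url"]
    triples = [
        (post, "rdf:type", SIOCT + "MicroblogPost"),
        (post, SIOC + "content", tweet["text"]),
        (post, DCT + "created", tweet["created"]),
        (post, SIOC + "has_creator", user),
        (user, "rdf:type", SIOC + "UserAccount"),
        (user, SIOC + "name", tweet["screen_name"]),
    ]
    for tag in tweet.get("hashtags", []):       # extracted hashtags
        triples.append((post, SIOC + "topic", tag))
    for url in tweet.get("links", []):          # extracted links
        triples.append((post, SIOC + "links_to", url))
    for mention in tweet.get("mentions", []):   # mentioned accounts
        triples.append((post, SIOC + "mentions", mention))
    return triples

# Hypothetical example tweet
tweet = {
    "url": "https://twitter.com/alice/status/1",
    "user_url": "https://twitter.com/alice",
    "screen_name": "alice",
    "text": "Climate action now! #EarthHour",
    "created": "2015-03-28T20:30:00Z",
    "hashtags": ["EarthHour"],
}
for t in tweet_to_triples(tweet):
    print(t)
```

Serialising such triples with a shared vocabulary is what makes analyses portable across platforms, as the data integration slide argues.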
  49. Mining Social Media Data, How?
  50. Analysis: behaviour analysis and sentiment analysis
  51. Behaviour Analysis (in a climate change context). Fernandez, M., Piccolo, L., Alani, H., Maynard, D., Meili, C. and Wippoo, M. (2017). Pro-Environmental Campaigns via Social Media: Analysing Awareness and Behaviour Patterns. The Journal of Web Science, 3(1). http://www.webscience-journal.net/webscience/article/view/44/30 · Fernandez, M., Burel, G., Alani, H., Piccolo, L. S. G., Meili, C. and Hess, R. (2015). Analysing engagement towards the 2014 Earth Hour campaign in Twitter. http://oro.open.ac.uk/43621/1/ENVINFO2015_v12.pdf
  52. Problem. • Individual behaviour change is a central strategy to mitigate climate change • However, public engagement is still limited
  53. Problem. • Pro-environmental campaigns are increasingly run via social media • It is unclear how existing theories and studies of behaviour change can be applied to practical settings, particularly social media campaigns, to better target and inform users
  54. Research Questions
     • RQ1: How can we translate theories of behaviour change into computational methods to enable the automatic identification of behaviour?
     • RQ2: How can the combination of theoretical perspectives and the automatic identification of behaviour help us develop effective social media communication strategies for enabling behaviour change?
  55. Literature Review (I). • Behaviour change – socio-psychological models of behaviour (mainly at the individual level) – theories of change (5 Doors theory [Robinson])
  56. Literature Review (II)
     • Intervention strategies: information, discussions, public commitment, feedback, social feedback, goal setting, collaboration, competition, rewards, incentives, personalisation
     • Mapping behavioural stages to interventions:
       – Desirability: Information
       – Enabling Context: Information, Rewards, Incentives
       – Can Do: Goal Setting, Public Commitment, Feedback
       – Buzz: Feedback, Social Feedback
       – Invitation: Promoting Collaboration
  57. Capturing and Categorising Behaviour
     • Goal: automatic categorisation of users into behavioural stages following the 5 Doors theory of behaviour change
     • Analysis methodology, based on questionnaire findings (212 participants) – “there is a moderate relationship between the type of user-generated content and behaviour change stage”:
       1. Manual inspection of the patterns describing each behavioural stage
       2. Feature engineering based on the identified patterns
       3. Supervised classification
     • Example posts per behavioural stage:
       – Desirability: “I don’t understand why my energy bill is soooo expensive!”
       – Enabling Context: “I am considering walking or using public transport at least once a week”
  58. Manual Inspection of Linguistic Patterns
     • Desirability – negative sentiment (expressing personal frustration: anger / sadness); URLs (generally associated with facts); questions (how can I? / what should I?)
     • Enabling Context – neutral sentiment; conditional sentences (if you do [..] then [..]); numeric facts [consumption/pollution] + URL
     • Can Do – neutral sentiment; orders and suggestions (I/you should/must…)
     • Buzz – positive sentiment (happiness / joy); (I/we + present tense) I am doing / we are doing
     • Invitation – positive sentiment (happy / cute); [vocative] friends, guys; join me / tell us / with me
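To make these patterns concrete, here is a toy rule-based categoriser that directly encodes a few of them as regular expressions. The word lists, rule ordering and thresholds are invented for illustration; the actual study extracted these signals with GATE and trained a classifier instead of hand-ordering rules:

```python
import re

def stage_of(post):
    """Assign a 5 Doors behavioural stage from surface patterns.
    Toy rules only; the real pipeline used GATE features + supervised learning."""
    text = post.lower()
    # Invitation: vocatives and join-me phrases
    if re.search(r"\bjoin me\b|\btell us\b|\bwith me\b|\bfriends\b|\bguys\b", text):
        return "invitation"
    # Buzz: I/we + present continuous ("I am doing", "we are doing")
    if re.search(r"\b(i am|i'm|we are|we're)\b \w+ing", text):
        return "buzz"
    # Can Do: orders and suggestions
    if re.search(r"\b(you|i|we) (should|must)\b", text):
        return "can do"
    # Enabling Context: conditionals, or numeric fact + URL
    if re.search(r"\bif\b.+\bthen\b|https?://\S+.*\d|\d.*https?://\S+", text):
        return "enabling context"
    # Desirability: questions and frustration words (toy negative lexicon)
    if "?" in text or re.search(r"\b(expensive|angry|sad|hate)\b", text):
        return "desirability"
    return "unknown"

print(stage_of("I don't understand why my energy bill is soooo expensive!"))  # desirability
print(stage_of("Join me for Earth Hour, friends!"))                           # invitation
print(stage_of("You should switch the lights off tonight"))                   # can do
```

In the actual pipeline these patterns become features (next slide) rather than hard rules, letting the classifier weigh conflicting signals.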
  59. Feature Engineering. Using an extension of the GATE NLP tools (https://gate.ac.uk/):
     – Polarity (positive / negative / neutral)
     – Emotions: positive (joy / surprise / good / happy / cheeky / cute); negative (anger / disgust / fear / sadness / bad / swearing)
     – Directives: obligative (you must do) / imperative (do) / prohibitive (don’t do); jussive, i.e. imperative in the 3rd person (go me!); deliberative (shall / should we) / indirect deliberative (I wonder if)
     – Conditionals (if / then)
     – Questions (direct / indirect)
     – URLs (yes / no): indicates whether the message points to external information
  60. Behaviour Classification Model. • Multiple classifiers were tested on a sample of 2,610 annotated posts • Best-performing classifier: J48 decision tree (71.2% accuracy)
  61. Experiments
     • Analyse the behaviour of participants in EH15 & COP21
     • Data collection – participants of EH15 & COP21, up to 3,200 posts per user:
       – EH15: 56,531,349 posts, 20,847 users
       – COP21: 48,751,220 posts, 17,127 users
     • Data filtering – identify, for each user, the posts related to climate change/sustainability, using the term extraction tool ClimaTerm (a GATE service) based on Gemet / Reegle / DBpedia:
       – EH15: 750,538 posts, 20,847 users
       – COP21: 422,211 posts, 17,127 users
  62. Analysis of EH2015 and COP21. • Categorise user behaviour in the months before/after
  63. Recommendations
     • A big part of a campaign’s effort should be concentrated on providing messages with very concrete suggestions for climate change actions – most users are in the desirability stage: they want to change but they don’t know how
     • There is a need to identify really engaged individuals and community leaders and involve them more closely in the campaigns – few users are in the invitation stage, and most of them are organisations; for an invitation to be effective, it is vital who issues it
     • Efforts should be dedicated to engaging in discussions and providing direct feedback to users – communication in these campaigns generally functions as broadcasting, or one-way communication, from the organisations to the public; frequent and focused feedback is an intervention strategy that can help build self-efficacy and nudge users in the direction of change
  64. Behaviour Analysis (in an Enterprise Context). Rowe, M., Fernandez, M., Angeletou, S. and Alani, H. (2013). Community analysis through semantic rules and role composition derivation. Web Semantics: Science, Services and Agents on the World Wide Web, 18(1), 31-47. · Rowe, M. and Alani, H. (2012). What makes communities tick? Community health analysis using role compositions. PASSAT/SocialCom 2012, IEEE. · Rowe, M., Fernandez, M., Alani, H., Ronen, I., Hayes, C. and Karnstedt, M. (2012). Behaviour analysis across different types of Enterprise Online Communities. Proceedings of the 4th Annual ACM Web Science Conference (pp. 255-264). ACM. Some of the next slides from: https://www.slideshare.net/mattroweshow
  65. The Need for Interpretation
     • Online communities are dynamic behavioural ecosystems
       – Users in communities can be defined by their roles, i.e. exhibiting similar collective behaviour
       – Prevalent behaviour can impact upon community members and health
     • Management of communities is helped by:
       – Understanding the relation between behaviour and health: how user behaviour changes are associated with health; encouraging users to modify behaviour, in turn affecting health (e.g. content recommendation to specific users)
       – Predicting health changes: enables early decision making on community policy
     • Can we accurately and effectively detect positive and negative changes in community health from its composition of behavioural roles?
  66. SAP Community Network
     • Collection of SAP forums in which users discuss software development, SAP products and the usage of SAP tools
     • Points system for awarding best answers – enables the development of user reputation
     • Provided with a dataset covering 33 communities: spanning 2004–2011; 95,200 threads; 421,098 messages (78,690 of which were allocated points); 32,942 users. (Plot: post count per year, 2004–2011.)
  67. Community Health Indicators
     • From the literature there is no single agreed measure of ‘community health’ – it is multi-faceted (loyalty, participation, activity, social capital), and different communities and platforms look at different indicators
     • Indicator 1: churn rate (loyalty) – the proportion of users who participate in a community for the final time
     • Indicator 2: user count (participation) – the number of participating users in the community
     • Indicator 3: seeds-to-non-seeds posts proportion (activity) – the proportion of seed posts (thread starters that receive a reply) to non-seeds (no reply)
     • Indicator 4: clustering coefficient (social capital) – the average of users’ clustering coefficients within the largest strongly connected component
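Two of these indicators can be computed in a few lines. A stdlib-only sketch on invented toy data (the real study computes the clustering coefficient over the largest strongly connected component of the reply graph; here it is a plain undirected graph for brevity):

```python
from itertools import combinations

def churn_rate(last_post_period, current_period):
    """Indicator 1: proportion of users whose last observed post falls
    in the current period (i.e. they participate for the final time)."""
    leavers = sum(1 for p in last_post_period.values() if p == current_period)
    return leavers / len(last_post_period)

def avg_clustering(adj):
    """Indicator 4 (simplified): mean local clustering coefficient over
    an undirected graph given as {node: set(neighbours)}."""
    coeffs = []
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)        # fewer than 2 neighbours: coefficient 0
            continue
        # count edges among the node's neighbours
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)

# Toy community: each user's last active quarter
last_seen = {"u1": "2011Q4", "u2": "2011Q3", "u3": "2011Q4", "u4": "2011Q2"}
print(churn_rate(last_seen, "2011Q4"))   # → 0.5

# Toy reply graph: triangle u1-u2-u3 plus pendant u4
adj = {"u1": {"u2", "u3"}, "u2": {"u1", "u3"},
       "u3": {"u1", "u2", "u4"}, "u4": {"u3"}}
print(avg_clustering(adj))
```

User count and the seeds-to-non-seeds proportion are simple counts over the same data and are omitted.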
  68. Measuring Role Compositions I: Modelling and Measuring User Behaviour
     • According to the existing literature ((Hautz et al., 2010), (Nolker and Zhou, 2005), (Zhu et al., 2009), (Zhu et al., 2011)), user behaviour can be defined along 6 dimensions:
       – Focus dispersion – measure: forum entropy of the user
       – Engagement – measure: out-degree proportioned by potential maximal out-degree
       – Popularity – measure: in-degree proportioned by potential maximal in-degree
       – Contribution – measure: proportion of thread replies created by the user
       – Initiation – measure: proportion of threads that were initiated by the user
       – Content quality – measure: average points per post awarded to the user
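Two of these measures sketched in code, on invented toy data: focus dispersion as the entropy of a user's posts over forums, and engagement as out-degree divided by the maximal possible out-degree (everyone else in the community):

```python
import math
from collections import Counter

def forum_entropy(forum_posts):
    """Focus dispersion: entropy of a user's posts over forums.
    0 = all activity in one forum; higher = more dispersed."""
    counts = Counter(forum_posts)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def engagement(out_degree, community_size):
    """Engagement: out-degree proportioned by the potential maximal
    out-degree (community_size - 1 other users)."""
    return out_degree / (community_size - 1)

posts = ["abap", "abap", "hana", "abap", "ui5"]     # forums a user posted in (toy)
print(round(forum_entropy(posts), 3))
print(engagement(out_degree=4, community_size=21))  # → 0.2
```

Popularity, contribution, initiation and content quality follow the same shape: a count normalised by its maximum possible value.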
  69. Measuring Role Compositions II: Inferring Roles
     1. Construct features for community users at a given time step
     2. Derive bins using equal-frequency binning (e.g. Popularity-low cutoff = 0.5, Initiation-high cutoff = 0.4)
     3. Use the skeleton rule base to construct rules over bin levels: Popularity = low, Initiation = high -> roleA, i.e. Popularity < 0.5, Initiation > 0.4 -> roleA
     4. Apply the rules to infer user roles and the community composition
     5. Repeat 1–4 for the following time steps
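Steps 2–4 can be sketched with `statistics.quantiles` for the equal-frequency cutoffs. The per-user scores, the tercile split into low/mid/high, and the single `roleA` rule below are all invented for illustration; the actual rule base covers every dimension and role:

```python
from statistics import quantiles

def eq_freq_levels(values):
    """Derive low/mid/high cutoffs via equal-frequency (tercile) binning."""
    lo_cut, hi_cut = quantiles(values, n=3)       # two tercile boundaries
    def level(v):
        return "low" if v <= lo_cut else ("mid" if v <= hi_cut else "high")
    return level

# Toy per-user scores for two behaviour dimensions
popularity = {"u1": 0.1, "u2": 0.4, "u3": 0.9, "u4": 0.2, "u5": 0.7, "u6": 0.5}
initiation = {"u1": 0.8, "u2": 0.1, "u3": 0.9, "u4": 0.6, "u5": 0.2, "u6": 0.3}
pop_level = eq_freq_levels(list(popularity.values()))
ini_level = eq_freq_levels(list(initiation.values()))

# Skeleton rule over bin levels: Popularity = low, Initiation = high -> roleA
def infer_role(user):
    p, i = pop_level(popularity[user]), ini_level(initiation[user])
    if p == "low" and i == "high":
        return "roleA"
    return f"({p} popularity, {i} initiation)"

print(infer_role("u1"))  # low popularity + high initiation → roleA
```

Because the cutoffs are re-derived per time step, the same rule base adapts as the community's activity distribution shifts.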
  70. Measuring Role Compositions III: Mining Roles (skeleton rule base compilation)
     1. Select the tuning segment
     2. Discover correlated behaviour dimensions – removed Engagement and Contribution, kept Popularity (Pearson r > 0.75)
     3. Cluster users into behavioural groups – for each clustering algorithm Ψ, the number of clusters k is increased (2 ≤ k ≤ 30) and cohesion and separation are measured via the silhouette coefficient of each element i:
          s_i = (b_i − a_i) / max(a_i, b_i)
        where a_i is the average distance to all other items in the same cluster, and b_i is the minimum, over all other clusters, of the average distance to that cluster's items. s_i ranges between −1 (distinct items grouped together) and 1 (perfect cohesion and separation); the coefficient of a whole clustering is the average over all items. The best model was K-means with 11 clusters: for smaller cluster numbers (k = [3, 8]) the algorithms performed comparably, but as k grows K-means improves while the others degrade.
     4. Derive role labels for the clusters – inspect the dimension distributions in each cluster and align them with the levels (low / mid / high) derived from equal-frequency binning; at each decision node, choose the dimension with the largest entropy across the clusters, H(dim) = −Σ_level p(level|dim) log p(level|dim). The resulting cluster-to-level mapping (ordered from low to high patterns):
          Cluster | Dispersion | Initiation | Quality | Popularity
          1       | L          | L          | L       | L
          0       | L          | M          | H       | L
          6       | L          | H          | M       | M
          10      | L          | H          | M       | H
          4       | L          | H          | H       | M
          2,5     | M          | H          | L       | H
          8,9     | M          | H          | H       | H
          7       | H          | H          | L       | H
          3       | H          | H          | H       | H
        Role labels: 1 – Focussed Novice; 2,5 – Mixed Novice; 7 – Distributed Novice; 3 – Distributed Expert; 8,9 – Mixed Expert; 0 – Focussed Expert Participant; 4 – Focussed Expert Initiator; 6 – Knowledgeable Member; 10 – Knowledgeable Sink
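The silhouette coefficient in step 3 is easy to implement directly. A stdlib-only sketch on invented 1-D points (the study applied it to multi-dimensional behaviour feature vectors, sweeping k from 2 to 30):

```python
def silhouette(points, labels):
    """Mean silhouette coefficient s_i = (b_i - a_i) / max(a_i, b_i)
    over 1-D points with the given cluster labels.
    Assumes every cluster has at least 2 members."""
    def dist(x, y):
        return abs(x - y)
    scores = []
    for i, (x, c) in enumerate(zip(points, labels)):
        # a_i: mean distance to the other items in the same cluster
        same = [dist(x, y) for j, (y, d) in enumerate(zip(points, labels))
                if d == c and j != i]
        a = sum(same) / len(same)
        # b_i: minimum over other clusters of the mean distance to that cluster
        b = min(
            sum(dist(x, y) for y, d in zip(points, labels) if d == o) /
            sum(1 for d in labels if d == o)
            for o in set(labels) if o != c)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

pts = [0.0, 0.1, 0.2, 5.0, 5.1]
good = silhouette(pts, [0, 0, 0, 1, 1])   # well-separated clusters
bad = silhouette(pts, [0, 1, 0, 1, 0])    # scrambled labels
print(round(good, 3), round(bad, 3))
assert good > bad
```

Sweeping k and keeping the clustering with the highest mean silhouette is exactly the model-selection loop described in step 3.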
  71. Health Indicator Regression
     • Managing online communities is helped by understanding the relation between behaviour and health
     • (Plots: principal-component projections of community role compositions against each health indicator – churn rate, user count, seeds/non-seeds proportion, clustering coefficient)
     • No global composition pattern for the entirety of SCN
     • Identified key differences as to ‘what makes communities tick’
     • A decrease in Focussed Experts correlated with an increase in Seeds-to-Non-Seeds
  72. Sentiment Analysis. Saif, H., Fernandez, M., Kastler, L. and Alani, H. (2017). Sentiment lexicon adaptation with context and semantics for the social web. Semantic Web, 8(5), 643-665. · Saif, H., He, Y., Fernandez, M. and Alani, H. (2016). Contextual semantics for sentiment analysis of Twitter. Information Processing & Management, 52(1), 5-19. http://oro.open.ac.uk/42471/ · Saif, H., Ortega, F. J., Fernandez, M. and Cantador, I. (2016). Sentiment analysis in social streams. In Emotions and Personality in Personalized Services (pp. 119-140). Springer, Cham. · Saif, H., Fernandez, M., He, Y. and Alani, H. (2014). SentiCircles for contextual and conceptual semantic sentiment analysis of Twitter. In European Semantic Web Conference (pp. 83-98). Springer, Cham. · Saif, H., Fernandez, M., He, Y. and Alani, H. (2014). On stopwords, filtering and data sparsity for sentiment analysis of Twitter, 810-817. Some of the next slides from: https://www.slideshare.net/Staano/
  73. 73. 73! Alberto Mendelzon Workshop 21th May 2018 Outline o Definitions o Brief History o  Traditional Sentiment Analysis o  Applications o Sentiment Analysis on Social Media o  Significance o  Challenges o Semantic Sentiment Analysis o  Contextual Semantics o  Conceptual Semantics o Discussion
  74. 74. 74! Alberto Mendelzon Workshop 21th May 2018 Sentiment Analysis •  Recent field of study that analyzes people’s attitudes towards entities – individuals, organizations, products, services, events - topics, and their attributes (Liu, 2012) •  Interchangeably used along with Opinion Mining, –  although they are technically different tasks –  Opinion Mining: Extract the piece of text which represents the opinion •  I have recently upgraded to iPhone 5. I am not happy with the screen size, but the camera is absolutely amazing –  Sentiment Analysis: Extract the polarity of the opinion •  I am not happy with the screen size •  The camera is absolutely amazing
  75. 75. 75! Alberto Mendelzon Workshop 21th May 2018 75 Why? Because Opinions Matter! What does the public think?
  76. 76. 76! Alberto Mendelzon Workshop 21th May 2018
  77. 77. 77! Alberto Mendelzon Workshop 21th May 2018 http://www.datameer.com/blog/
  78. 78. 78! Alberto Mendelzon Workshop 21th May 2018
  79. 79. 79! Alberto Mendelzon Workshop 21th May 2018 Sentiment Analysis Tasks Ø  Subjectivity Detection Ø  Polarity Detection Ø  Sentiment Strength Detection Ø  Emotions Detection Ø  Sentiment Summarization Levels Ø  Word/Entity/Aspect Level Ø  Phrase Level Ø  Sentence Level Ø  Document Level Data Types Ø  Conventional Data Ø  Microblogging Data Approaches Ø  Machine Learning Ø  Lexicon-based Ø  Hybrid Sentiment Analysis
  80. 80. 80! Alberto Mendelzon Workshop 21th May 2018 Sentiment Analysis Tasks •  Subjectivity Detection –  Detect whether the text is objective or subjective •  Polarity Detection –  Detect whether the text is positive or negative •  Sentiment Strength Detection –  Detect the strength of the subjective text •  Emotions Detection –  Detect the human emotions and feelings expressed in text (e.g., “happiness”, “sadness”, “anger”)
  81. 81. 81! Alberto Mendelzon Workshop 21th May 2018 Sentiment Analysis Levels Word/Entity/Aspect Level •  Given a word w in a sentence s, decide whether this word is opinionated (i.e., expresses sentiment) or not Phrase-level (expression-level) •  Given a multi-word expression e in a sentence s, the task is to detect the sentiment orientation of e. (I’m very happy) Sentence-level •  Given a sentence s of multiple words and phrases, decide on the sentiment orientation of s Document-level •  Given a document d, decide on the overall sentiment of d
  82. 82. 82! Alberto Mendelzon Workshop 21th May 2018 Sentiment Analysis Approaches Lexicon- Based Approach Machine Learning Approach
  83. 83. 83! Alberto Mendelzon Workshop 21th May 2018 Machine Learning Approaches •  Supervised Classifiers: Naïve Bayes, MaxEnt, SVM, J48, etc. •  Unsupervised Classifiers: k-means, hierarchical clustering, HMM, SOM •  Semi-Supervised Classifiers: Label propagation and graph-based models
  84. 84. 84! Alberto Mendelzon Workshop 21th May 2018 Lexicon-based Approaches Example: “I had nightmares all night long last night :(” → Negative. Pipeline: text + Sentiment Lexicon (words such as great, sad, down, wrong, horrible, mistake, love, good; e.g., MPQA, SentiWordNet, LIWC) → Text Processing Algorithm → sentiment. Lexicon generation approaches: •  Manual •  Dictionary-based •  Corpus-based
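A minimal sketch of this lexicon-based pipeline. The tiny lexicon and the one-token negation window are illustrative stand-ins for real resources such as MPQA or SentiWordNet:

```python
# Toy sentiment lexicon: word -> prior polarity score (illustrative only)
LEXICON = {"great": 1, "love": 1, "good": 1,
           "sad": -1, "down": -1, "wrong": -1, "horrible": -1,
           "mistake": -1, "nightmares": -1}
NEGATIONS = {"not", "no", "never"}

def lexicon_sentiment(text):
    """Sum lexicon scores over tokens, flipping a score preceded by a negation."""
    tokens = text.lower().split()
    score = 0
    for i, tok in enumerate(tokens):
        tok = tok.strip(".,!?:;")
        if tok in LEXICON:
            s = LEXICON[tok]
            if i > 0 and tokens[i - 1] in NEGATIONS:
                s = -s
            score += s
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(lexicon_sentiment("I had nightmares all night long last night :("))  # → negative
```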
  85. 85. 85! Alberto Mendelzon Workshop 21th May 2018 85! Data Existing SA methods are designed to function on formal text, that is, text that is: 1.  Long enough 2.  Well-structured 3.  Composed of formal sentences Social media text is often •  Short •  Noisy and messy •  Informal, with ill-structured sentences
  86. 86. 86! Alberto Mendelzon Workshop 21th May 2018 Challenges to Traditional Approaches Machine Learning Approaches o  Classifier Training o  Labelled Corpora o  Labor Intensive Task o  Domain-Specific o  Re-Training with new domains o  Data Sparsity
  87. 87. 87! Alberto Mendelzon Workshop 21th May 2018 87! Challenges to Traditional Approaches •  Machine Learning Approaches o  Data Sparsity o  Twitter data are more sparse than conventional data (Saif et al., 2012) o  Singleton words constitute two-thirds of the words in tweets! [Bar chart: proportion of terms with TF=1 vs TF>1 in six datasets: OMD, HCR, STS-Gold, SemEval, WAB, GASP]
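The TF=1 proportion shown in the chart is straightforward to compute; the three-tweet corpus below is illustrative:

```python
from collections import Counter

def singleton_ratio(tweets):
    """Fraction of vocabulary terms that occur exactly once (TF = 1) in the corpus."""
    tf = Counter(tok for tweet in tweets for tok in tweet.lower().split())
    singletons = sum(1 for c in tf.values() if c == 1)
    return singletons / len(tf)

corpus = ["gr8 show lol", "the show was gr8", "omg totally unexpected ending"]
# 7 of the 9 vocabulary terms occur only once
print(round(singleton_ratio(corpus), 2))  # → 0.78
```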
  88. 88. 88! Alberto Mendelzon Workshop 21th May 2018 88! Challenges to Traditional Approaches Lexicon-based Approaches o  Sentiment Lexicons (e.g., MPQA, SentiWordNet) o  Not tailored to Twitter’s noisy data o  Fixed number of words: entries such as great, sad, down, wrong, horrible, mistake, love, good do not cover Twitter terms like grt8, lol, :), :P → Need lexicon adaptation!
  89. 89. 89! Alberto Mendelzon Workshop 21th May 2018 Sentiment in practice is usually conveyed through the latent semantics or meaning of words in texts! “I had a great pain in my lower back this morning :(” — Great Pain → Negative. “Ebola is spreading in Africa and ISIS in Middle East!” — Ebola → Virus/Disease (Negative), ISIS → Militant Group (Negative). Sentiment is dynamic, domain-dependent, and…
  90. 90. 90! Alberto Mendelzon Workshop 21th May 2018 Semantic Sentiment Analysis (SentiCircles) SentiCircles •  Semantic Representation of words that captures their contextual sentiment orientation and strength in tweets (Saif et al., 2014) •  Captures Contextual & Conceptual Semantics of words •  Does not rely on the structure of tweets •  Provides lexicon-based sentiment analysis: –  Tweet-level –  Entity-level Semantic sentiment analysis aims at extracting and using the underlying semantics of words/aspects in identifying their sentiment orientation with regards to their context in the text !
  91. 91. 91! Alberto Mendelzon Workshop 21th May 2018 Distributional Semantic Hypothesis “Words that occur in similar context tend to have similar meaning” Wittgenstein (1953). Example: “Trojan Horse” in a security context co-occurs with Threat, Hack, Code, Malware, Program, Dangerous, Harm; “Trojan Horse” in a historical context co-occurs with Greek, Tale, History, Class, Wooden, Troy.
  92. 92. 92! Alberto Mendelzon Workshop 21th May 2018 Capturing Contextual Semantics A term m is represented by a context-term vector (C1, C2, …, Cn); each context term contributes (1) a Degree of Correlation with m and (2) a Prior Sentiment taken from a sentiment lexicon. The main idea behind the SentiCircle approach is that the sentiment of a term is not static, as in traditional lexicon-based approaches, but depends on the context in which the term is used, i.e., its contextual semantics; context is defined as a textual corpus or a set of tweets. The contextual semantics of a term m are computed from its co-occurrence patterns with its context words. The relation between m and a context term ci is measured with the Term Degree of Correlation (TDOC) metric, inspired by the TF-IDF weighting scheme: TDOC(m, ci) = f(ci, m) × log(N / N_ci)   (1) where f(ci, m) is the number of times ci occurs with m in tweets, N is the total number of terms, and N_ci is the total number of terms that occur with ci. If ci appears in the vicinity of a negation, its prior sentiment score is negated (negation words are collected from the General Inquirer under the NOTLW category). Example: “Trojan Horse” with context terms threat, attack. Output: (3) Contextual Sentiment Orientation (Positive, Negative, Neutral) and Contextual Sentiment Strength, from -1 (very negative) to +1 (very positive).
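Equation (1) translates directly into code; the co-occurrence counts in the example are made up for illustration:

```python
import math

def tdoc(f_ci_m, n_total, n_ci):
    """TDOC(m, ci) = f(ci, m) * log(N / N_ci), Equation (1):
    f_ci_m  -- number of times context term ci occurs with m in tweets,
    n_total -- total number of terms (N),
    n_ci    -- number of terms that occur with ci (N_ci)."""
    return f_ci_m * math.log(n_total / n_ci)

# Hypothetical counts: 'threat' co-occurs 12 times with 'Trojan Horse';
# the corpus has 10,000 terms, 200 of which occur with 'threat'
print(round(tdoc(12, 10_000, 200), 2))  # → 46.94
```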
  93. 93. 93! Alberto Mendelzon Workshop 21th May 2018 SentiCircles The SentiCircle Approach: each context term Ci of a term m (e.g., “Trojan Horse”, with context terms threat, destroy, malicious, attack, easily, discover, useful, fix, dangerous) is placed as a point in a circle: ri = TDOC(Ci), θi = Prior_Sentiment(Ci) × π, xi = ri × cos(θi), yi = ri × sin(θi). The upper half of the circle holds the Positive/Very Positive quadrants, the lower half the Negative/Very Negative quadrants, with a small Neutral Region around the positive X-axis. The Overall Contextual Sentiment (Senti-Median) is the geometric median g = (x, y) of all points pi, i.e., the point whose total Euclidean distance to all points is minimum; it captures the sentiment (y-coordinate) and the sentiment strength (x-coordinate) of the SentiCircle of the term m. The sentiment of the term depends on whether the Senti-Median lies inside the neutral region, the positive quadrants, or the negative quadrants; formally, the term-sentiment function is L(g) = negative if y_g < −λ; positive if y_g > +λ; neutral if |y_g| ≤ λ and x_g ≤ 0, where λ is the threshold that defines the Y-axis boundary of the neutral region.
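The geometry above can be sketched as follows. Two simplifications are flagged in the comments: the true Senti-Median is a geometric median, approximated here by the coordinate-wise median, and the neutral test drops the x ≤ 0 condition. The TDOC values and priors are illustrative:

```python
import math
from statistics import median

def senticircle_point(r, prior):
    """Place a context term: r = TDOC(ci), theta = prior_sentiment * pi (prior in [-1, 1])."""
    theta = prior * math.pi
    return (r * math.cos(theta), r * math.sin(theta))

def term_sentiment(points, lam=0.05):
    """Classify a term from its (approximate) Senti-Median g = (x, y).
    Coordinate-wise median stands in for the geometric median; the original
    neutral region additionally requires x <= 0, omitted here for brevity."""
    y = median(py for _, py in points)
    if y < -lam:
        return "negative"
    if y > lam:
        return "positive"
    return "neutral"

# Context terms of 'Trojan Horse' in its security sense: (TDOC, prior sentiment)
context = [(4.0, -0.8), (3.5, -0.6), (2.0, -0.9), (1.0, 0.3)]  # threat, attack, malicious, useful
points = [senticircle_point(r, s) for r, s in context]
print(term_sentiment(points))  # → negative
```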
  94. 94. 94! Alberto Mendelzon Workshop 21th May 2018 Examples
  95. 95. 95! Alberto Mendelzon Workshop 21th May 2018 Tweet-Level Contextual Sentiment (I) (1) The Median Method Example tweet: “Cycling under a heavy rain.. what a #luck!” Each term in the tweet has its own Senti-Median; the overall tweet sentiment is the median of the terms’ Senti-Medians.
  96. 96. 96! Alberto Mendelzon Workshop 21th May 2018 Tweet-Level Contextual Sentiment (II) (2) The Pivot Method Example tweet: “I like my new iPad”. The Median Method takes the median of all Senti-Medians and treats all tweet terms as equal: each tweet is turned into a vector of Senti-Medians g = (g1, g2, …, gn), where gj is the Senti-Median of the SentiCircle of term mj, and the median point of g determines the overall sentiment. The Pivot Method instead favours some terms over others, on the assumption that sentiment is often expressed towards one or more specific targets (“pivot” terms) — e.g., “iPhone” and “iPad” when the sentiment word “amazing” is used to describe both. It works by (1) extracting all pivot terms in a tweet and (2) accumulating, for each sentiment label, the sentiment impact that each pivot term receives from the other terms; the overall sentiment of the tweet is the label with the highest sentiment impact. Opinion target identification is a challenging task beyond the scope of this study, so for simplicity the pivot terms are those with the POS tags {Common Noun, Proper Noun, Pronoun}. Formally, the Pivot Method seeks the sentiment ŝ that receives the maximum sentiment impact within a tweet: ŝ = argmax_{s∈S} H_s(p) = argmax_{s∈S} Σ_{i=1..Np} Σ_{j=1..Nw} H_s(pi, wj)   (7) where S = {Positive, Negative, Neutral} is the sentiment label, p is the vector of pivot terms, and wj are the other tweet terms.
  97. 97. 97! Alberto Mendelzon Workshop 21th May 2018 Performance [Bar charts, tweet-level sentiment analysis: polarity detection accuracy and F-measure for MPQA-Lex, SentiWNet-Lex and SentiCircle, and for SentiStrength vs SentiCircle. Bar charts, entity-level sentiment analysis: subjectivity and polarity detection accuracy and F1 for MPQA-Lex, SentiWNet-Lex, SentiStrength and SentiCircle. Reported improvements: +30-40%, +2-15%, +20%, +1/-1%]
  98. 98. 98! Alberto Mendelzon Workshop 21th May 2018 Enriching SentiCircles with Conceptual Semantics •  Semantics extracted from external knowledge sources (e.g., ontologies and semantic networks). “ISIS is spreading in the Middle East like Cancer!” — ISIS → Jihadist/Militant. “What a sad day, 4 doctors were lost to Ebola today!” — Ebola → Virus. “Finally, I got my iPhone 6s, What a product!!” — iPhone 6s → Apple Product.
  99. 99. 99! Alberto Mendelzon Workshop 21th May 2018 Enriching SentiCircles with Conceptual Semantics Example: “Cycling under a heavy rain.. What a #luck!” — rain → Weather Condition (related concepts: Wind, Snow, Humidity). [Bar chart, tweet-level sentiment analysis: precision, recall and F1 for Unigrams, POS and Semantics features; adding semantics yields +4% F1]
  100. 100. 100! Alberto Mendelzon Workshop 21th May 2018 Sentiment Lexicon Adaptation •  Typical sentiment lexicons: –  Context-insensitive sentiment –  Fixed set of words •  Lexicon Adaptation –  Update the sentiment of words in a given lexicon with respect to their context in text. •  Cold beer -> Positive •  Great pain -> Negative Pipeline (lexicon adaptation with SentiCircles): Tweets → Extract Contextual Sentiment → Rule-based Lexicon Adaptation (with the original Sentiment Lexicon) → Adapted Lexicon
  101. 101. 101! Alberto Mendelzon Workshop 21th May 2018 Adaptation Impact on Thelwall-Lexicon Words in Thelwall-Lexicon were adapted based on their context in three different datasets: OMD, HCR, STS (Saif et al., 2013). Words found in the lexicon: 9.6%, of which 33.82% flipped their sentiment orientation, 62.94% changed their sentiment strength, and 3.24% remained unchanged; new opinionated words: 21.37%. Adaptation impact on classification: Original Lexicon — Accuracy 66.29, F1 61.4; Adapted Lexicon — Accuracy 69.29, F1 66.03.
  102. 102. 102! Alberto Mendelzon Workshop 21th May 2018 Strengths and Limitations •  SentiCircles effectively capture contextual semantics and sentiment at the corpus level •  Provide lexicon-based (unsupervised) sentiment analysis •  Provide domain-specific sentiment analysis •  Low complexity –  Does not rely on sentence structure •  Not tailored to tweet-level / sentence-level context •  Sensitive to imbalanced sentiment class distribution •  Not very effective on small Twitter datasets
  103. 103. 103! Alberto Mendelzon Workshop 21th May 2018 103 Take-away Message
  104. 104. 104! Alberto Mendelzon Workshop 21th May 2018 Take-away Message •  Social media data can be mined for multiple applications •  It’s a great way to understand social phenomena at scale! •  This research must be interdisciplinary •  When using and studying social media we need to be very aware of the problems (ethics / biases / misinformation) •  A “pinch” of semantics goes a long way J THX A LOT FOR LISTENING! J
  105. 105. 105! Alberto Mendelzon Workshop 21th May 2018 105 Let’s Download some Twitter Data ☺
  106. 106. 106! Alberto Mendelzon Workshop 21th May 2018 Time to Play! •  Automatic data collection generally relies on JSON APIs and OAuth credentials. For example, for Twitter, you need to: 1.  Create a Twitter account (https://twitter.com). 2.  Obtain OAuth access credentials (i.e., access token, access secret, consumer key and consumer secret) (https://apps.twitter.com/app/new). 3.  Use the Search API for collecting tweets (https://developer.twitter.com). 4.  Save tweets in JSON or another format for later analysis.
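Steps 2-4 can be sketched with the popular tweepy library (shown with its classic v1.1-style interface; newer tweepy versions use a different Client API). All credentials are placeholders, and the query and filename are just examples:

```python
import json

# Placeholders -- fill in with your own OAuth credentials from apps.twitter.com
CONSUMER_KEY, CONSUMER_SECRET = "YOUR-CONSUMER-KEY", "YOUR-CONSUMER-SECRET"
ACCESS_TOKEN, ACCESS_SECRET = "YOUR-ACCESS-TOKEN", "YOUR-ACCESS-SECRET"

def collect(query, count=100):
    """Steps 2-3: authenticate with OAuth and call the Search API via tweepy."""
    import tweepy  # pip install tweepy
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    api = tweepy.API(auth)
    return [status._json for status in api.search(q=query, count=count)]

def save_tweets(tweets, path):
    """Step 4: write one tweet per line (JSON Lines) for later analysis."""
    with open(path, "w", encoding="utf-8") as f:
        for tweet in tweets:
            f.write(json.dumps(tweet) + "\n")

# Usage (needs valid credentials and network access):
#   save_tweets(collect("#AMW2018"), "tweets.jsonl")
```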
