
Making More Sense Out of Social Data


Keynote at Workshop on Linked Science - Making Sense Out of Data - International Semantic Web Conference 2014.


  1. Making More Sense Out of Social Data. Harith Alani, http://people.kmi.open.ac.uk/harith/, @halani, harith-alani. 4th Workshop on Linked Science 2014 - Making Sense Out of Data (LISC2014), ISWC 2014, Riva del Garda, Italy
  2. Topics • Social media monitoring • Behaviour role analysis • Semantic sentiment • Engagement in microblogs • Cross platform and topic studies • Semantic clustering • Application examples
  3. Take home messages • Social media has many more challenges and opportunities to offer • Fusing semantics and statistical methods is gooood • Studying isolated social media platforms is baaaad … or not good enough … anymore!
  4. Sociograms • Capturing and graphing social relationships • Moreno, founder of sociograms and sociometry • Assessing psychological well-being from social configurations of individuals and groups. Friendship Choices Among Fourth Graders (from Moreno, 1934, p. 38). http://diana-jones.com/wp-content/uploads/Emotions-Mapped-by-New-Geography.pdf
  5. Computational Social Science. "A field is emerging that leverages the capacity to collect and analyze data at a scale that may reveal patterns of individual and group behaviours." "What does existing sociological network theory, built mostly on a foundation of one-time 'snapshot' data, typically with only dozens of people, tell us about massively longitudinal data sets of millions of people .. ?" Original slide by Markus Strohmaier. http://gking.harvard.edu/files/LazPenAda09.pdf
  6. Social semantic linking …. in 2003! • Domain ontologies • Semantics for integrating people, projects, and publications • Identify communities of practice • Browse evolution of social relationships and collaborations. Alani, H.; Dasmahapatra, S.; O'Hara, K. and Shadbolt, N. Identifying communities of practice through ontology network analysis. IEEE Intelligent Systems, 18(2), 2003.
  7. Linking scientists …. in 2005 • Who is collaborating with whom? • How have funding programmes impacted collaborations over time? • Architecture: data sources → gatherers and mediators → ontology → knowledge repository (triplestore) → applications. Alani, H.; Gibbins, N.; Glaser, H.; Harris, S. and Shadbolt, N. Monitoring research collaborations using semantic web technologies. ESWC, Crete, 2005.
  8. Bigger data, greater sociograms
  9. Social Media
  10. In-house Social Platforms (Jan 29, 2013)
  11. Tools for monitoring social networks
  12. Reputation Monitoring • http://www.robust-project.eu/videos-demos
  13. Challenges and Opportunities • Integration – How to represent and connect this data? • Behaviour – How can we measure and predict behaviour? – Which behaviours are good/bad in which community type? • Community Health – What health signs should we look for? – How can we predict this health? • Engagement – How can we measure and maximise engagement? • Sentiment – How to measure it? – Track it towards entities and contexts?
  14. Patterns
  15. Technologies: Semantic Web & Linked Data; Semantic Sentiment Analysis; Macro/Micro Behaviour Analysis (lurkers, initiators, followers, leaders); Statistical Analysis; Community Engagement. [Figure: cumulative density functions and boxplots of feature distributions across 11 role clusters; role labels such as Focussed Expert Participant and Focussed Novice are detailed on slide 30]
  16. MODELLING AND LINKING SOCIAL MEDIA DATA
  17. June 25, 2013
  18. Semantically-Interlinked Online Communities (SIOC) • SIOC is an ontology for representing and integrating data from the social web • Simple, concise, and popular • Still seeking the one size that'll fit all. sioc-project.org
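To make the SIOC modelling concrete, here is a minimal rdflib sketch, not from the talk, of describing a forum post with SIOC core terms (sioc:Post, sioc:Forum, sioc:UserAccount, sioc:has_container, sioc:has_creator, sioc:content); the example.org URIs and the post text are invented:

    # Minimal sketch of a SIOC forum post in rdflib; URIs are placeholders.
    from rdflib import Graph, Literal, Namespace, RDF, URIRef

    SIOC = Namespace("http://rdfs.org/sioc/ns#")
    EX = Namespace("http://example.org/community/")  # hypothetical community

    g = Graph()
    g.bind("sioc", SIOC)

    forum = URIRef(EX + "forum/rugby")
    post = URIRef(EX + "post/42")
    author = URIRef(EX + "user/alice")

    g.add((forum, RDF.type, SIOC.Forum))
    g.add((post, RDF.type, SIOC.Post))
    g.add((author, RDF.type, SIOC.UserAccount))
    g.add((post, SIOC.has_container, forum))  # the forum the post was made in
    g.add((post, SIOC.has_creator, author))   # the account that wrote it
    g.add((post, SIOC.content, Literal("Anyone watching the match tonight?")))

    print(g.serialize(format="turtle"))

A tweet would look much the same minus the sioc:has_container triple, which is the "no forum structure" point made on slide 20.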
  19. SIOC for Discussion forums • SIOC is well tailored to fit discussion forum communities • Needs extension to fit other communities, such as microblogs and Q&A
  20. Twitter in SIOC • Microblogs • No forum structure
  21. IBM Connections in SIOC
  22. SAP Community Network in SIOC
  23. BEHAVIOUR ROLES
  24. http://www.smrfoundation.org/wp-content/uploads/2008/12/distinguishing-attributes-of-social-roles.png
  25. Why do we monitor behaviour? • To understand the role of people in a community • To monitor the impact of behaviour on community evolution • To forecast a community's future • To learn which behaviours should be encouraged or discouraged • To find the best mix of behaviours to increase engagement in an online community • To see which users need more support, which should be confined, and which should be promoted
  26. Linking networks
  27. Linking people via sensors, social media, papers, projects • Integration of physical presence and online information • Semantic user profile generation • Logging of face-to-face contact • Social network browsing • Analysis of online vs offline social networks. Excerpt of the tagging ontology shown on the slide (RDF/XML, truncated):

    <?xml version="1.0"?>
    <rdf:RDF
      xmlns="http://tagora.ecs.soton.ac.uk/schemas/tagging#"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
      xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
      xmlns:owl="http://www.w3.org/2002/07/owl#"
      xml:base="http://tagora.ecs.soton.ac.uk/schemas/tagging">
      <owl:Ontology rdf:about=""/>
      <owl:Class rdf:ID="Post"/>
      <owl:Class rdf:ID="TagInfo"/>
      <owl:Class rdf:ID="GlobalCooccurrenceInfo"/>
      <owl:Class rdf:ID="DomainCooccurrenceInfo"/>
      <owl:Class rdf:ID="UserTag"/>
      <owl:Class rdf:ID="UserCooccurrenceInfo"/>
      <owl:Class rdf:ID="Resource"/>
      <owl:Class rdf:ID="GlobalTag"/>
      <owl:Class rdf:ID="Tagger"/>
      <owl:Class rdf:ID="DomainTag"/>
      <owl:ObjectProperty rdf:ID="hasPostTag">
        <rdfs:domain rdf:resource="#TagInfo"/>
      </owl:ObjectProperty>
      <owl:ObjectProperty rdf:ID="hasDomainTag">
        <rdfs:domain rdf:resource="#UserTag"/>
      </owl:ObjectProperty>
      <owl:ObjectProperty rdf:ID="isFilteredTo">
        <rdfs:range rdf:resource="#GlobalTag"/>
        <rdfs:domain rdf:resource="#GlobalTag"/>
      </owl:ObjectProperty>
      <owl:ObjectProperty rdf:ID="hasResource">
        <rdfs:domain rdf:resource="#Post"/>
        <rdfs:range …

Alani, H.; Szomszor, M.; Cattuto, C.; Van den Broeck, W.; Correndo, G. and Barrat, A. Live social semantics. ISWC, Washington, DC, 2009.
  28. Online+offline social networks [Chart: H-Index, F2F Degree, and F2F Strength per participant] • What's your social configuration? • What does it say about you? • And what you'll become? Barrat, A.; Cattuto, C.; Szomszor, M.; Van den Broeck, W. and Alani, H. Social dynamics in conferences: analyses of data from the Live Social Semantics application. ISWC, Shanghai, China, 2010.
  29. http://www.tehowners.com/info/Popular%20Culture%20&%20Social%20Media/Online%20Communities.jpg
  30. Clustering for identifying emerging roles – Map the distribution of each feature in each cluster to a level (i.e. low, mid, high) – Align the mapping patterns with role labels. Mapping of cluster dimensions to levels:

    Cluster  Dispersion  Initiation  Quality  Popularity
    0        L           M           H        L
    1        L           L           L        L
    2        M           H           L        H
    3        H           H           H        H
    4        L           H           H        M
    5,7      H           H           L        H
    6        L           H           M        M
    8,9      M           H           H        H
    10       L           H           M        H

Role labels interpreted from the clusters: • 0 - Focussed Expert Participant: provides high quality answers but only within forums that they do not deviate from; mixes asking questions and answering them • 1 - Focussed Novice: focussed within a few select forums but does not provide good quality content • 2 - Mixed Novice: a novice across a medium range of topics • 3 - Distributed Expert: an expert on a variety of topics who participates across many different forums • 4 - Focussed Expert Initiator: similar to cluster 0, focussed on certain topics and expert on those, but to a large extent starts discussions and threads, indicating that his/her shared content is useful to the community • 5,7 - Distributed Novice: participates across a range of forums but is not knowledgeable on any topics
  31. Encoding Roles in Ontologies with SPIN
  32. Behaviour role extraction from Social Media Data. Features: structural, social network, reciprocity, persistence, participation. • Bottom-up analysis – Every community member is classified into a "role" (initiators, lurkers, followers, leaders) – Unknown roles might be identified – Copes with role changes over time • Feature levels change with the dynamics of the community • Roles are associated with a collection of feature-to-level mappings, e.g. in-degree -> high, out-degree -> high • Run rules over each user's features and derive the community role composition (see the sketch below). Angeletou, S.; Rowe, M. and Alani, H. Modelling and analysis of user behaviour in online communities. ISWC 2011, Bonn, Germany.
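A minimal sketch of the rule idea on this slide, assuming invented features, tertile binning, and toy role rules (the paper's actual features and thresholds differ): bin each feature into low/mid/high against the community's current distribution, so levels track community dynamics, then match users against feature-to-level rules.

    import numpy as np

    def to_level(value, feature_values):
        # Map a raw value to low/mid/high using tertiles of the community's
        # current distribution, so levels follow the community's dynamics.
        lo, hi = np.percentile(feature_values, [33, 66])
        return "low" if value <= lo else "mid" if value <= hi else "high"

    # Hypothetical role rules: role -> required feature-to-level mapping
    ROLE_RULES = {
        "leader": {"in_degree": "high", "out_degree": "high"},
        "initiator": {"initiation": "high"},
        "lurker": {"in_degree": "low", "out_degree": "low"},
    }

    def assign_role(user_features, community_features):
        # Derive this user's feature levels relative to the whole community.
        levels = {f: to_level(v, community_features[f])
                  for f, v in user_features.items()}
        for role, rule in ROLE_RULES.items():
            if all(levels.get(f) == lvl for f, lvl in rule.items()):
                return role
        return "follower"  # default bucket in this sketch

    community = {"in_degree": [0, 1, 2, 5, 40],
                 "out_degree": [0, 2, 3, 4, 35],
                 "initiation": [0, 0, 1, 2, 9]}
    print(assign_role({"in_degree": 40, "out_degree": 35, "initiation": 0},
                      community))  # -> "leader"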
  33. Correlation of behaviour roles with community activity • How do certain behaviour roles impact activity in different community types? Examples: forum on Commuting and Transport, forum on Rugby, forum on Mobile Phones and PDAs.
  34. Community types • So do communities of different types behave differently? • Analysed IBM Connections communities to study participation, activity, and behaviour of users • Compare exhibited community behaviour with what users say they use the community for – does macro behaviour match micro needs?
  35. Community types [Diagram: community content types - wiki pages and edits, blog posts and comments, forum threads and replies, tags, bookmarks, files] § Data consists of non-private info on an IBM Connections intranet deployment § Communities: ID, creation date, members, used applications (blogs, wikis, forums) § Forums: discussion threads, comments, dates, authors and responders
  36. Community types • Muller, M. (CHI 2012) identified five distinct community types in IBM Connections: – Communities of Practice (CoP): for sharing information and networking – Teams: shared goal for a particular project or client – Technical Support: support for a specific technology – Idea Labs: for focused brainstorming – Recreation: recreational activities unrelated to work • Our data consisted of the 186 most active communities: – 100 CoPs, 72 Teams, and 14 Technical Support communities – No Idea Labs or Recreation communities
  37. Behaviour roles in different community types • Members of Team communities are more engaged, popular, and initiate more discussions • Technical Support community members are mostly active in a few communities, and don't initiate or contribute much! • CoP members are active across many communities, and contribute more. Rowe, M.; Fernandez, M.; Alani, H.; Ronen, I.; Hayes, C. and Karnstedt, M. Behaviour Analysis across different types of Enterprise Online Communities. WebSci 2012.
  38. Behaviour roles and community health • Machine learning models to predict community health based on compositions and evolution of user behaviour • Health indicators: – Churn rate: proportion of community leavers in a given time segment – User count: number of users who posted at least once – Seeds to non-seeds ratio: proportion of posts that get responses to those that don't – Clustering coefficient: extent to which the community forms a clique • The fewer Focused Experts in the community, the more posts will receive a reply! • There is no "one size fits all" model! [Figure: ROC curves (TPR vs FPR) for predicting each health indicator] Rowe, M. and Alani, H. What makes communities tick? Community health analysis using role compositions. SocialCom 2012, Amsterdam, The Netherlands.
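Two of these indicators are easy to make concrete; a minimal sketch (not the paper's code), assuming posts are dicts with id and parent_id fields and a set of posters per time segment:

    def churn_rate(users_prev, users_curr):
        # Proportion of one time segment's posters who are gone in the next.
        return len(users_prev - users_curr) / len(users_prev) if users_prev else 0.0

    def seeds_to_nonseeds(posts):
        # Seeds are thread-starting posts that receive at least one reply;
        # non-seeds are thread-starting posts that receive none.
        replied_to = {p["parent_id"] for p in posts if p["parent_id"] is not None}
        starters = [p for p in posts if p["parent_id"] is None]
        seeds = sum(1 for p in starters if p["id"] in replied_to)
        non_seeds = len(starters) - seeds
        return seeds / non_seeds if non_seeds else float("inf")

    posts = [{"id": 1, "parent_id": None}, {"id": 2, "parent_id": 1},
             {"id": 3, "parent_id": None}]
    print(churn_rate({"a", "b"}, {"b"}))  # 0.5
    print(seeds_to_nonseeds(posts))       # 1.0 (one seed, one non-seed)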
  39. SEMANTIC SENTIMENT ANALYSIS
  40. Semantic sentiment analysis on social media • A range of features and statistical classifiers have been used in social media sentiment analysis in recent years • Semantics have often been overlooked – semantic features – semantic patterns • Semantic concepts can help determine sentiment even when no good lexical clues are present
  41. Sentiment Analysis. Lexicon-based approach: a sentiment lexicon (hate: negative; honest: positive; inefficient: negative; love: positive; …) is applied directly to texts such as "I really love the iPhone" / "I hate the iPhone". Machine learning approach: learn a model (Naïve Bayes, SVM, MaxEnt, etc.) from a training set, then apply the model to a test set.
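The contrast between the two approaches fits in a few lines; a toy sketch with an invented four-word lexicon and a two-tweet training set:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    LEXICON = {"love": 1, "honest": 1, "hate": -1, "inefficient": -1}

    def lexicon_sentiment(text):
        # Lexicon-based: sum the polarities of known words.
        score = sum(LEXICON.get(w, 0) for w in text.lower().split())
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    # Machine learning: learn a model from labelled examples, then apply it.
    train_texts = ["I really love the iPhone", "I hate the iPhone"]
    train_labels = ["positive", "negative"]
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)

    print(lexicon_sentiment("I love this honest review"))   # positive
    print(model.predict(["love the new iPhone"])[0])        # positive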
  42. Semantic Concept Extraction • Extract semantic concepts from tweet data and incorporate them into the supervised classifier training • Prior work compared AlchemyAPI, OpenCalais and Zemanta and found AlchemyAPI best for entity extraction and semantic concept mapping; our datasets consist of informal tweets, and hence are intrinsically different, so we ran our own evaluation: 500 randomly selected tweets from the STS corpus, judged by 3 evaluators on (1) the correctness of the extracted entities and (2) the correctness of the entity-concept mappings:

    Extraction Tool  Concepts Extracted  Mapping Accuracy % (Eval. 1 / 2 / 3)
    AlchemyAPI       108                 73.97 / 73.8 / 72.8
    Zemanta          70                  71 / 71.8 / 70.4
    OpenCalais       65                  68 / 69.1 / 68.7

• AlchemyAPI extracted the most concepts and had the highest entity-concept mapping accuracy, so it was chosen to extract the semantic concepts from our three datasets. Entity/concept extraction statistics (AlchemyAPI):

    Dataset   STS    HCR  OMD
    Entities  15139  723  1194
    Concepts  29     17   14
  43. Impact of adding semantic features • Incorporating semantics increases accuracy against the baseline by: – 6.5% for negative sentiment – 4.8% for positive sentiment – F1 = 75.95%, with 77.18% precision and 75.33% recall • OK, but what about such cases? (Slide example: "Destroy Invading Germs", marked Negative, with a Negative concept) • Can semantics help? Saif, H.; He, Y. and Alani, H. Semantic sentiment analysis of Twitter. ISWC 2012, Boston, US.
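The semantic-feature idea from the cited paper can be sketched as augmenting a tweet's tokens with the concepts of its recognised entities, so a classifier can generalise from, say, "iPhone" to PRODUCT; the entity-to-concept map below is invented (the paper used AlchemyAPI):

    # Hypothetical entity -> concept map; a real system would call an
    # entity extraction service such as AlchemyAPI here.
    ENTITY_CONCEPTS = {"iphone": "PRODUCT", "obama": "PERSON"}

    def add_semantic_features(tweet):
        tokens = tweet.lower().split()
        concepts = [ENTITY_CONCEPTS[t] for t in tokens if t in ENTITY_CONCEPTS]
        return " ".join(tokens + concepts)  # concepts become extra features

    print(add_semantic_features("I really love the iPhone"))
    # -> "i really love the iphone PRODUCT"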
  44. Semantic Pattern Approaches • Apply syntactic and semantic processing techniques • Use external semantic resources (e.g. DBpedia, Freebase) to identify semantic concepts in tweets • Extract clusters of similar contextual semantics and sentiment, and use them as patterns in sentiment analysis (example cluster: Threat, Trojan Horse, Hack, Code, Program, Malware, Dangerous, Harm, Spyware)
  45. Tweet-Level Sentiment Analysis. Win/loss in accuracy and F-measure of using different features for sentiment classification, based on 9 Twitter datasets, MaxEnt classifier:

    Features                     Accuracy (Min / Max / Avg)   F-Measure (Min / Max / Avg)
    Syntactic: Twitter Features  -0.23 / 3.91 / 1.24          -0.25 / 4.53 / 1.62
    Syntactic: POS               -0.89 / 2.92 / 0.79          -0.91 / 5.67 / 1.25
    Syntactic: Lexicon           -0.44 / 4.23 / 1.30          -0.38 / 5.81 / 1.83
    Syntactic: Average           -0.52 / 3.69 / 1.11          -0.52 / 5.33 / 1.57
    Semantic: Concepts           -0.22 / 2.76 / 1.20          -0.40 / 4.80 / 1.51
    Semantic: LDA-Topics         -0.47 / 3.37 / 1.20          -0.68 / 6.05 / 1.68
    Semantic: SS-Patterns         0.70 / 9.87 / 3.05           1.23 / 9.78 / 3.76
    Semantic: Average             0.00 / 5.33 / 1.82           0.05 / 6.88 / 2.32

Note: STS-Gold is the only one of the 9 datasets that provides named entities manually annotated with sentiment labels (positive, negative, neutral). Saif, H.; He, Y.; Fernandez, M. and Alani, H. Semantic Patterns for Sentiment Analysis of Twitter. ISWC 2014, Trento, Italy.
  46. Entity-Level Sentiment Analysis [Chart: accuracy and F1, roughly 55-67%, on a gold standard of 58 entities, for Unigrams, LDA-Topics, Semantic Concepts and SS-Patterns] Saif, H.; He, Y.; Fernandez, M. and Alani, H. Semantic Patterns for Sentiment Analysis of Twitter. ISWC 2014, Trento, Italy.
  47. ONLINE ENGAGEMENT ANALYSIS
  48. Different Engagement Patterns: forum on a celebrity vs forum on transport
  49. Different Engagement Parameters
  50. Different Engagement Parameters
  51. … "few people took part" • 309 invitees from media, academia, and public engagement bodies • 2 invitees contributed to the site, with 2 edits!!
  52. Recipe for more engaging posts?
  53. Ask the (Social) Data • What's the model of good/bad tweets? • What features are associated with each group?
  54. Feature Engineering. Properties influencing a post's popularity include attributes of the user - describing the reputation of the user - and attributes of the post's content - generally referred to as content features. Table 1 defines the user and content features whose influence on the discussion "continuation" is studied (a sketch of some of the content features follows below).

User features:
• In Degree: number of followers of U
• Out Degree: number of users U follows
• List Degree: number of lists U appears on (lists group users by topic)
• Post Count: total number of posts the user has ever posted
• User Age: number of minutes from user join date
• Post Rate: posting frequency of the user, PostCount / UserAge

Content features:
• Post Length: length of the post in characters
• Complexity: cumulative entropy of the unique words in post p: (1/λ) Σ_{i∈[1,n]} p_i (log λ - log p_i), with λ the total word length, n the number of unique words, and p_i the frequency of each word
• Uppercase Count: number of uppercase words
• Readability: Gunning fog index, 0.4 (ASL + PCW), using average sentence length (ASL) and the percentage of complex words (PCW)
• Verb Count / Noun Count / Adjective Count: number of verbs, nouns, adjectives
• Referral Count: number of @user mentions
• Time in the day: normalised time in the day, measured in minutes
• Informativeness: terminological novelty of the post w.r.t. other posts, the cumulative TF-IDF value of each term t in post p: Σ_{t∈p} tfidf(t, p)
• Polarity: cumulative polar term weights in p (using the SentiWordNet lexicon), normalised by the polar term count: (Po + Ne) / |terms|

Focus features:
• Topic Entropy: the distribution of the author across community forums
• Topic Likelihood: the likelihood that a user posts in a specific forum given his post history; measures the affinity that a user has with a given forum; lower likelihood indicates a user posting on an unfamiliar topic

Experiments test the performance of different classification models in identifying seed posts, using four classifiers - the discriminative classifiers Perceptron and SVM, the generative classifier Naive Bayes, and the decision-tree classifier J48 - with three feature settings each: user features, content features, and user+content features.
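As referenced above, a sketch of three of the content features, assuming the usual reading of the formulas (word-entropy complexity and Gunning fog 0.4(ASL + PCW)); tokenisation is deliberately naive, and the complex-word test is a crude proxy for the 3+ syllable rule:

    import math
    import re
    from collections import Counter

    def complexity(post):
        # Cumulative word entropy: (1/lam) * sum_i p_i * (log lam - log p_i),
        # with lam the total word count and p_i each unique word's frequency.
        words = post.lower().split()
        lam = len(words)
        if lam == 0:
            return 0.0
        return sum(c * (math.log(lam) - math.log(c))
                   for c in Counter(words).values()) / lam

    def gunning_fog(post):
        # Gunning fog index: 0.4 * (average sentence length + % complex words).
        sentences = [s for s in re.split(r"[.!?]+", post) if s.strip()]
        words = post.split()
        asl = len(words) / max(len(sentences), 1)
        pcw = 100 * sum(1 for w in words if len(w) >= 8) / max(len(words), 1)
        return 0.4 * (asl + pcw)

    def referral_count(post):
        return len(re.findall(r"@\w+", post))  # number of @user mentions

    tweet = "@bob I really love the iPhone. Excellent, remarkable engineering!"
    print(complexity(tweet), gunning_fog(tweet), referral_count(tweet))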
  55. Classification of Posts: Seed Posts vs Non-Seed Posts § Binary classification model § Trained with social, content, and combined features § 80/20 training/testing split § Identify the best feature types, and top individual features, for predicting a post's class
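A sketch of this setup with an 80/20 split; the feature matrix is synthetic noise standing in for the real social and content features, and Gaussian Naive Bayes stands in for the classifiers compared on the previous slide:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import precision_recall_fscore_support

    rng = np.random.default_rng(0)
    X = rng.random((2000, 6))                  # synthetic feature rows
    y = (X[:, 1] + X[:, 4] > 1.0).astype(int)  # 1 = seed (got replies), 0 = non-seed

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = GaussianNB().fit(X_tr, y_tr)
    p, r, f1, _ = precision_recall_fscore_support(
        y_te, clf.predict(X_te), average="binary")
    print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")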
  56. Engagement on Boards.ie • Which posts are more likely to stimulate responses and discussions? • What impacts engagement more: user features, post content, or forum affinity? • Which individual features are most influential?
  57. Top Features for Engagement on Boards.ie • Content features were key! • Best predictions were achieved when combining user, content, and focus features • URLs (referral count) in a post negatively impact discussion activity • Seed posts (posts that receive replies) are associated with greater activity levels and forum likelihood • Lower informativeness is associated with seed posts, i.e. seeds use language that is familiar to the community • Boards.ie does not provide explicit social relations between community members, unlike for example Facebook and Twitter; following the strategy proposed for extracting social networks from Digg, a Boards.ie social network was built with edges weighted cumulatively by the number of replies between any two users.

    Description of the Boards.ie dataset:
    Posts      Seeds   Non-Seeds  Replies    Users
    1,942,030  90,765  21,800     1,829,465  29,908

Rowe, M.; Angeletou, S. and Alani, H. Anticipating discussion activity on community forums. SocialCom 2011, Boston, MA, USA.
  58. Top Features for Engagement on Twitter • Two datasets: tweets relating to the Haiti earthquake disaster (varying timespan), and all tweets published during president Barack Obama's State of the Union Address speech • Goal: predict discussion activity based on the features of a given post, by first identifying seed posts, then predicting the discussion level • Many posts in these datasets are not seeds but replies, featuring in the discussion chain as nodes; discussions are identified via the explicit "in reply to" information from the Twitter API, excluding retweets, since boyd et al. argue that message forwarding adheres to different motives and does not necessarily designate a response to the initial message • Discussions and seed posts are gathered by iteratively moving up the reply chain, from reply to parent post, until the seed post is reached: "dataset enrichment", performed by querying Twitter's REST API with the in_reply_to id of the parent post, one step at a time; the same approach has previously been used to gather a large-scale conversation dataset from Twitter • Haiti: top features are list-degree, in-degree, informativeness, and #posts • Union Address: top features are list-degree, time of posting, in-degree, and #posts • The top-most ranks in each dataset are dominated by user features

    Dataset statistics:
    Dataset        Users   Tweets  Seeds  Non-Seeds  Replies
    Haiti          44,497  65,022  1,405  60,686     2,931
    Union Address  66,300  80,272  7,228  55,169     17,875

[Figure: contributions of top-5 features to identifying non-seeds; upper plots for the Haiti dataset, lower plots for the Union Address dataset] Rowe, M.; Angeletou, S. and Alani, H. Predicting Discussions on the Social Semantic Web. ESWC, Crete, 2011.
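The "dataset enrichment" walk reduces to a short loop; fetch_post below is a hypothetical stand-in for a Twitter REST API status lookup:

    def find_seed(post, fetch_post):
        # Follow in_reply_to links one step at a time up the reply chain.
        while post.get("in_reply_to_id"):
            post = fetch_post(post["in_reply_to_id"])
        return post  # a post with no parent is the discussion's seed

    # Toy usage with an in-memory "API":
    store = {
        1: {"id": 1, "in_reply_to_id": None, "text": "seed"},
        2: {"id": 2, "in_reply_to_id": 1, "text": "reply"},
        3: {"id": 3, "in_reply_to_id": 2, "text": "reply to reply"},
    }
    print(find_seed(store[3], lambda pid: store[pid])["text"])  # -> "seed"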
  59. Top Features for Engagement on Twitter - Earth Hour 2014 [Boxplots comparing negative and positive classes on Length, Complexity, Readability, and Polarity] • Top influential features do not match those found for Boards.ie or for the two non-random Twitter datasets
  60. Top Features for Engagement on Twitter - Dorset Police [Boxplots comparing negative and positive classes on Length, Complexity, Polarity, and Mentions] • The top 4 features share 3 with the Twitter Earth Hour dataset. Fernandez, M.; Cano, E. and Alani, H. Policing Engagement via Social Media. CityLabs workshop, SocInfo, Barcelona, 2014.
  61. Publications about social media, by Katrin Weller - http://kwelle.files.wordpress.com/2014/04/figure1.jpg
  62. Moving on … § How can we move on from these (micro) studies? § Are results consistent across datasets and platforms? § One way forward: multiple platforms, multiple topics
  63. Papers studying single/multiple social media platforms. Survey done on all submitted papers to Web Science conferences
  64. Papers studying single/multiple social media platforms. Survey done on all submitted papers to Web Science conferences
  65. Papers studying single/multiple social media platforms. Survey done on all submitted papers to Web Science conferences
  66. Papers studying single/multiple social media platforms. Survey done on all submitted papers to Web Science conferences
  67. Apples and Oranges • We mix and compare different datasets, topics, and platforms • Aim is to test consistency and transferability of results
  68. 7 datasets from 5 platforms

    Platform                                    Posts      Users    Seeds    Non-seeds  Replies
    Boards.ie                                   6,120,008  65,528   398,508  81,273     5,640,227
    Twitter Random                              1,468,766  753,722  144,709  930,262    390,795
    Twitter (Haiti Earthquake)                  65,022     45,238   1,835    60,686     2,501
    Twitter (Obama State of the Union Address)  81,458     67,417   11,298   56,135     14,025
    SAP                                         427,221    32,926   87,542   7,276      332,403
    Server Fault                                234,790    33,285   65,515   6,447      162,828
    Facebook                                    118,432    4,745    15,296   8,123      95,013

Seed posts are those that receive a reply; non-seed posts are those with no replies.
  69. Data Balancing. For each dataset, an equal number of seed and non-seed posts is used in the analysis (see the sketch below):

    Platform                                    Seeds    Non-seeds  Instance Count
    Boards.ie                                   398,508  81,273     162,546
    Twitter Random                              144,709  930,262    289,418
    Twitter (Haiti Earthquake)                  1,835    60,686     3,670
    Twitter (Obama State of the Union Address)  11,298   56,135     22,596
    SAP                                         87,542   7,276      14,552
    Server Fault                                65,515   6,447      12,894
    Facebook                                    15,296   8,123      16,246
    Total                                                           521,922
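The balancing reduces to undersampling the majority class, which is why each instance count in the table is exactly twice the smaller class (e.g. Boards.ie: 2 x 81,273 = 162,546). A minimal sketch:

    import random

    def balance(seeds, non_seeds, seed=0):
        # Undersample the larger class so both contribute equally.
        rng = random.Random(seed)
        n = min(len(seeds), len(non_seeds))
        return rng.sample(seeds, n) + rng.sample(non_seeds, n)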
  70. Classification Results § Performance of the logistic regression classifier trained over different feature sets and applied to the test set (P / R / F1 per feature set):

    Dataset                     Social               Content              Social+Content
    Boards.ie                   0.592 0.591 0.591    0.664 0.660 0.658    0.670 0.666 0.665
    Twitter Random              0.561 0.561 0.560    0.612 0.612 0.611    0.628 0.628 0.628
    Twitter (Haiti Earthquake)  0.968 0.966 0.966    0.752 0.747 0.747    0.974 0.973 0.973
    Twitter (Union Address)     0.542 0.540 0.539    0.650 0.642 0.639    0.656 0.649 0.646
    SAP                         0.650 0.631 0.628    0.575 0.541 0.521    0.652 0.632 0.629
    Server Fault                0.528 0.380 0.319    0.626 0.380 0.275    0.568 0.407 0.359
    Facebook                    0.635 0.632 0.632    0.641 0.641 0.641    0.660 0.660 0.660
  71. Effect of features on engagement [Charts: logistic regression coefficients (β) for each platform's features - Boards.ie, Twitter Random, Twitter Haiti, Twitter Union, Server Fault, SAP, Facebook - over In-degree, Out-degree, Post Count, Age, Post Rate, Post Length, Referrals Count, Polarity, Complexity, Readability, Readability Fog, Informativeness]
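Per-feature coefficients like those charted here fall out of a fitted logistic regression directly; a sketch on synthetic data (the printed signs are illustrative only):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    features = ["in_degree", "post_rate", "post_length", "referrals", "polarity"]
    rng = np.random.default_rng(1)
    X = rng.random((800, len(features)))
    y = (X[:, 2] - X[:, 3] + 0.1 * rng.standard_normal(800) > 0.0).astype(int)

    model = LogisticRegression().fit(X, y)
    for name, beta in zip(features, model.coef_[0]):
        print(f"{name:12s} beta = {beta:+.2f}")  # sign = direction of effect on seeding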
  72. Comparison to literature § How does the performance of our shared features compare to other studies on different datasets and platforms?
  73. Comparison to literature [Matrix legend: positive impact / negative impact; match / mismatch]
  74. Comparison to literature [Matrix legend: positive impact / negative impact; match / mismatch]
  75. Let's Share More Data!
  76. Semantic Clustering • Statistical models play important roles in social data analyses • Keeping such models up to date often means regular, expensive, and time-consuming retraining • Semantic features are likely to decay more slowly than lexical features • Could adding semantics to the models extend their value and life expectancy? Cano, E.; He, Y. and Alani, H. Stretching the Life of Twitter Classifiers with Time-Stamped Semantic Graphs. ISWC 2014, Trento, Italy.
  77. Semantic Representation of a Tweet. Example DBpedia graph: <dbp:Barack_Obama> rdf:type <dbo:PresidentOfUnitedStateofAmerica>; dcterms:subject <skos:Nobel_Peace_Prize_laureates>; dbprop:nationality "American". <dbp:CNN> dcterms:subject <skos:English-language_television_stations>. <dbp:Hosni_Mubarak> dcterms:subject <skos:PresidentsOfEgypt>. <dbp:Egypt> rdf:type <dbp:Country>; dcterms:subject <skos:Arab_republics>; dbprop:languages <dbp:Egyptian_Arabic>.
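Triples of this shape (rdf:type and dcterms:subject for an entity) can be pulled from DBpedia's public SPARQL endpoint; a sketch using SPARQLWrapper, not the paper's code:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX dct: <http://purl.org/dc/terms/>
        SELECT ?p ?o WHERE {
          <http://dbpedia.org/resource/Barack_Obama> ?p ?o .
          FILTER (?p IN (rdf:type, dct:subject))
        }
        LIMIT 20
    """)
    sparql.setReturnFormat(JSON)
    for b in sparql.query().convert()["results"]["bindings"]:
        print(b["p"]["value"], "->", b["o"]["value"])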
  78. Evolution of Semantics • Renewed DBpedia graph snapshots are taken over time (v3.6 → v3.7 → v3.8) • Semantic features are updated based on new knowledge in DBpedia, e.g. <Barack_Obama> gains links over successive versions: wikiPageWikiLink <Budget_Control_Act_of_2011>, wikiPageWikiLink <UnitedStatesPresidentialCandidates>, birthPlace <Hawaii>, spouse <Michelle_Obama>
  79. Experiments: extending the fitness of a model to subsequent epochs • 12,000 annotated tweets • Adding Classes as clustering features provides the best cross-epoch performance

    Cross-epoch (F1)  2010→2011  2010→2013  2011→2013  Average
    BoW               0.634      0.481      0.261      0.458
    Category          0.683      0.539      0.524      0.582
    Property          0.665      0.557      0.502      0.603
    Resource          0.774      0.544      0.445      0.587
    Class             0.691      0.665      0.669      0.675

    Same-epoch (F1)   2010→2010  2011→2011  Average
    BoW               0.831      0.875      0.845
  80. APPLICATIONS
  81. What do policymakers really want from social media? (Interviews with 31 policymakers) 1. "Fish where the fish is" – one interface to access multiple SNS – layman monitoring of users and topics 2. "My constituency first" – communicating with users in own constituency – find local groups, events, and topics 3. "What are their needs, complaints, and preferences?" – what citizens talk about, complain about – what are the top 5-10 topics of the day 4. Who should I talk to? – who are the influential citizens – whom to engage with 5. What about tomorrow? – which topics will get hotter? – which discussions are likely to grow further? 6. Presence and popularity – what writing recipe to follow to reach more people 7. Privacy – concerns on citizens' privacy when extracting info – concerns on their own privacy with 3rd party SNS access tools
  82. Wandhöfer, T.; Taylor, S.; Alani, H.; Joshi, S.; Sizov, S. et al. Engaging politicians with citizens on social networking sites: the WeGov Toolbox. IJEGR, 8(3), 2012.
  83. Monitoring SCN. Monitoring of the evolution of community activities and level of contributions in SAP Community Network (SCN). Demo.
  84. SCN Behaviour. Community managers can monitor the behaviour composition of forums, and its association with activity evolution.
  85. For Education https://twitter.com/OpenUniversity/status/346911297704714240
  86. FB Groups [Dashboard: Sentiment, Macro Behaviour, Micro Behaviour, Topics]
  87. Course tutors: real-time monitoring, behaviour analysis, sentiment analysis, topic analysis • How active and engaged is the course group? • How is sentiment towards a course evolving? • Are the leaders of the group providing positive/negative comments? • What topics are emerging? • Is the group flourishing or diminishing? • Do students get the answers and support they need? Thomas, K.; Fernández, M.; Brown, S. and Alani, H. OUSocial2: a platform for gathering students' feedback from social media. (Demo) ISWC 2014, Trento, Italy.
  88. DEMO
  89. Thanks to collaborators
  90. Thanks to .. Hassan Saif, Lara Piccolo, Thomas Dickensen, Gregoire Burel, Miriam Fernandez, Smitashree Choudhury, Elizabeth Cano, Matthew Rowe, Keerthi Thomas, Sofia Angeletou
  91. Heads-up • Semantic Patterns for Sentiment Analysis of Twitter - Thursday 15:40, Session: Social Media • Semantic Patterns for Sentiment Analysis of Twitter - Thursday 16:00, Session: Social Media • User Profile Modeling in Online Communities - Sunday 2:05 pm, SWCS Workshop • OUSocial2: a platform for gathering students' feedback from social media (DEMO) • The Topics they are a-Changing - Characterising Topics with Time-Stamped Semantic Graphs (POSTER) • Automatic Stopword Generation using Contextual Semantics for Sentiment Analysis of Twitter (POSTER)
