Measuring the Topical Specificity of Online Communities


Presentation from the Extended Semantic Web Conference 2013


Measuring the Topical Specificity of Online Communities

2. Why measure the topical specificity of online communities?
3. Sub-Community Creation
- [Belak et al. 2011] identified community (topical) drift, and we found the same
- Understanding topical specificity is important for:
  - Tracking community focus
  - Suggesting new communities
[Figure: cumulative distributions of JS-divergence for high- and low-specificity communities]
4. Theory of Attention Dynamics
- From Wagner, Rowe, Strohmaier and Alani, "Ignorance isn't Bliss: An Empirical Analysis of Attention Patterns in Online Communities":
  - The purpose of a community, as well as the specificity of its topic, impacts which factors drive its reply behaviour
  - Communities around very specific topics require posts to fit the topical focus of the community in order to attract attention, while communities around more general topics do not have this requirement
  - The factors that impact the start of discussions often differ from the factors that impact the length of discussions
- Community specificity has a bearing on attention dynamics
5. Community Recommendation
- Users interested in a new topic often visit online communities
- Recommending a highly specific community could overwhelm the user (nuanced language, expert terms)
- Need a comparative assessment of specificity between communities
6. Can we empirically characterise how specific a given community is based on what its users discuss?
7. What do we mean by 'specificity'?
- We interpret a community forum's specificity in relation to its parent…
8-11. Measuring Topical Specificity: Our Approach (pipeline, built up over four slides)
1. Retrieve community posts P published in the time window [t, t']
2. Derive the community concept model A
3. Select a concept c using a composite function
4. Measure the specificity of the selected concept
12. Extracting Concepts and Concept Models
- Given: posts P published in forum f within [t, t']
  - Extract entities from each post's content using Zemanta
  - Get the concept that each entity is:
    a) an instance of: <entity rdf:type class>
    b) in the category of: <entity dcterms:subject category>
- To build the concept model for the forum, record the frequency of each concept's occurrence across the post contents. Processing each post content s in S_f^[t,t'] with a concept extraction tool g(s) returns the set of concepts related to s; the concept model is derived as:

  A_f^[t,t'][c_i] = |{c_i : c_i in g(s), s in S_f^[t,t']}|   (2)
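Equation 2 amounts to counting concept occurrences over the posts in the window. A minimal sketch (not the authors' code; the Zemanta extraction step is replaced here by hypothetical pre-extracted concept sets):

```python
from collections import Counter

def concept_model(posts_concepts):
    """Eq. 2: map each concept to its occurrence frequency across posts."""
    model = Counter()
    for concepts in posts_concepts:
        model.update(concepts)
    return model

# Hypothetical output of the extraction step for three posts in one forum:
# each set holds the concepts (types/categories) of the entities in a post.
posts_concepts = [
    {"dbo:Sport", "dbo:GolfCourse"},
    {"dbo:Sport"},
    {"dbo:GolfCourse", "dbo:Person"},
]
```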
13. Selecting Concepts using Composite Functions: Concept Frequency
- Choose the most frequently cited concept in the forum (applied to both types and categories)
14. Selecting Concepts using Composite Functions: Concept Frequency-Inverse Forum Frequency (CF-IFF)
- Measures how unique a concept is to the forum: a high CF-IFF means the concept is unique to the forum, given the other forums
- A modification of the Term Frequency-Inverse Document Frequency measure used for term indexation
- We choose the concept that maximises CF-IFF; the abstraction of this concept is then measured, and the reciprocal taken as the specificity of the forum:

  cf-iff(c, f, F) = ( A_f[c] / max{A_f[c'] : c' in A_f} ) * log( |F| / |{f in F : c in A_f}| )   (3)

- The first factor is the normalised frequency of the concept within the forum; the second reflects how common or rare the concept is across the forums
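The CF-IFF function (Eq. 3) can be sketched as follows, assuming each forum's concept model is a frequency map as in Eq. 2; the forum names and sample counts are illustrative:

```python
import math
from collections import Counter

def cf_iff(concept, forum, forum_models):
    """Eq. 3: normalised in-forum frequency times log inverse forum frequency.
    forum_models: {forum id: Counter of concept frequencies}."""
    model = forum_models[forum]
    cf = model[concept] / max(model.values())
    forums_with_c = sum(1 for m in forum_models.values() if concept in m)
    return cf * math.log(len(forum_models) / forums_with_c)

def select_concept(forum, forum_models):
    """Composite function: pick the concept maximising CF-IFF."""
    return max(forum_models[forum], key=lambda c: cf_iff(c, forum, forum_models))

# Hypothetical concept models for two forums
forum_models = {
    "golf": Counter({"Golf": 10, "Sport": 5}),
    "soccer": Counter({"Soccer": 8, "Sport": 6}),
}
```

'Sport' appears in both forums, so its inverse-forum-frequency term is log(2/2) = 0, while 'Golf' is unique to the golf forum and is the concept selected for it.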
15. Measuring Specificity
- Given a concept c selected using a composite function, how can we measure its specificity?
- Solution: use information-theoretic measures of abstraction
  - Abstraction of concept c: a(c)
  - Specificity of concept c: 1/a(c)
- We examine five measures of abstraction…
16. Measuring Abstraction: Network Entropy
- Premise: a more abstract tag co-occurs with many other tags [Benz et al. 2011], producing higher entropy, i.e. more uncertainty associated with the term
- We adapt this notion for concepts. Let G = {V, E, L} denote a concept network, where c in V is a concept node, e_cc' in E is an edge connecting c and c', and lb(e_cc') in L is the edge's label, i.e. the predicate associating c with c'
- The weight of the relation between concepts c and c' is the number of times they are connected in the graph: w(c, c') = |{e_cc' in E}|
- The conditional probability of c' appearing with c, where ego(c) denotes the ego network of c (the triples in its immediate vicinity):

  p(c'|c) = w(c, c') / sum_{c'' in ego(c)} w(c, c'')   (4)

- The network entropy of c is then:

  H(c) = - sum_{c' in ego(c)} p(c'|c) log p(c'|c)   (5)
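Equations 4 and 5 can be sketched directly from a list of concept-graph edges, where edge multiplicity supplies the co-occurrence weight; this is an illustrative reading, not the authors' implementation:

```python
import math
from collections import Counter

def network_entropy(concept, edges):
    """H(c) over the ego network of `concept`; `edges` is a list of
    (c, c') tuples whose multiplicity gives the weight w(c, c')."""
    weights = Counter()
    for a, b in edges:
        if a == concept:
            weights[b] += 1
        elif b == concept:
            weights[a] += 1
    total = sum(weights.values())
    # Eq. 4: p(c'|c) = w(c, c') / total ego weight; Eq. 5: Shannon entropy
    return -sum((w / total) * math.log(w / total) for w in weights.values())
```

A concept connected uniformly to many neighbours (more abstract, on this premise) scores higher than one whose links concentrate on a few neighbours.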
17. Measuring Abstraction: Network Centrality
- Premise: the more central a concept node is in the network, the greater its abstraction (given the increased information flow through the node)
- We gauge centrality using two measures:
  - Degree Centrality: the number of connections from a concept node divided by the vertex set size
  - Eigenvector Centrality: the position of the concept node based on the eigenstructure of the concept network
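Both centrality measures can be sketched on a simple adjacency-dict representation of the concept graph. The eigenvector variant below uses plain power iteration with an identity shift so it also converges on bipartite graphs; the real DBPedia graphs are of course far larger, and this is not the authors' implementation:

```python
def degree_centrality(adj):
    """Degree centrality: connections from a node divided by the vertex set size."""
    n = len(adj)
    return {v: len(nbrs) / n for v, nbrs in adj.items()}

def eigenvector_centrality(adj, iterations=100):
    """Eigenvector centrality by power iteration on {node: set(neighbours)}.
    The +score[v] term (an A + I shift) keeps the iteration stable on
    bipartite graphs; scores are max-normalised each round."""
    score = {v: 1.0 for v in adj}
    for _ in range(iterations):
        new = {v: score[v] + sum(score[u] for u in adj[v]) for v in adj}
        top = max(new.values())
        score = {v: s / top for v, s in new.items()}
    return score

# A tiny hypothetical concept graph: 'Sport' is the hub, hence most abstract
adj = {
    "Sport": {"Golf", "Soccer", "Rugby"},
    "Golf": {"Sport"},
    "Soccer": {"Sport"},
    "Rugby": {"Sport"},
}
```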
18. Measuring Abstraction: Statistical Subsumption
- Premise: the generality of a concept can be measured through the number of concepts that it is broader than [Schmitz et al. 2006]
- Graph semantics are used to count the specialisations/narrowings of a concept (we use DBPedia datasets as our concept graphs, explained in the following section):

  SUB(c) = |{c' : c' in V, e_cc' in E, lb(e_cc') in {skos:narrower, rdfs:subClassOf}}|   (8)

- An edge counts if the label of its predicate denotes a specialisation/narrowing
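A sketch of Eq. 8 over labelled triples. Note the direction handling is an assumption here: skos:narrower points from the broader concept to the narrower one, while rdfs:subClassOf points from the subclass to its superclass:

```python
def statistical_subsumption(concept, triples):
    """Eq. 8: count the concepts that `concept` is broader than.
    <c skos:narrower c'> and <c' rdfs:subClassOf c> both mark c' as a
    specialisation of c (interpretation assumed)."""
    narrower = {o for s, p, o in triples if s == concept and p == "skos:narrower"}
    subclasses = {s for s, p, o in triples if o == concept and p == "rdfs:subClassOf"}
    return len(narrower | subclasses)
```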
19. Measuring Abstraction: Key Player Problem
- Premise: a concept node's position as a key player in the network topology can be used to gauge its abstraction [Navigli & Lapata 2010], by measuring its importance for information flow through the network
- To derive the Key Player Problem measure:
  - Measure the shortest distance from the concept to every other concept
  - Take the sum of the reciprocals of these distances
  - Normalise the sum by the vertex set size
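The three steps above can be sketched with a breadth-first search over an unweighted concept graph (an assumed reading of the measure; the original normalisation details may differ):

```python
from collections import deque

def key_player_problem(concept, adj):
    """Sum of reciprocal shortest-path distances from `concept` to every
    other node, normalised by the vertex set size."""
    dist = {concept: 0}
    queue = deque([concept])
    while queue:                       # BFS gives unweighted shortest distances
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return sum(1 / d for node, d in dist.items() if node != concept) / len(adj)
```

On a path graph A-B-C, the middle node B scores highest, matching the intuition that central nodes matter most for information flow.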
20-24. Approach Evaluation (pipeline, built up over five slides)
- Pipeline: retrieve community posts in [t, t'] -> derive community concept model -> select concept using a composite function -> measure concept specificity -> rank forums by specificity values -> generate a ground truth rank -> compare the predicted rank (d-hat) with the ground truth (d)
- We test combinations (20 in total) of:
  - Composite functions (2)
  - Abstraction measures (5)
  - Concept graphs (2)
25. Experimental Setup
- Dataset: Irish community message board; 230 forums selected (low-activity forums filtered out)
- Selecting the window of analysis: start date = 23/04/2005, width = 1 week
- Post densities within k-week windows were all normally distributed, with variance in their tails and skews; to select the most stable distribution of posts across forums we measured the kurtosis and skewness of each window size's distribution and chose the width that minimised these measures: 1 week
- This time period gives reduced variation in the forum post distribution, and therefore a stable picture of community activity with no large fluctuations
[Fig. 1: (a) boxplot of posts-per-day density in 2005; (b) kurtosis and skewness of posts-per-forum density distributions for increasing week windows from 23/3/2005]
26. Experimental Setup: Concept Graphs
- Concept models come in two flavours:
  a) Entity types (classes that an entity is an instance of)
  b) Categories (SKOS categories that the entity belongs to)
- Experiments use two concept graph types:
  a) Type Graph: DBPedia Ontology Graph (nodes: classes; edges: ontological relations)
  b) Category Graph: DBPedia Category Structure (nodes: categories; edges: SKOS relations, i.e. broader and narrower)
27. Experimental Setup: Evaluation Measures
- Compare the predicted rank against the ground truth rank
- Measure rank quality using:
  - Kendall Tau-b (1 is better): difference in the number of concordant and discordant pairs
  - Impurity@k (0 is better): distance from each wrongly positioned forum to its true position; measured for k = {1, 5, 10, 20, 50, 100} and averaged
- Each combination of composite function, abstraction measure, and concept graph produces a predicted rank d-hat, ordering forums from most specific to most general, which is compared against a ground truth rank d
- The ground truth rank is derived from the hierarchical structure of, which allows a forum to be declared a parent or child of another forum, creating a nested structure with three levels: 1 is most specific, 3 is most general, and 2 is in between
- We refer to this setting as level-based ranking, as each model returns a level ordering via a label function l(.)

Table 1. Example rankings of forums in two predicted ranks, from hypothetical models M1 and M2, together with the ground truth (GT). The label function l(.) returns the forum's level from the ground truth; the evaluation measures take the ordered levels as input.

  Rank | GT d (level) | M1 d-hat1 (level) | M2 d-hat2 (level)
  1    | a (1)        | c (2)             | a (1)
  2    | b (1)        | d (2)             | b (1)
  3    | c (2)        | g (3)             | c (2)
  4    | d (2)        | h (3)             | d (2)
  5    | e (2)        | a (1)             | f (2)
  6    | f (2)        | e (2)             | g (3)
  7    | g (3)        | i (3)             | e (2)
  8    | h (3)        | b (1)             | h (3)
  9    | i (3)        | j (3)             | i (3)
  10   | j (3)        | f (2)             | j (3)
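Kendall Tau-b with tie correction can be computed directly from the level orderings in Table 1. The sketch below (illustrative, O(n^2), fine for short ranks) reproduces the intuition that M2 is a far better ranking than M1:

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall Tau-b between two equal-length orderings, with tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    n0 = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / sqrt((n0 - ties_x) * (n0 - ties_y))

# Level orderings from Table 1
gt = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
m1 = [2, 2, 3, 3, 1, 2, 3, 1, 3, 2]
m2 = [1, 1, 2, 2, 2, 3, 2, 3, 3, 3]
```

kendall_tau_b(gt, m2) gives 0.75 versus 0.125 for M1, matching the table's intent that M2 is the better ranking.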
28. Experiments: Results
- Best model: Eigenvector Centrality with the Type Graph
  - For the full rank: with Concept Frequency
  - For top-k ranks: with CF-IFF
- The Type graph outperforms the Category graph
- One exception: the Type graph with Concept Frequency and Eigenvector Centrality does slightly worse than the random baseline when focusing on top-k ranks
[Fig. 2: Kendall Tau-b and average Impurity@k for each abstraction measure, under Concept Frequency and CF-IFF, for Types and Categories; lower Impurity is better; compared against a random-model baseline]
29. Evaluation: Qualitative Insights
- 'Discworld' appears at the top of both abstraction measures when using Concept Frequency: the concept selected from this forum has the same specificity level under both measures
- 'Subscribers', despite being a mid-level forum, appears towards the top of each abstraction measure when using CF-IFF: a concept unique to this forum is selected by the function, with a similar specificity level across the measures
- Overall: despite the composite functions selecting the same concept to measure, the measures in general produce different rankings based on the concept's network position

Table 2. Forum rankings using the Type Graph and different combinations of composite functions and abstraction measures. The integers in parentheses give the forum level: 1 = most specific, 3 = most general.

  Concept Frequency                                | CF-IFF
  Network Entropy          | Eigenv' Cent'         | Network Entropy         | Eigenv' Cent'
  Discworld (1)            | Discworld (1)         | Languages (1)           | Magic the Gathering (1)
  The Cuckoo's Nest (2)    | Angling (2)           | Hunting (1)             | Subscribers (2)
  Models (2)               | Paganism (1)          | File Exchange (2)       | Unreal (2)
  Slydice Specials (1)     | Feedback (2)          | Game Threads (1)        | LAN Parties (2)
  Battlestar Galactica (1) | Personal Issues (2)   | Magic the Gathering (1) | World of Warcraft (1)
  FS Motors (1)            | Mythology (2)         | Bangbus (1)             | Role Playing (2)
  Gadgets (1)              | Films (1)             | Biology & Medicine (2)  | Midwest (2)
  FS Music Equipment (1)   | Business Managem' (1) | Snooker & Pool (2)      | Game Threads (1)
  Pro Evolution Soccer (2) | Xbox (1)              | Subscribers (2)         | GAA (2)
  Call of Duty (2)         | Help Desk (2)         | HE Video Players (1)    | Midlands (2)
  Anime & Manga (2)        | DIT (2)               | Discworld (1)           | Discworld (1)
30. Conclusions
- Presented an approach to measure the topical specificity of online community forums:
  - Extracted entities from forums and recorded concept frequencies
  - Selected concepts to measure using composite functions
  - Measured concept specificity using information-theoretic measures
- Findings showed:
  a) The Type graph outperforms the category structures
  b) Eigenvector centrality was the best measure
  c) The composite functions differ: CF-IFF is best for top ranks, Concept Frequency for the full rank
31. Current and Future Work
- Community tracking:
  - Grouped high- and low-specificity communities
  - Found significantly greater semantic divergence in low-specificity communities
- Attention Dynamics Theory:
  - Plan to group communities by topical specificity
  - Examine how attention dynamics differ
  - Learn heuristics for dynamic model adaptation

Poster: "The Semantic Evolution of General and Specific Communities", Matthew Rowe (School of Computing and Communications, Lancaster University, UK) and Claudia Wagner (Institute for Information and Communication Technologies, JOANNEUM Research, Graz, Austria).

Motivation:
- Past work [1] examined community lifecycle events (creation, growth, etc.), eschewing topical dynamics
- Our prior work [2] found a link between the topical focus of a community and its attention dynamics, e.g. the 'Golf' community forum required posted content to exactly match the community
- Changes in a community's topics require content injection models to be re-trained: an expensive, time-consuming process, often performed offline
- Our solution: measure community specificity, then examine how community evolution relates to specificity

Measuring Topical Specificity:
- Dataset: 230 community forums from; all posts published within a 1-week window from 23/03/2005
- Extracted entities from posts using Zemanta; identified the type of each entity using the DBPedia ontology (<entity> rdf:type <class>)
- Measured the specificity of each forum [3] using combinations of class abstraction measures (e.g. network entropy) and composite functions (e.g. most specific concept)
- Compared performance against the Knuth Shuffle (random model); evaluated using Kendall Tau-b and Max-Levels (a proposed measure judging the maximal potential outlier level)
- Best model: Eigenvector Centrality with Concept Frequency

Semantic Evolution:
- Divided forums into 10 equal-frequency bins based on specificity (measured with Eigenvector Centrality and Concept Frequency); high-specificity forums are members of the top bin, low-specificity forums of the bottom bin
- For these forums, randomly selected four 1-week periods, derived concepts over each period, and built concept vectors {c1, c2, c3, c4}
- Calculated cosine similarity between consecutive concept vectors
- Examined the difference between the cosine similarity distributions of high and low forums: low = 0.783, high = 0.854; p < 0.05 with a two-tailed Student's t-test
- The result indicates that general communities exhibit greater semantic drift than topically specific communities

References:
[1] Q. Hong, S. Kim, S. C. Cheung, and C. Bird. Understanding a developer social network and its evolution. In Proceedings of the 27th IEEE International Conference on Software Maintenance (ICSM '11), pages 323-332, Washington, DC, USA (2011).
[2] C. Wagner, M. Rowe, M. Strohmaier, and H. Alani. Ignorance isn't bliss: an empirical analysis of attention patterns in online communities. In ASE Conference on Social Computing (2012).
[3] M. Rowe, C. Wagner, M. Strohmaier, and H. Alani. Measuring the Topical Specificity of Online Communities. In Proceedings of the Extended Semantic Web Conference, Montpellier, France (2013).
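The drift measurement above reduces to cosine similarity between consecutive concept frequency vectors; a minimal sketch, with Counters standing in for the derived concept vectors:

```python
import math
from collections import Counter

def cosine_similarity(v1, v2):
    """Cosine similarity between two concept frequency vectors (Counters)."""
    dot = sum(v1[c] * v2[c] for c in set(v1) & set(v2))
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2)
```

Consecutive windows sharing most concepts score near 1, while disjoint windows score 0, indicating semantic drift.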
32. Measuring the Topical Specificity of Online Communities