• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
The Semantic Evolution of Online Communities
 

The Semantic Evolution of Online Communities

on

  • 681 views

World Wide Web Conference 2014

World Wide Web Conference 2014

Statistics

Views

Total Views
681
Views on SlideShare
661
Embed Views
20

Actions

Likes
1
Downloads
5
Comments
0

1 Embed 20

https://twitter.com 20

Accessibility

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    The Semantic Evolution of Online Communities The Semantic Evolution of Online Communities Presentation Transcript

    • THE SEMANTIC EVOLUTION OF ONLINE COMMUNITIES MATTHEW ROWE1 AND MARKUS STROHMAIER2 1. LANCASTER UNIVERSITY, LANCASTER, UK @MROWEBOT | M.ROWE@LANCASTER.AC.UK 2. UNIVERSITY OF KOBLENZ AND GESIS, COLOGNE, GERMANY @MSTROHM | MARKUS.STROHMAIER@GESIS.ORG World Wide Web Conference 2014 Seoul, South Korea
    • Studies of Online Community Evolution The Semantic Evolution of Online Communities 1 ¨  Prior work has examined online community development based on: ¤  Social network properties n  (Gong et al., 2012), (Mislove et al., 2007) ¤  Social group formation n  (Backstrom et al., 2006) ¤  Lexical term usage and uptake n  (Danescu et al., 2013) Organic fixie Pitchfork, fingerstache fashion axe 8-bit ethical. Neutra shabby chic brunch, mustache vegan twee typewriter dreamcatcher try-hard organic church-key! subsequent interpretation, of churne nities. Churners present a serious managers and hosts as the leaving of a detrimental effect on the communi a question-answering community ca unanswered queries). In this section we define churn classification task and use the previo tors of lifecycle trajectories to predi churner or not. As we confine user the start of their lifecycle to the end mined from this period to characteri We define churners as any user who before the final 10% of the time w cutoff points are: 2012-07-09 for Fac SAP, and 2010-12-23 for ServerFaul following form: D = {(xi, yi)}, whe label of the user from one of two while xi denotes an 11-element R-va either a Facebook or SAP user, and = {w1,w2,…,wn}
    • The Semantic Evolution of Online Communities 2 Work has yet to examine the semantic evolution of online communities 1.  Understand how semantic concepts emerge 2.  Model the development of semantic structures over time 3.  Examine how communities differ from one another in their evolution
    • Assessing Semantic Evolution The Semantic Evolution of Online Communities 3 Define: Community Semantics! Entities, and their classes, discussed within a community over a given time-period, and the structure connecting those concepts! Time …
    • Assessing Semantic Evolution The Semantic Evolution of Online Communities 4 ¨  Assessing community semantics enables: 1.  Characterisation of semantic evolution dynamics 2.  Comparison based on community evolution 3.  Forecasting churn rates from evolution signals Our Contributions Define: Community Semantics! Entities, and their classes, discussed within a community over a given time-period, and the structure connecting those concepts!
    • The Semantic Evolution of Online Communities Approach: Examining Semantic Evolution5 Retrieve Posts and Extract Entities Construct Time- delimited Semantic Graphs Inspect Macro Evolution Induce Community- Specific Evolution Models Apply Mined Semantic Evolution Dynamics We have two flavours: (i)  concept graphs (ii)  entity graphs Applying graph measures over cumulative time-sensitive graphs Inducing logistic population models for each community and graph measure Using model dynamics as community motifs to: (i)  inspect communities (ii)  predict churn rate Experiment Dataset: Boards.ie All posts during 2005-2008 93 Communities: online forums
    • Extracting Entities from Online Community Posts The Semantic Evolution of Online Communities 6 ¨  We are provided with a set of post quadruples: ¨  We constrain content to within a given interval [t’,t’’) ¨  Extract each post’s entities within the time interval using TextRazor ¤  No quota limit, and good prior performance (Derczynski et al., 2013) ¨  Result: time-sensitive community-discussed entities (DBPedia URIs) Retrieve Posts and Extract Entities specialised topics of discussion over A. Posts are provided as a set of quadruples <u, s, t, f> 2 P, where user u posted message s at time t in forum f. A message (s) is com- posed of terms that we use to build the semantic models for individual communities. The information discussed in a community, and thus its semantics, can change and alter over time, therefore we constrain a community’s model to specific time snapshots - e.g. t0 ! t00 where t0 < t00 - for this we use the following construct that filters through all relevant posts’ contents within the allotted time window: St0 t00 = {s : <u, s, t, f> 2 P, t0  t < t00 } (1) Information discussed within online communities can be represented in terms of its semantics, using information from either the schema-level (i.e. ontological classes and relations between them) or the data level (i.e. using entities and how they are related to one another). For the former we consider concepts to be classes found within the DBPedia Ontology, that is: the types of entities that users are discussing (e.g. people, locations, etc.), while for the latter case we con- Pa ide ite rea for {c, riv fro 3.2 A wh con fro as - w ing pro NITIES WITH SEMANTIC GRAPHS For our experiments we used data from the Irish commu- nity message board Boards.ie.2 This is a general-discussion community message board that includes a set of hierarchi- cally nested forums (F) in which posts are made - i.e. fo- rum A can be a parent of forum B, and thus B contains specialised topics of discussion over A. Posts are provided as a set of quadruples <u, s, t, f> 2 P, where user u posted message s at time t in forum f. A message (s) is com- posed of terms that we use to build the semantic models for individual communities. The information discussed in a community, and thus its semantics, can change and alter over time, therefore we constrain a community’s model to specific time snapshots - e.g. t0 ! t00 where t0 < t00 - for this we use the following construct that filters through all relevant posts’ contents within the allotted time window: St0 t00 = {s : <u, s, t, f> 2 P, t0  t < t00 } (1) In or the se that conne such (i.e. w Path ident iterat reach forme {c, p, rived from 3.2 An User u posted content s at time t in forum f
    • Building Semantic Resource Graphs The Semantic Evolution of Online Communities 7 1.  Concept graphs ¤  Nodes: Union of rootpaths from each entity’s class to owl:Thing class ¤  Edges: Between nodes within the DBPedia Ontology 2.  Entity graphs ¤  Nodes: Entities provided by time-specific post extractions ¤  Edges: Relations between entity pairs within the DBPedia linked data graph Construct Time- delimited Semantic Graphs e - i.e. fo- B contains e provided er u posted s) is com- tic models iscussed in e and alter s model to < t00 - for through all window: } (1) ties can be mation from d relations es and how we consider Ontology, ussing (e.g. se we con- (e.g. dbpe- ents, St0 t00 , such concepts, and derive a fully connected concept graph (i.e. with no disconnected components) we extract the Root Path Graph as follows: From each concept (c 2 RC ) we identify the parent concept (<c rdfs:subClassOf p>) and iteratively move up the concept graph until the root node is reached (owl:Thing), thereby returning a set of nodes that formed the path from c to the root node: rootpath(c) = {c, p, ...,owl:Thing}. The graph’s vertices are therefore de- rived by taking the union of all paths to the root returned from each concept within the seed set: Vf = [ c2RC rootpath(c) (3) 3.2 Entity Graphs An entity graph (GE) is a type of semantic resource graph where the vertices (V ) are entities and the set of edges (E) connecting these entities are relations between them derived from the Web of Linked Data. We define the entity graph as GE[f, t0 , t00 ] = hVf , Ef i, such that GE[f, t0 , t00 ] ⇢ Gentity - where Gentity denotes the DBPedia entity graph contain- ing relations between entities at the data level. As we are provided with a collection of time-delimited entities RE for a given forum, we query DBPedia for links between entity pairs and add such edges to the graph where such a link exists: nd char- Google+. cial net- oo! An- node ar- ng times ed term on how lthough cial net- rs evolve MMU- HS commu- scussion ierarchi- i.e. fo- contains provided u posted is com- resource URIs, is typed according to one or more classes from the DBPedia ontology.3 Therefore to construct the set of concepts that are cited within a given forum f over the allotted time period we retrieve the classes that each en- tity is a type of and store these in the following set: RC . From this set we generate a time-dependent concept graph: GC [f, t0 , t00 ] = hVf , Ef i, such that GC [f, t0 , t00 ] ⇢ Gtype - where Gtype denotes the DBPedia type graph formed from the class structure of the DBPedia ontology. In this con- text the set of concepts denotes the seed set and is used to populate the vertices in the graph and then construct edges between the vertices based on existing links between the concepts in the DBPedia type graph (Gtype): Ef = {(ci, cj) : ci, cj 2 Vf , (ci, cj) 2 Etype} (2) In order to derive the set of vertices we must consider how the seed set can be used for this process as it is often the case that the set is comprised of concepts which are not directly connected to one another in the concept graph. To connect such concepts, and derive a fully connected concept graph (i.e. with no disconnected components) we extract the Root Path Graph as follows: From each concept (c 2 RC ) we identify the parent concept (<c rdfs:subClassOf p>) and iteratively move up the concept graph until the root node is c models cussed in and alter model to < t00 - for rough all window: (1) es can be tion from relations and how e consider Ontology, sing (e.g. we con- .g. dbpe- nts, St0 t00 , a time pe- t0 t00 using of entities returned reached (owl:Thing), thereby returning a set of nodes that formed the path from c to the root node: rootpath(c) = {c, p, ...,owl:Thing}. The graph’s vertices are therefore de- rived by taking the union of all paths to the root returned from each concept within the seed set: Vf = [ c2RC rootpath(c) (3) 3.2 Entity Graphs An entity graph (GE) is a type of semantic resource graph where the vertices (V ) are entities and the set of edges (E) connecting these entities are relations between them derived from the Web of Linked Data. We define the entity graph as GE[f, t0 , t00 ] = hVf , Ef i, such that GE[f, t0 , t00 ] ⇢ Gentity - where Gentity denotes the DBPedia entity graph contain- ing relations between entities at the data level. As we are provided with a collection of time-delimited entities RE for a given forum, we query DBPedia for links between entity pairs and add such edges to the graph where such a link exists: Ef = {(ri, rj) : ri, rj 2 Vf , (ri, rj) 2 Eentity} (4) Given this edge construction mechanism we only look for relations one-hop away in the entity graph, that is: given
    • Measuring Graph Dynamics The Semantic Evolution of Online Communities 8 1.  Node Count: size of the graph 2.  Diameter: breadth of the graph 3.  Specialisation Count: class specialisations 4.  Graph Entropy: density of the graph 5.  Clustering Coefficient: cliquishness of the graph Inspect Macro Evolution Computation of the measures is described within the paper
    • The Semantic Evolution of Online Communities 9 Macro Evolution Cumulative Time Interval Entropy of the Concept Graph Showing the mean graph entropy across all communities and the 95% Confidence Interval 0 20 40 60 80 100 120 150200250300 Timestep (a) Node Count 0 20 40 60 80 100 120 4.04.55.05.5 Timestep H(G) (b) Graph Entropy 0 20 40 60 80 100 150200250 Timestep Specialisations (c) Specialisations : Concept graphs’ evolution based on node counts, graph entropy and spec ty of the graph grows as more entities are of a given online community based on Inspect Macro Evolution
    • Inspect Macro Evolution The Semantic Evolution of Online Communities 10 Macro Evolution: Concept Graphs 0 20 40 60 80 100 120 150200250300 Timestep |V| (a) Node Count 0 20 40 60 80 100 120 4.04.55.05.5 Timestep H(G) (b) Graph Entropy 0 20 40 60 80 100 120 150200250 Timestep Specialisations (c) Specialisations Figure 1: Concept graphs’ evolution based on node counts, graph entropy and specialisations. ing that the density of the graph grows as more entities are added and thus more connections are possible between them. Summary: We can summarise the following salient find- ings: (i) for concept graphs: node count, specialisation count and density (graph entropy) tend to converge to limit; (ii) for entity graphs, the diameter, graph entropy and cluster- ing coe cient tend to converge to a limit, while the node count (number of entities) increases linearly; (iii) despite new entities arriving at a constant linear rate, on average, of a given online community based on a single graph mea- sure. We exclude the derivation of the equations from the paper, but it is su cient to conclude that given |T| time steps we would have a single equation for each time step (t 2 T): Rtr 1 + PtE 1 = 1. We can then solve for the unknown variables r and E using the QR-decomposition of a matrix: expressing the lefthand side of the simultaneous equations as a |T| ⇥ 2 matrix and the righthand side as a |T|-element vector where each element is 1. We induced logistic population models for each of the graph measures Node count, graph entropy, and specialisations converge to a limit
    • Inspect Macro Evolution The Semantic Evolution of Online Communities 11 Macro Evolution: Entity Graphs 0 20 40 60 80 100 120 050010001500 Timestep |V| Linear Model (β=9.932) (a) Node Count 0 20 40 60 80 100 120 1234567 Timestep Diameter (b) Diameter 0 20 40 60 80 100 120 1234 Timestep H(G) (c) Graph Entropy 0 20 40 60 80 100 120 0.020.060.100.14 Timestep CC (d) Clustering Coe cient Figure 2: Entity graphs’ evolution based on node counts, diameter, graph entropy, and clustering coe cient. and entity graphs defined within a single vector for a given community (f): mf = {m1, m2, . . . , m13}. We began by de- riving a single matrix M 2 R93⇥13 containing the 93 commu- nities (rows) under analysis together with their 13 evolution measures (columns), and performing principal component analysis over this matrix. The result of this clustering is shown in Fig. 3 where we have colour-coded the di↵erent community forums by their hierarchical level in the plat- form (level 2 = most general, level 4 = most specific). We found that several of the more general forums appeared as dynamics of a given community at time step t to predict the churn rate of community members at time t + 1. We defined the churn rate of a community as the proportion of active users during a given time period (i.e. week segment) that post for the last time. We used the semantic dynamics motifs from the prior experiment (as listed within Fig. 3) and also included graph measures at a given time period: i.e. graph entropy at time t, specialisation count at time t, etc. We derived these features for every time step for each community and derived the response variable as the churn Diameter, graph entropy, and clustering coefficient converge to a limit
    • Modelling the Semantic Evolution of Individual Communities The Semantic Evolution of Online Communities 12 ¨  For each community and measure, induce a logistic population model: 1.  Get the set of measure increase time steps (T) 2.  Derive the proportionate growth rate for each step 3.  Use the set of proportionate growth rates to solve for unknown community-specific evolution variables (r & E) T = {a : a 2 [1, 119], m(G1,a+1) > m(G1,a)} Deriving the set of change time steps for a given com allows the proportionate growth rate for a given t (t 2 T) to be derived: Rt = (Pt+1 Pt)/Pt. Th is equivalent to the following equation which defi proportionate growth rate Rt in terms of the comm growth rate (r) and carrying capacity (E), our u variables: Rt = r(1 Pt/E). Therefore if we mea proportionate growth rate over the |T| distinct tim then we can derive, via simultaneous equations, the rate of the graph and its carrying capacity, the ve sures that we can use to characterise the semantic e Induce Community- Specific Evolution Models Proportionate Growth Rate @ t Population Growth Rate Population (measure) size @ t Carrying Capacity (Equilibrium) All communities’ models fit with χ2-test at α<0.01
    • The Semantic Evolution of Online Communities 13 Semantic Evolution… what can we actually use it for?
    • The Semantic Evolution of Online Communities Application: Community Evolution Analysis14 Apply Mined Semantic Evolution Dynamics ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −60 −40 −20 0 20 40 60 −40−20020 PC1 PC2 7 9 10 11 12 18 19 20 21 22 23 24 25 31 34 37 38 47 52 54 55 56 6064 68 82 86 93 99 105 107 108 109 116 120 124 125 126 127 136 137 151 171 177 227 232 237 246 252 259 264 267 269 271 333 343 346 370 382 388 389 392 410 411 443 446 453 464 468 471 474 475 476 478 481 482 483 490 495 503 506 512 514 518 522 529 532 542544 545 547 554 556 ● ● ● Level 2 Level 3 Level 4 E EG − CSemantic Evolution Motif: ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● 20 40 6022 24 5 31 34 37 54 56 82 93 105 109 177 227 2 264 269 333 392 443 446468 4 3 490 495 6 512 514522 529 542544 554 556 ● ● ● Level 2 Level 3 Level 4 CG − Node Count − Rate CG − Node Count − Equilbrium CG − Graph Entropy − Rate CG − Graph Entropy − Eq CG − Specialisation Count − Rate CG − Specialisation − Equilibrium EG − Node Count − Slope EG − Diameter − Rate EG − Diameter − Equilibrium EG − Entropy − Rate EG − Entropy − Equilibrium EG − Clustering Coefficient − Rate EG − Clustering Coefficient − Equilibrium f7 − After Hours f227 − Television f554 − Wanted Motors 10−1 100 101 102 103 he communities based on their semantic motifs (left) where level 4 forums are clustered lues for the concept graphs (CG) and entity graphs (EG) for the three outlier forums right). hat concept and entity graph den- row linearly (unlike in social net- erges on a limit, which we charac- Conference on Advances in Social Networks Analysis and Mining, 0:724–725, 2012. [4] Cristian Danescu-Niculescu-Mizil, Robert West, Dan 20 40 60 80 100 120 Timestep Linear Model (β=9.932) (a) Node Count 0 20 40 60 80 100 1201234567 Timestep Diameter (b) Diameter 0 20 40 60 80 100 1 1234 Timestep H(G) (c) Graph Entropy 2: Entity graphs’ evolution based on node counts, diameter, graph entro ity graphs defined within a single vector for a given nity (f): mf = {m1, m2, . . . , m13}. We began by de- single matrix M 2 R93⇥13 containing the 93 commu- ows) under analysis together with their 13 evolution es (columns), and performing principal component over this matrix. The result of this clustering is dynamics of a given com the churn rate of commu defined the churn rate of active users during a give that post for the last time motifs from the prior exp 40 60 80 100 120 Timestep Linear Model (β=9.932) Node Count 0 20 40 60 80 100 120 1234 Timestep Diameter (b) Diameter 0 20 123 H(G) (c) G Entity graphs’ evolution based on node counts, diame raphs defined within a single vector for a given (f): mf = {m1, m2, . . . , m13}. We began by de- le matrix M 2 R93⇥13 containing the 93 commu- ) under analysis together with their 13 evolution dynami the chu defined active u
    • Application: Community Evolution Analysis The Semantic Evolution of Online Communities 15 Apply Mined Semantic Evolution Dynamics ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 20 40 60 PC1 10 22 24 25 31 34 37 38 54 56 60 68 82 86 93 105 107 109 177 227 232 237 6 259 264 269 271 333 43 346 382 388 389 392411 443 446 453 464 468 471 474 476 478 481 482 483 490 495 3 506 512 514 518 522 529 542544 545 554 556 ● ● ● Level 2 Level 3 Level 4 CG − Node Count − Rate CG − Node Count − Equilbrium CG − Graph Entropy − Rate CG − Graph Entropy − Eq CG − Specialisation Count − Rate CG − Specialisation − Equilibrium EG − Node Count − Slope EG − Diameter − Rate EG − Diameter − Equilibrium EG − Entropy − Rate EG − Entropy − Equilibrium EG − Clustering Coefficient − Rate EG − Clustering Coefficient − Equilibrium f7 − After Hours f227 − Television f554 − Wanted Motors 10−1 100 101 102 103 t of the communities based on their semantic motifs (left) where level 4 forums are clustered el values for the concept graphs (CG) and entity graphs (EG) for the three outlier forums els (right). Entity Graph: fastest growth in the random forum Concept graph: slowest growth in random forum, largest equilibrium
    • Application: Churn Rate Prediction The Semantic Evolution of Online Communities 16 ¨  Task: forecast community churn rate at t+1 ¤  Based on semantic evolution up to t ¨  Trained a ridge regression model ¨  Root Mean Square Error: Predicted vs. Actual churn rate Apply Mined Semantic Evolution Dynamics Training Test 0 120 150 Training Test dynamics of a given community at time step t to predict the churn rate of community members at time t + 1. We defined the churn rate of a community as the proportion of active users during a given time period (i.e. week segment) that post for the last time. We used the semantic dynamics motifs from the prior experiment (as listed within Fig. 3) and also included graph measures at a given time period: i.e. graph entropy at time t, specialisation count at time t, etc. We derived these features for every time step for each community and derived the response variable as the churn rate at the following time step. We then compiled a train- ing dataset (up to week 120) and a test dataset (from week 120). Each dataset had the following form: D = {(xi, yi)}, where xi contained a 21-element time-delimited feature vec- tor for a given community and yi was the churn rate of the community at the following time step. We trained a ridge regression model ( ) using Dtrain and applied it to Dtest, testing the performance of: a) just concept graph features, b) just entity graph features, and c) all features. An autore- gressive model was used as the baseline - using the churn rate at time t as a single predictor variable for the churn rate at time t + 1. Performance was evaluated using the Root Mean Square Error (RMSE). Table 1: Root Mean Square Error when predicting churn rates using: an Autoregressive model (R2 = 0.341) and a Ridge Regression model using Concept
    • Application: Churn Rate Prediction The Semantic Evolution of Online Communities 17 sions : we te of that t the con- size. aphs: h but g to- ount After ums, ty of ecific The and neral ntity gressive model was used as the baseline - using the churn rate at time t as a single predictor variable for the churn rate at time t + 1. Performance was evaluated using the Root Mean Square Error (RMSE). Table 1: Root Mean Square Error when predicting churn rates using: an Autoregressive model (R2 = 0.341) and a Ridge Regression model using Concept Graph, Entity Graph, and all features. Baseline Concept Graph Entity Graph All Features 7.310 ⇥10 3 5.315 ⇥10 3 5.301 ⇥10 3 4.941 ⇥10 3 Table 1 presents the results from our prediction our exper- iment. We found that for all tested models (concept graph, entity graph, all features) we significantly outperformed the baseline - tested using the sign test (↵ = 0.001). Entity graph features outperform concept graphs but not signifi- cantly, while our best model is the use of all features together in a single model. These results empirically demonstrate the Significant reduction in error (Sign test with α=0.001) Baseline: Autoregressive model with churn rate @ t as a predictor for t+1 Apply Mined Semantic Evolution Dynamics
    • Findings and Conclusions The Semantic Evolution of Online Communities 18 ¨  Semantic graphs of online communities do not grow linearly: instead, they evolve to a limit ¤  Unlike in social networks [Lekovec et al., 2008] ¤  Exception: entity graph size ¨  A finite number of topics are discussed within communities ¤  Variation between communities ¨  Our use of logistic population models has enabled: 1.  Characterisation of community-specific evolution dynamics 2.  Community analysis to inspect how communities evolved differently 3.  Churn prediction based on semantic evolution
    • Future Work The Semantic Evolution of Online Communities 19 1.  Expanded to cover other online communities ¤  Question-answering, mined communities 2.  User profiling ¤  Capturing user-specific semantic evolution In-edge weight distribution (ServerFault) ● ● ● ● ● 1 2 3 4 5 0.50.60.70.80.91.01.11.2 k= 5 Lifecycle Stage H ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 2 4 6 8 10 0.30.40.50.60.7 k= 10 Lifecycle Stage H ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.050.150.250.35 H Non-churners Churners
    • Matthew Rowe @mrowebot m.rowe@lancaster.ac.uk http://www.lancaster.ac.uk/staff/rowem/ Markus Strohmaier @mstrohm markus.strohmaier@gesis.org http://markusstrohmaier.info/ Questions?20 The Semantic Evolution of Online Communities