Eswc2011 socialweb-discussionpredictions

Published at the 8th Extended Semantic Web Conference (ESWC 2011), Heraklion, Greece
At this event, the OU team presented their work on predicting
discussions on the Social Semantic Web by (a) identifying seed
posts, then (b) predicting the level of discussion that such posts
will generate. This analysis helps policy makers predict which
discussions and users will attract higher levels of attention
within the community.


Speaker notes:
  • Web Science talk!!!
  • Provision of a range of applications; large-scale uptake of smart phones
  • How can I track discussions?
  • What features are correlated with discussions?
  • User features more important than content features. Significant at alpha = 0.01. User reputation therefore more important than the quality of content!
  • Over training split! Pearson correlation: 0.674 = fair agreement
  • Ran experiments using J48 with all features, testing on the test split
  • Fitted a Kolmogorov-Smirnov goodness-of-fit test
  • For Haiti: content features more important than user features at higher ranks; user features show greater performance as ranks are increased. Union: user features outperform content for all ranks apart from the top. Indicates the differences in the dynamics of the data
  • Attention Economics!!

    1. Predicting Discussions on the Social Semantic Web. Matthew Rowe, Sofia Angeletou and Harith Alani, Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom
    2. Mass of Social Data. Social content is now published at a staggering rate…
    3. Social Data Publication Rates
       • ~600 Tweets per second [1]
       • ~700 Facebook status updates per second [1]
       • Spinn3r dataset collected from Jan – Feb 2011 [2]
         – 133 million blog posts
         – 5.7 million forum posts
         – 231 million social media posts
       [1] http://searchengineland.com/by-the-numbers-twitter-vs-facebook-vs-google-buzz-36709
       [2] http://icwsm.org/data/index.php
    4. The New Information Era
    5. But… Analysis is Limited
       • Market analysts – What are people saying about my products?
       • Opinion mining – How are people perceiving a given subject or topic?
       • eGovernment policy makers
         – How is a policy or law received by the public?
         – How can I maximise feedback to my content?
    6. Attention Economics
       • Given all this data…
         – How do we decide on what information to focus on?
         – How do we know what posts will evolve into discussions?
       • Attention Economics (Goldhaber, 1997)
       • Need to understand key indicators of high-attention discussions
    7. Discussions on Twitter
       • Twitter is used as a medium to:
         – Share opinions and ideas
         – Engage in discussions (discussing events, debating topics)
       • Identifying online discussions enables:
         – Up-to-date public opinion
         – Observation of topics of interest
         – Gauging the popularity of government policies
         – Fine-grained customer support
    8. Predicting Discussions
       • Pre-empt discussions on the Social Web:
         1. Identifying seed posts
            – i.e. posts that start a discussion
            – Will a given post start a discussion?
            – What are the key features of seed posts?
         2. Predicting discussion activity levels
            – What level of discussion will a seed post generate?
            – What are the key factors of lengthy discussions?
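The two-stage setup above can be sketched in code. Everything below is illustrative: the talk uses a trained J48 classifier for stage (a) and Support Vector Regression for stage (b), whereas this sketch substitutes a hand-written rule and made-up linear weights (`is_seed`, `predict_volume` and the weight values are hypothetical, not the paper's).

```python
# Hypothetical two-stage pipeline mirroring the talk's setup:
# stage (a) decides whether a post is a seed (will start a discussion);
# stage (b) predicts the discussion volume for predicted seeds.

def is_seed(features):
    """Stage (a): toy seed-post classifier. In the talk this is a trained
    J48 decision tree; here a single illustrative rule on the user's
    list-degree (the top-ranked feature in both datasets)."""
    return features["user_list_degree"] >= 2

def predict_volume(features):
    """Stage (b): toy stand-in for Support Vector Regression, a linear
    score over user features with made-up weights."""
    w = {"user_list_degree": 0.5, "user_in_degree": 0.01, "user_out_degree": 0.005}
    return sum(w[k] * features.get(k, 0) for k in w)

def predict_discussions(posts):
    """Run both stages; predicted non-seeds get a volume of 0."""
    return [predict_volume(p) if is_seed(p) else 0.0 for p in posts]

posts = [
    {"user_list_degree": 5, "user_in_degree": 300, "user_out_degree": 100},
    {"user_list_degree": 0, "user_in_degree": 10, "user_out_degree": 50},
]
print(predict_discussions(posts))  # first post qualifies as a seed; second does not
```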
    9. The Need for Semantics
       • For predictions we require statistical features
         – User features
         – Content features
       • Features are provided using differing schemas by different platforms
         – How to overcome heterogeneity?
       • Currently, no ontologies capture such features
    10. Behaviour Ontology (www.purl.org/NET/oubo/0.23/)
    11. Features
       The likelihood of posts eliciting replies depends upon popularity, a highly subjective term influenced by external factors. Properties influencing popularity include user attributes - describing the reputation of the user - and attributes of a post's content - generally referred to as content features. In Table 1 we define user and content features and study their influence on the discussion "continuation".

       Table 1. User and Content Features
       User features:
         – In Degree: number of followers of U
         – Out Degree: number of users U follows
         – List Degree: number of lists U appears on (lists group users by topic)
         – Post Count: total number of posts the user has ever posted
         – User Age: number of minutes from the user's join date
         – Post Rate: posting frequency of the user, Post Count / User Age
       Content features:
         – Post Length: length of the post in characters
         – Complexity: cumulative entropy of the unique words in post p: (1/n) Σ_{i∈[1,λ]} p_i (log λ − log p_i), with total word length n, λ unique words, and p_i the frequency of each word
         – Uppercase Count: number of uppercase words
         – Readability: Gunning fog index, 0.4 (ASL + PCW), using average sentence length (ASL) and the percentage of complex words (PCW) [7]
         – Verb Count: number of verbs
         – Noun Count: number of nouns
         – Adjective Count: number of adjectives
         – Referral Count: number of @user mentions
         – Time in the Day: normalised time in the day, measured in minutes
         – Informativeness: terminological novelty of the post wrt other posts; the cumulative TF-IDF value of each term t in post p: Σ_{t∈p} tfidf(t, p)
         – Polarity: cumulation of polar term weights in p (using the SentiWordNet 3 lexicon), normalised by the polar term count: (Po + Ne) / |terms|

       4.2 Experiments
       Experiments are intended to test the performance of different classification models.
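Two of the content features in Table 1 lend themselves to short implementations. The sketch below follows the table's definitions under stated assumptions: word frequencies p_i are raw counts, and, since the slide does not define "complex words", the Gunning fog stand-in crudely treats words of seven or more characters as complex (the standard definition uses three or more syllables).

```python
import math
import re

def complexity(post):
    """Cumulative entropy of the unique words in a post, per Table 1:
    (1/n) * sum_i p_i * (log(lam) - log(p_i)), where n is the total
    number of words, lam the number of unique words, and p_i the
    (raw-count) frequency of word i."""
    words = re.findall(r"\w+", post.lower())
    n = len(words)
    if n == 0:
        return 0.0
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    lam = len(counts)
    return sum(p * (math.log(lam) - math.log(p)) for p in counts.values()) / n

def readability(post):
    """Gunning fog index, 0.4 * (ASL + PCW): average sentence length plus
    the percentage of complex words. Complex words are approximated here
    as words of >= 7 characters (a crude stand-in for the usual
    three-or-more-syllables rule)."""
    sentences = [s for s in re.split(r"[.!?]+", post) if s.strip()]
    words = re.findall(r"\w+", post)
    if not sentences or not words:
        return 0.0
    asl = len(words) / len(sentences)
    pcw = 100.0 * sum(1 for w in words if len(w) >= 7) / len(words)
    return 0.4 * (asl + pcw)
```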
    12. Identifying Seed Posts
       …different motives which do not necessarily designate a response to the initial message. Therefore, we only investigate explicit replies to messages. To gather our discussions, and our seed posts, we iteratively move up the reply chain - i.e., from reply to parent post - until we reach the seed post in the discussion. We define this process as dataset enrichment; it is performed by querying Twitter's REST API using the in_reply_to_id of the parent post, and moving one step at a time up the reply chain. This same approach has been employed successfully in work by [12] to gather a large-scale conversation dataset from Twitter.

       Table 2. Statistics of the datasets used for experiments
       Dataset        Users   Tweets  Seeds  Non-Seeds  Replies
       Haiti          44,497  65,022  1,405  60,686     2,931
       Union Address  66,300  80,272  7,228  55,169     17,875

       Table 2 shows the statistics of our collected datasets. One can observe the difference in conversational tweets between the two corpora, where the Haiti dataset contains fewer seed posts as a percentage than the Union Address dataset, and therefore fewer replies. However, as we explain in a later section, this does not correlate with a higher discussion volume in the former dataset. We convert the collected datasets from their proprietary JSON formats into triples, annotated using concepts from our behaviour ontology above; this enables our features to be derived by querying our datasets using basic SPARQL queries.

       • Experiments
         – Haiti and Union Address datasets
         – Divided each dataset up using a 70/20/10 split for training/validation/testing
       • Evaluated a binary classification task
         – Is this post a seed post or not?
         – Precision, Recall, F1 and Area under ROC
         – Tested: user, content, user+content features
         – Tested Perceptron, SVM, Naïve Bayes and J48
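The dataset-enrichment walk up the reply chain can be sketched as follows. A plain dictionary stands in for Twitter's REST API here; `find_seed` and the post-store layout are illustrative, not the real endpoint or payload.

```python
# Sketch of the "dataset enrichment" step: walk up the reply chain from a
# reply to its seed post, one parent at a time, following in_reply_to_id.

def find_seed(post_id, posts):
    """Follow in_reply_to_id links until a post with no parent
    (the seed post) is reached, and return its id."""
    current = posts[post_id]
    while current.get("in_reply_to_id") is not None:
        current = posts[current["in_reply_to_id"]]
    return current["id"]

posts = {
    1: {"id": 1, "in_reply_to_id": None},  # seed post
    2: {"id": 2, "in_reply_to_id": 1},     # direct reply
    3: {"id": 3, "in_reply_to_id": 2},     # reply to the reply
}
print(find_seed(3, posts))  # walks 3 -> 2 -> 1
```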
    13. Identifying Seed Posts
       …the F1 score - and used this model together with the best performing features, using a ranking heuristic, to classify posts contained within our test split. We first report on the results obtained from our model selection phase, before moving on to our results from using the best model with the top-k features.

       Table 3. Results from the classification of seed posts using varying feature sets and classification models

       (a) Haiti Dataset
                        P      R      F1     ROC
       User     Perc    0.794  0.528  0.634  0.727
                SVM     0.843  0.159  0.267  0.566
                NB      0.948  0.269  0.420  0.785
                J48     0.906  0.679  0.776  0.822
       Content  Perc    0.875  0.077  0.142  0.606
                SVM     0.552  0.727  0.627  0.589
                NB      0.721  0.638  0.677  0.769
                J48     0.685  0.705  0.695  0.711
       All      Perc    0.794  0.528  0.634  0.726
                SVM     0.483  0.996  0.651  0.502
                NB      0.962  0.280  0.434  0.852
                J48     0.824  0.775  0.798  0.836

       (b) Union Address Dataset
                        P      R      F1     ROC
       User     Perc    0.658  0.697  0.677  0.673
                SVM     0.510  0.946  0.663  0.512
                NB      0.844  0.086  0.157  0.707
                J48     0.851  0.722  0.782  0.830
       Content  Perc    0.467  0.698  0.560  0.457
                SVM     0.650  0.589  0.618  0.638
                NB      0.762  0.212  0.332  0.649
                J48     0.740  0.533  0.619  0.736
       All      Perc    0.630  0.762  0.690  0.672
                SVM     0.499  0.990  0.664  0.506
                NB      0.874  0.212  0.341  0.737
                J48     0.890  0.810  0.848  0.877

       4.3 Results
       Our findings from Table 3 demonstrate the effectiveness of using solely user features for identifying seed posts. In both the Haiti and Union Address datasets…
    14. Identifying Seed Posts
       • What are the most important features?
       …which we found to be 0.674, indicating a good correlation between the two lists and their respective ranks.

       Table 4. Features ranked by Information Gain Ratio wrt the seed post class label. Each feature name is paired with its IG in brackets.
       Rank  Haiti                             Union Address
       1     user-list-degree (0.275)          user-list-degree (0.319)
       2     user-in-degree (0.221)            content-time-in-day (0.152)
       3     content-informativeness (0.154)   user-in-degree (0.133)
       4     user-num-posts (0.111)            user-num-posts (0.104)
       5     content-time-in-day (0.089)       user-post-rate (0.075)
       6     user-post-rate (0.075)            user-out-degree (0.056)
       7     content-polarity (0.064)          content-referral-count (0.030)
       8     user-out-degree (0.040)           user-age (0.015)
       9     content-referral-count (0.038)    content-polarity (0.015)
       10    content-length (0.020)            content-length (0.010)
       11    content-readability (0.018)       content-complexity (0.004)
       12    user-age (0.015)                  content-noun-count (0.002)
       13    content-uppercase-count (0.012)   content-readability (0.001)
       14    content-noun-count (0.010)        content-verb-count (0.001)
       15    content-adj-count (0.005)         content-adj-count (0.0)
       16    content-complexity (0.0)          content-informativeness (0.0)
       17    content-verb-count (0.0)          content-uppercase-count (0.0)
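The ranking in Table 4 is by Information Gain Ratio. A minimal version of that measure, assuming discrete feature values, looks like this (`info_gain_ratio` is an illustrative re-implementation, not the code used in the experiments, where continuous features would first be discretised):

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of discrete values."""
    n = len(labels)
    counts = {}
    for l in labels:
        counts[l] = counts.get(l, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain_ratio(feature_values, labels):
    """Information gain of a discrete feature over the class labels,
    normalised by the feature's own entropy (split information)."""
    n = len(labels)
    groups = {}
    for v, l in zip(feature_values, labels):
        groups.setdefault(v, []).append(l)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0
```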
    15. Identifying Seed Posts
       • What is the correlation between seed posts and features?
       [Fig. 3. Contributions of top-5 features to identifying Non-seeds (N) and Seeds (S). Upper plots are for the Haiti dataset and the lower plots are for the Union Address dataset.]
    16. Identifying Seed Posts
       • Can we identify seed posts using the top-k features?
    17. Predicting Discussion Activity
       • From identified seed posts:
         – Can we predict the level of discussion activity?
         – How much activity will a post generate?
       • [Wang & Groth, 2010] learn a regression model and report on its coefficients
         – Identifying relationships between features
       • We do something different:
         – Predict the volume of the discussion
    18. Predicting Discussion Activity
    19. Predicting Discussion Activity
       Using the predicted values for each post we can then form a ranking, which is comparable to a ground truth ranking within our data. This provides discussion activity levels, where posts are ordered by their expected volume. This approach also enables contextual predictions, where a post can be compared with existing posts that have produced lengthy debates.

       5.1 Experiments
       Datasets: For our experiments we used the same datasets as in the previous section: tweets collected from the Haiti crisis and the Union Address speech. We maintain the same splits as before - training/validation/testing with a 70/20/10 split - but without using the test split. Instead we train the regression models using the seed posts in the training split and then test the prediction accuracy using the seed posts in the validation split; seed posts in the validation set are identified using the J48 classifier trained using both user+content features. Table 5 describes our datasets for clarification.

       Table 5. Discussion volumes and distributions of our datasets
       Dataset        Train Size  Test Size  Test Vol Mean  Test Vol SD
       Haiti          980         210        1.664          3.017
       Union Address  5,067       1,161      1.761          2.342

       • Compare rankings
         – Ground truth vs predicted
       • Experiments
         – Using Haiti and Union Address datasets
         – Evaluation measure: Normalised Discounted Cumulative Gain
         – Assessing nDCG@k where k = {1, 5, 10, 20, 50, 100}
         – Tested Support Vector Regression with: user, content, user+content features
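The nDCG@k evaluation can be computed in a few lines. This sketch uses the plain gain-equals-relevance variant of DCG (other variants use 2^rel − 1); the "relevance" of each post is its true discussion volume, and the input list holds those true volumes in predicted-rank order.

```python
import math

def dcg_at_k(volumes, k):
    """Discounted cumulative gain over the top-k of a ranked list:
    each item's gain is divided by log2 of its (1-based) rank + 1."""
    return sum(v / math.log2(i + 2) for i, v in enumerate(volumes[:k]))

def ndcg_at_k(volumes_in_predicted_order, k):
    """nDCG@k: DCG of the predicted ordering of true discussion volumes,
    normalised by the DCG of the ideal (descending-volume) ordering."""
    ideal = sorted(volumes_in_predicted_order, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(volumes_in_predicted_order, k) / ideal_dcg
```

A perfectly ordered list scores 1.0 at every k; any misordering within the top-k lowers the score toward 0.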
    20. Predicting Discussion Activity
       …features. Although different coefficients are learnt for each dataset, for major features with greater weights the signs remain the same. In the case of the list-degree of the user, which yielded a high IGR during classification, there is a similar positive association with the discussion volume - the same is also true for the in-degree and the out-degree of the user. This indicates that a constant increase in the combination of a user's in-degree, out-degree and list-degree will lead to increased discussion volumes. Out-degree plays an important role by enabling the seed post author to see posted responses - given that the larger the out-degree, the greater the reception of information from other users.

       Table 6. Coefficients of user features learnt using Support Vector Regression over the two datasets. Coefficients are rounded to 4 dp.
              user-num-posts  user-out-degree  user-in-degree  user-list-degree  user-age  user-post-rate
       Haiti  -0.0019         +0.0010          +0.0016         +0.0046           +0.0001   +0.0001
       Union  -0.0025         +0.0114          +0.0025         +0.0154           -0.0003   -0.0002
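Reading off coefficient signs, as in Table 6, can be illustrated with a minimal fit. The real experiments used Support Vector Regression over all user features; the stand-in below is a closed-form one-feature least-squares fit on fabricated data, purely to show coefficient inspection, so the numbers are not the paper's.

```python
# Toy coefficient inspection: fit y ~ a*x + b for a single feature and
# read off the sign of the slope, echoing the positive user-list-degree
# coefficient in Table 6. Data below is fabricated for illustration.

def fit_linear(xs, ys):
    """Closed-form least-squares fit for one feature; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

# fabricated data: discussion volume grows with list-degree
list_degree = [0, 1, 2, 3, 4]
volume = [0.1, 0.6, 1.1, 1.6, 2.1]
slope, intercept = fit_linear(list_degree, volume)
print(round(slope, 3))  # positive slope: positive association, as in Table 6
```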
    21. Findings
       • User reputation and standing are crucial to
         – eliciting a response
         – starting a discussion
       • Greater broadcast capability = greater likelihood of response
         – More listeners = more discussion
       • Activity levels influenced by out-degree
         – Allows the poster to see responses from 'respected' peers
    22. Conclusions
       • Pre-empt discussions to empower
         – Market analysts
         – Opinion mining
         – eGovernment policy makers
       • Behaviour ontology
         – Captures impact across platforms
       • Approach accurately predicts:
         – Which posts will yield a reply, and
         – The level of discussion activity
    23. Current and Future Work
       • Experiments over a forum dataset
         – Content features >> user features
         – Different platform dynamics
       • Extend experiments to a random Twitter dataset
       • Extension to the behaviour ontology
         – Captures concentration, i.e. the focus of a user on specific topics
       • Categorising users by role
         – Based on observed behaviour
    24. Questions?
       people.kmi.open.ac.uk/rowe
       m.c.rowe@open.ac.uk
       @mattroweshow
