Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter Users

86 views

Published on

talk for paper published at ICWE2019:
Primo F, Missier P, Romanovsky A, Mickael F, Cacho N. A customisable pipeline for continuously harvesting socially-minded Twitter users. In: Procs. ICWE’19. Daedjeon, Korea; 2019.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter Users

  1. 1. A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter Users Paolo Missier(*), Alexander Romanovsky(*), Nélio Cacho(+), Flavio Primo(*), Mickael Figueredo(+) (*) Newcastle University, UK (+) Federal University of Rio Grande do Norte, Brazil Southampton, June 18, 2019
  2. 2. Dengue Fever is an endemic problem which Brazilian health system is incapable to handle with 460 thousands cases of dengue fever1 increase of 240% of cases increase of 29% of mortality Dengue Fever, Zika virus and Chikungunya virus Endemic Problem 1 1 - http://g1.globo.com/bemestar/dengue/noticia/2015/04/casos-de-dengue-aumentam-240-em-2015-e-pais-tem- 4605-mil-casos.html
  3. 3. Microcephaly Notifications (2015-2016) 2
  4. 4. Social Sensors 18 Keyword filtering Preprocessing (bag of words) Classifica9on Clustering Epidemics-sensi9ve Keywords Twi@er stream Users as social sensors Detec. ng relevant and ac. onable social content 2 1 3.1 3.2 Implicit signalsExplicit signals Rapid Interven9ons Public Info portal Dengue Repor9ng app
  5. 5. Twitter 7 Mapping the Twitter signal Sousa L, de Mello R, Cedrim D, et al. VazaDengue: An information system for preventing and combating mosquito-borne diseases with social networks. Inf Syst. 2018;75. doi:10.1016/j.is.2018.02.003
  6. 6. 6 PaoloMissier-2018 Examples of tweets Directly actionable tweets (breeding sites) Indirectly actionable tweets - Sickness - News Jokes:
  7. 7. 7 PaoloMissier-2018 Results 1600 tweets Problems: 1. Noise! 2. Concept drift in social discourse about the topic - seasonal, change in perspective over time  Supervised learning requires re-training 3. Poor geolocation
  8. 8. Problem: insufficient public resources / health officials Hypothesis: surveillance tasks can be crowdsourced Challenge: noise, “needle in a haystack” Content classification  user discovery
  9. 9. Who are we looking for? ● “socially-minded” ● Engaged in multiple specific social topics ● Active within communities ● Local contexts / events (organisers or participants) ● Engagement sustained over time Activist: A person who believes strongly in political or social change and takes part in activities such as public protests to try to make this happen (Cambridge Dictionary) [1] Poell T. Social media and the transformation of activist communication: exploring the social media ecology of the 2010 Toronto G20 protests. Information, Commun Soc. 2014;17(6):716-731. doi:10.1080/1369118X.2013.812674 Social media facilitates activists’ communication and organization [1]
  10. 10. Hypothesis Online activists are recognisable from their online footprint - Definition can be grounded in objective and inexpensive metrics - Generic across specific social themes
  11. 11. Why not influencers? Influencers: “Prominent individuals with special characteristics that enable them to affect a disproportionately large number of their peers with their actions” [2] ● Large body of metrics [3] – aimed at high visibility users across global networks ● Not focused on (societal) impact [4] [2] Kardara M, Papadakis G, Papaoikonomou A, Tserpes K, Varvarigou T. Large-scale evaluation framework for local influence theories in Twitter. Inf Process Manag. 2015;51(1):226-252. doi:https://doi.org/10.1016/j.ipm.2014.06.002 [3] Riquelme F, González-Cantergiani P. Measuring user influence on Twitter: A survey. Inf Process Manag. 2016;52(5):949- 975. doi:10.1016/J.IPM.2016.04.003 [4] Cha M, Haddai H, Benevenuto F, Gummadi KP. Measuring User Influence in Twitter : The Million Follower Fallacy. Int AAAI Conf Weblogs Soc Media. 2010:10-17. doi:10.1.1.167.192 Activists: low-key, less prominent users with local scope
  12. 12. Goals and Challenges Non-goal: to suggest an operational definition of online activist profiles Challenges • Expect smaller online footprint than influencers • Poor signal / noise ratio • Unsupervised  validation • Generalise: activists  custom users profiles • Customisable users discovery pipeline for Twitter: topics / space / time • Continuous discovery  repeat users / reinforce signal • Unsupervised  (nearly) fully automated Goal: Identify users who are actively engaged in online events
  13. 13. Approach ● Global network analysis  many small context-specific networks ● Contexts are Twitter Search queries ● User features + ranking  reuse accepted metrics (content + topological) ● Continuous, semi-automatic discovery of new contexts over time
  14. 14. Contexts Off topic context: all content within same geo- temporal boundaries no hashtag focus hashtags K, geo- and temporal- located – on topic context Query result (from the search API) – “posted tweets”
  15. 15. 1. Harvesting content from context
  16. 16. 2. user-user network Users who posted within context User-user context network Ui Uj Uk 2 1 1
  17. 17. 2. user-user network
  18. 18. 3. Community detection
  19. 19. 3. Community detection: Infomap [5] Random Walker path: 1111100 1100 0110 11011 10000 11011 0110 0011 10111 1001 0011 1001 0100 0111 10001 1110 0111 10001 0111 1110 0000 1110 10001 0111 1110 0111 1110 1111101 1110 0000 10100 0000 1110 10001 0111 0100 10110 11010 10111 1001 0100 1001 10111 1001 0100 1001 0100 0011 0100 0011 0110 11011 0110 0011 0100 1001 10111 0011 0100 0111 10001 1110 10001 0111 0100 10110 111111 10110 10101 11110 00011 Code length: 314 bitsRandom Walker path Huffman Coding of the nodes [5] Rosvall, M., & Bergstrom, C. T. (2008). Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4), 1118-1123.
  20. 20. 3. Community detection: Infomap Random Walker path: 111 0000 11 01 101 100 101 01 0001 0 110 011 00 110 00 111 1011 10 111 000 10 111 000 111 10 011 10 000 111 10 111 10 0010 10 011 010 011 10 000 111 0001 0 111 010 100 011 00 111 00 011 00 111 00 111 110 111 110 1011 111 01 101 01 0001 0 110 111 00 011 110 111 1011 10 111 000 10 000 111 0001 0 111 010 1010 010 1011 110 00 10 011 Code length: 243 bitsRandom Walker path using prefixes Dual problem: detect communities by compressing the description of information flows on networks
  21. 21. Community detection: DEMON DEMON: based on ego networks [6] ● Label propagation algorithm  local  efficient ● Users may be assigned to multiple communities [6] Arnaboldi V, Conti M, Passarella A, Pezzoni F. Ego networks in Twitter: An experimental analysis. In: 2013 Proceedings IEEE INFOCOM. 2013:3459-3464. doi:10.1109/INFCOM.2013.6567181
  22. 22. 4. Profiling Communities  Characterize users with metrics Context independent ● Follower rank Context specific ● Indegree centrality Content based ● Topical Focus ● Topical Strength ● Topical Attachment
  23. 23. 4. Profiling -- topological / community metrics Context independent: Context specific:
  24. 24. 4. Profiling Content based: Evaluated both on- and off-topic Where: ● P1: # of original posts by u in C ● P2: # urls found in original posts by u in C ● R3: # of retweets of u's tweets [7] Missier P, McClean C, Carlton J, et al. Recruiting from the Network: Discovering Twitter Users Who Can Help Combat Zika Epidemics. In: Cabot J, De Virgilio R, Torlone R, eds. Web Engineering: 17th International Conference, ICWE 2017, Rome, Italy, June 5-8, 2017, Proceedings. Roma, Italy: Springer International Publishing; 2017:437-445. doi:10.1007/978-3-319-60131-1_30 [8] Bizid I, Nayef N, Boursier P, Faiz S, Morcos J. Prominent Users Detection During Specific Events by Learning On- and Off-topic Features of User Activities. In: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015. ASONAM ’15. New York, NY, USA: ACM; 2015:500-503 [8] [8] [7]
  25. 25. 5. Ranking functions -- experimental ↑ on topic ↓ community leader ↑ on topic ↑ community leader ↓ popularity ↑ on topic ↓ community leader ↓ popularity
  26. 26. Content Harvesting Network Generation Communities Detection Users Profiling Users Ranking Users Profiles C Twitter Search API Q(C) P(C) P(C) U-U network {U1..Uk} Users Ranking functions Users Metrics - content-based - topological Pipeline – summary top-k users
  27. 27. Evaluation: tweets harvesting Unsupervised discovery  no prior ground truth A posteriori empirical validation of the generated user rankings Topic • UK Health campaigns Contexts • 25 campaigns Content • 200x25 tweets Users • 3,500 viable users
  28. 28. Evaluation: test contexts 25 contexts 1 day → 1 month 254 nodes 253 edges 2 density -0.2 assortativity C Assortativity is the tendency for nodes to connect to nodes with the same degree (>0) or not (<0).
  29. 29. Community detection: results When using DEMON: - 48% of the networks result in valid communities  6% of users belong to these communities - 52% of the networks result in no communities  all users added When using Infomap: - All networks successfully partitioned into communities - Avg 8.5 users / community About 3,570 users added in both cases Avg 142 users/network
  30. 30. Evaluation: repeated users Repeated users are ranked higher (multiple participations to contexts). example: Remove inactive users: ● FR(u) = 0 ● min_max(|Tweets(u)|) < 0.005
  31. 31. Evaluation – repeat users Top 10 repeat users
  32. 32. Evaluation – repeat users Number of repeat users for each context
  33. 33. Evaluation: choice of ranking function Number of ranked users: 3567 R3 effectively finds individuals
  34. 34. Evaluation – individual/org vs on/off topic Top 10 ranked users for the ranking functions R1, R2 and R3
  35. 35. Extension 1: Twitter contexts expansion Why: to overcome Twitter premium API limitations How: use user mentions found in relevant tweets to incrementally expand set of users Given Context C = <K, T, S> C  P(C)  U.= U(C) # authors + mentioned users while # user limit condition U’ = expand(U,C) def expand(U,C): Unew = {} for u in U: P = filter(harvest(u), C) Unew = Unew + mentions(P) return Unew
  36. 36. Content Harvesting Network Generation Communities Detection Users Profiling Users Ranking Users Profiles C0 Ci+1 Twitter Search API Q(C) P(C) P(C) U-U network {U1..Uk} Users Ranking functions Users Metrics - content-based - topological Discover Candidate tags Tags selection Extension 2: Semi-automated context discovery top-k users In progress!
  37. 37. Thank you! Code and full-text paper available @ http://bit.ly/activists-paper Paolo.Missier@ncl.ac.uk, @PMissier A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter Users Paolo Missier(*), Alexander Romanovsky(*), Nélio Cacho(+), Flavio Primo(*), Mickael Figueredo(+)

×