My keynote presentation, 'Gaining, retaining and losing influence in online communities', from a conference at King's College London on the topic of 'social influence in the information age'
2. Acknowledgements
• Researchers: Simon Jones and James Dove
• Collaborators & advisors: Yorick Wilks, Louise Guthrie, Arthur Thomas, Martin Groen, Jan Noyes, Dawn Eubanks, Sam Hunter, John Horgan, Leon Watts, Andy Swarbrick, Niqi Cummings.
• Sponsors
7. Bath / UWE studies
Who are influentials?
How do they behave?
Who becomes influential?
Who loses influence? Why?
8. Samples
• RL
– ca. 2 million posts
– 35 subforums
– 7,000 active users
– 10-year archive
• IA
– ca. 500,000 posts
– 30 subforums
– 900 active users
– 7-year archive
• LH
– ca. 21,000 posts
– 20 subforums
– 250 active members
– 3-year archive
• Enron
– 520k emails, 150 people, 4 years
11. Data collection
• Developed our own ‘scraping’ system that works with circa 70% of forums on the Internet
• Hosted on a remote, offshore server
• Now contains > 5 million posts
• System runs via Tor encryption
13. Meta-data collected & derived
• Structural Features
– In Degree
– Out Degree
• Reciprocity Features
– % of bi-directional neighbours
– % of threads with reciprocal communication
• Persistence Features
– Average posts per thread
– Std dev. posts per thread
– Average post length
– Std dev. post length
• Content Features
– % of quotes posted
– % of posts containing question marks
– % of posts containing URLs
Additional Meta-data
• Time since joining
• Tendency (RL only)
• Initialisation Features
– % of threads initiated by user
• Diversity Features
– % of threads participated in
– % of sub forums participated in
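A minimal sketch of how the structural and reciprocity features might be derived from scraped posts. It assumes each reply has already been reduced to a hypothetical `(author, replied_to)` pair and that degree counts distinct users rather than individual posts; the slides do not say which definition was used.

```python
from collections import defaultdict

def structural_features(replies):
    """replies: iterable of (author, replied_to) pairs (hypothetical input format)."""
    out_neighbours = defaultdict(set)  # users each author has replied to
    in_neighbours = defaultdict(set)   # users who have replied to each author
    for author, target in replies:
        if author == target:
            continue  # ignore self-replies
        out_neighbours[author].add(target)
        in_neighbours[target].add(author)

    features = {}
    for user in set(out_neighbours) | set(in_neighbours):
        outs = out_neighbours.get(user, set())
        ins = in_neighbours.get(user, set())
        neighbours = outs | ins      # anyone the user has exchanged a reply with
        bidirectional = outs & ins   # neighbours where replies went both ways
        features[user] = {
            "in_degree": len(ins),
            "out_degree": len(outs),
            "pct_bidirectional": 100.0 * len(bidirectional) / len(neighbours),
        }
    return features

# Toy data, invented for illustration only.
replies = [("ann", "bob"), ("bob", "ann"), ("cat", "ann"), ("ann", "dan")]
features = structural_features(replies)
```

The remaining features (posts per thread, post length, % of posts with URLs) are simple per-user averages and proportions over the same post table.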
19. Sampling
• All posters with a single post removed
• Top 10% by rep power and reputation chosen
• Matched stratified sample drawn from the remaining 90%
• 245 leaders / non-leaders from IA, 353 leaders / non-leaders from RL
20. Study 1: The language of ‘opinion leaders’ online
• Three-step regression equations to predict leader vs. non-leader in IA / RL using:
– Linguistic markers (e.g. less 1st person, more past tense, more readability)
– Meta-data (URLs, question marks, num. posts)
• 90/10 split used for training / validating
• Final prediction accuracy: 85% (IA) and 94% (RL)
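For reference, readability is listed among the linguistic markers and the results table names it as Flesch readability. Assuming this means the standard Flesch Reading Ease score (the slides do not say which variant was used), it is computed as:

```latex
\[
\text{Reading Ease} = 206.835
  - 1.015 \,\frac{\text{total words}}{\text{total sentences}}
  - 84.6 \,\frac{\text{total syllables}}{\text{total words}}
\]
```

Higher scores indicate text that is easier to read, so 'more readability' would correspond to shorter sentences and shorter words.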
21.
Columns: Group | Higher in opinion leaders | Lower in opinion leaders
Shared (RL & IA): Past tense; Number of posts (total); 2nd person (‘you’); Flesch Readability; Work words; 1st person singular (‘I’); Religion words; Ave. Word Count; Question marks
RL only: Negative Emotion; Adverbs; Words of 6 letters or more; Assent words; 1st person plural (‘we’); Positive Emotion
IA only: Assent words; Non-fluencies; Fillers; URL links; Words of 6 letters or more
22. Study 2: Language + Networks
• SNA metrics: Centrality, PageRank, Clustering
• Meta-data: Activity levels
• LIWC: all 83 features
• Naïve Bayesian Classifier using 10-fold cross-validation (supervised machine learning)
• IA and LH sample
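The classification set-up can be sketched as follows. This is an illustration only: it assumes Gaussian likelihoods for each feature, uses two invented features in place of the SNA, meta-data and LIWC variables, and runs on synthetic data rather than the forum samples.

```python
import math
import random

def fit_gnb(rows, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    model = {}
    for cls in set(labels):
        members = [r for r, l in zip(rows, labels) if l == cls]
        n = len(members)
        means = [sum(col) / n for col in zip(*members)]
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*members), means)]
        model[cls] = (n / len(rows), means, variances)
    return model

def predict_gnb(model, row):
    """Return the class with the highest log posterior for one row."""
    def log_posterior(cls):
        prior, means, variances = model[cls]
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return score
    return max(model, key=log_posterior)

def cross_validate(rows, labels, k=10, seed=0):
    """k-fold cross-validation: each row is held out exactly once."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in order if i not in held]
        model = fit_gnb([rows[i] for i in train], [labels[i] for i in train])
        correct += sum(predict_gnb(model, rows[i]) == labels[i] for i in held_out)
    return correct / len(rows)

# Synthetic users: [activity level, share of 2nd-person pronouns] (both hypothetical).
rng = random.Random(1)
rows = ([[rng.gauss(6, 1), rng.gauss(0.6, 0.1)] for _ in range(50)]
        + [[rng.gauss(0, 1), rng.gauss(0.3, 0.1)] for _ in range(50)])
labels = ["leader"] * 50 + ["non-leader"] * 50
accuracy = cross_validate(rows, labels, k=10)
```

Because the classifier is trained on users already labelled as leaders or non-leaders, held-out accuracy is the natural measure, and 10 folds let every user be tested once.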
23. Background
• ‘The masses do not now take their opinions from dignitaries in Church or State, from ostensible leaders, or from books. Their thinking is done for them by men much like themselves, addressing or speaking in their name, on the spur of the moment . . .’ (John Stuart Mill, On Liberty)
• “. . . leadership at its simplest: it is casually exercised, sometimes unwitting and unbeknown, within the smallest groupings of friends, family members, and neighbors. It is not leadership on the high level of Churchill, nor of a local politico; it is the almost invisible, certainly inconspicuous form of leadership at the person-to-person level of ordinary, intimate, informal, everyday contact.” (Katz & Lazarsfeld, 1955)
24. Influence and the 2-step model
• Katz and Lazarsfeld (1955)
– Messages are intercepted by ‘influentials’
– “from radio and print to opinion leaders and from them to less active sections of the population” (p.32)
25. Accidental influencers
• Influence is due to position in the network (a contagion approach)
• ‘Cascades’ don’t differ in pattern, just speed and scale due to position (Watts & Dodds, 2007)
26. SNA measures
• Betweenness Centrality
– The share of shortest paths between other pairs of vertices in the network that pass through a given vertex
– e.g. In a network of spies, who is the spy through whom most confidential information is likely to flow?
• Eigenvector Centrality
– A vertex’s eigenvector centrality is proportional to the sum of the eigenvector centralities of all vertices connected to it
– e.g. In a network of citations, who is the author most cited by other well-cited authors?
• PageRank
– The rank value indicates the importance of a particular page. A hyperlink to a page counts as a vote of support. The PageRank of a page is defined recursively and depends on the number and PageRank of all pages that link to it
• Clustering Coefficient
– The local clustering coefficient of a node in a network graph quantifies how close its neighbors are to being a clique (complete graph).
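Two of these measures are short enough to sketch directly. The code below is an illustrative implementation, not the tooling used in the studies: PageRank by power iteration with an assumed damping factor of 0.85, and the local clustering coefficient on an undirected graph. The example graphs are invented.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links: dict mapping each node to the list of nodes it points to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in links.items():
            for target in targets:
                # Each link passes on an equal share of the source's rank.
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank

def local_clustering(adjacency, node):
    """adjacency: dict mapping each node to the set of its undirected neighbours."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours cannot form a pair
    linked_pairs = sum(1 for a in neighbours for b in neighbours
                       if a < b and b in adjacency[a])
    return linked_pairs / (k * (k - 1) / 2)

# Directed 'who replies to whom' toy graph.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]})

# Undirected toy graph for clustering.
adjacency = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
clustering_a = local_clustering(adjacency, "a")
```

In the toy reply graph, node `c` receives links from every other node and so ends up with the highest rank. In the undirected graph, only one of the three possible pairs among `a`'s neighbours is connected.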
28. Study 2: Classifier results
Community | % correctly classified | % non-leaders incorrectly classified as leaders | % leaders incorrectly classified as non-leaders
RL | 88.5% | 10.5% | 1%
IA | 83.0% | 13.9% | 3.1%
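Each row of the table sums to 100%, which suggests the three figures are shares of all classified users rather than per-class error rates. Under that reading (an inference, not something the slides state), the breakdown can be computed as in this small sketch with invented labels:

```python
def classification_breakdown(actual, predicted):
    """Return the three table columns as percentages of all users."""
    total = len(actual)
    pairs = list(zip(actual, predicted))
    correct = sum(a == p for a, p in pairs)
    false_leaders = sum(a == "non-leader" and p == "leader" for a, p in pairs)
    missed_leaders = sum(a == "leader" and p == "non-leader" for a, p in pairs)
    return {
        "pct_correct": 100.0 * correct / total,
        "pct_non_leaders_as_leaders": 100.0 * false_leaders / total,
        "pct_leaders_as_non_leaders": 100.0 * missed_leaders / total,
    }

# Toy labels, invented for illustration only.
actual = ["leader"] * 5 + ["non-leader"] * 5
predicted = ["leader"] * 4 + ["non-leader"] + ["non-leader"] * 3 + ["leader"] * 2
breakdown = classification_breakdown(actual, predicted)
```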
31. So…
• Users with high reputation are characterised by:
– high activity / posting level
– highly connected, diverse networks
– relatively few 1st person singular pronouns; 3rd person use depends on the context
– rhetorical flourishes
• Aristotle (350BC): ethos, pathos and/or logos
32. Study 3: Enron leadership
• Can these techniques be used to identify leaders elsewhere?
• Enron dataset: 150 users, 1.5m emails
• Cleaned to ‘pre 1991’ and ---original message--- removed.
• Job titles identified via existing sources, LinkedIn, court records.
33. Study 3: Method
• LIWC variables based on previous work (e.g. pronouns, tense, tentativeness, argumentation).
• Entered into Logistic Regression – Leaders (CEO -> Manager: n=69) vs. Non-Leaders (Traders -> Employee: n=71). Lawyers removed from data set.
• 74% accuracy
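A binary logistic regression of this kind can be sketched in a few lines. The version below fits the model by plain gradient descent on two invented LIWC-style features (rate of 'I' and rate of 'you'); the data, learning rate and number of epochs are assumptions for illustration and do not reproduce the Enron analysis.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * x for w, x in zip(weights, row))) - y
            grad_b += error
            for j, x in enumerate(row):
                grad_w[j] += error * x
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

def predict_leader(weights, bias, row):
    """True when the predicted probability of 'leader' is at least 0.5."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row))) >= 0.5

# Hypothetical rows: [% 1st person singular, % 2nd person]; label 1 = leader.
rows = [[2.0, 4.0], [2.5, 3.5], [1.5, 4.5], [5.0, 1.0], [4.5, 1.5], [5.5, 0.5]]
labels = [1, 1, 1, 0, 0, 0]
weights, bias = fit_logistic(rows, labels)
predictions = [predict_leader(weights, bias, row) for row in rows]
```

The sign of each fitted weight shows the direction of the association. In this toy data the weight on the first-person rate comes out negative, matching the 'less 1st person' pattern reported for leaders.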
34. Study 3: Predictors
• Final model: R² = .41: 74% accuracy
• Pronouns: Less 1st person (‘I’), more 2nd person (‘you’)
• Argumentation: Less certainty, fewer ‘exclusion’ words, more tentativeness
• Emotion: More anxiety words
• Similar pattern then re: 1st person, negative emotion, ‘soft’ argumentation
35. Summary, Conclusions
• Possible to automate identification of opinion leaders online using:
– Meta-data, language, SNA: upwards of 85% accuracy
• Leadership more nuanced than expected:
– Softer – more tentative, less ‘bossy’
– Credible: readability, knowledge
• Methods might allow additional insights into the nature of opinion leadership and influence…
37. Weakness of previous studies
• Reputation and reputation power are based on a vBulletin algorithm that is unknown
• You retain the highest level you’ve ever had, but people gain, and lose, position
• So, we adopted a behavioural approach to studying influence, using social roles
38. Study 4: Social roles
• Over 100 years of research on the topic. Social roles:
– Impose structure on interaction and organise behaviour
– They are “recognized, accepted, and used to accomplish pragmatic interaction goals in a community” (Callero, 1994, p. 232)
– Most easily identified through behavioural regularities and patterns of relationships – a ‘structural signature’ (Gleave et al., 2009)
41. Previous work (relations, SNA)
• Usenet: ‘answer people’ vs. ‘discussion people’
• Wikipedia: substantive experts vs. technical editors
42. Role = what you do + who you do it with
i.e. both behaviour and network
43. Method
• Expectation Maximization (EM) Clustering based on all meta-data (introduced earlier) using Weka 3.5
• Identifies clusters modelled as Gaussians (each defined by a mean and covariance matrix)
• Assigns each instance a probability distribution over the clusters
• Validated using 10-fold cross-validation
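The idea behind EM clustering can be shown with a deliberately simplified sketch: one feature and two Gaussian clusters, written in plain Python rather than Weka. The feature values are invented, and the real analysis used all meta-data features with full covariance matrices.

```python
import math

def gaussian_pdf(x, mean, variance):
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def em_two_clusters(values, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    # Initialise the two components at the extremes of the data.
    means = [min(values), max(values)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each value belongs to each cluster.
        memberships = []
        for x in values:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            memberships.append([d / total for d in dens])
        # M-step: re-estimate each cluster from its weighted members.
        for k in range(2):
            resp = sum(m[k] for m in memberships)
            weights[k] = resp / len(values)
            means[k] = sum(m[k] * x for m, x in zip(memberships, values)) / resp
            variances[k] = sum(m[k] * (x - means[k]) ** 2
                               for m, x in zip(memberships, values)) / resp + 1e-6
    # memberships come from the final E-step, i.e. one step behind the final parameters.
    return means, variances, weights, memberships

# Hypothetical 'average posts per thread' for eight users.
values = [1.0, 1.2, 0.9, 1.1, 5.0, 5.3, 4.8, 5.1]
means, variances, weights, memberships = em_two_clusters(values)
```

Each user ends up with a probability for every cluster rather than a hard label, which is the 'probability distribution per instance' mentioned above.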
63. Results
• Always a leader vs. never a leader – highly accurate (98%)
– More central, more influential ties (‘page rank’), more active (word count), more readable, more ‘thanks’, more ‘strong’ / ‘political’ language; fewer URLs, less ‘submissive’ language and fewer personal pronouns.
• Promotion from contributor (note: only 6% of contributors ever made this move)
– SNA centrality / page rank, word count, ‘political’ words, readability. 88% accurate in matched sample.
• Promotion from collaborator (around 70% accuracy)
– SNA page rank / centrality, vector similarity to expert texts, readability, thank rate, ‘political’ words.
64. Summary, Conclusions
• Identifiable social roles within these communities
• Most shared between the three, with similar behavioural characteristics
• Large churn – sample role composition
– Suggests that a way of identifying the resilience of a community may be via the roles users inhabit
– And targeting specific roles might improve / degrade community functioning
• It’s possible to predict who will become a leader using SNA + language + some behaviour.
67. Predicting movement between roles
• Multinomial logistic regression (RL only)
– Leader -> Leader (reference category)
– Leader -> Another role
– Leader -> Inactive
• Predictors (from the 6 months before the move):
– Network metrics (centrality, page rank)
– Some simple meta-data (e.g. URL posting)
– Language (Harvard General Inquirer categories for influence)
• R²: .35 (SNA/meta-data), .43 (SNA/meta-data + language)
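For readers unfamiliar with the model, a standard multinomial logit with 'Leader -> Leader' as the reference category takes the form below. The notation is assumed here for illustration: x is the vector of predictors and each non-reference outcome k has its own coefficient vector.

```latex
\[
P(Y = k \mid \mathbf{x}) =
  \frac{\exp(\boldsymbol{\beta}_k^{\top}\mathbf{x})}
       {1 + \sum_{j \in \{\text{role},\,\text{inactive}\}} \exp(\boldsymbol{\beta}_j^{\top}\mathbf{x})},
  \qquad k \in \{\text{role}, \text{inactive}\}
\]
\[
P(Y = \text{leader} \mid \mathbf{x}) =
  \frac{1}{1 + \sum_{j \in \{\text{role},\,\text{inactive}\}} \exp(\boldsymbol{\beta}_j^{\top}\mathbf{x})}
\]
```

Each coefficient is therefore read relative to staying a leader: a positive coefficient on a predictor raises the odds of moving to that outcome rather than remaining a leader.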
68. Results
• Influential -> Drop out
– Reduced betweenness, page rank
– Increased clustering
– Reduced thank rate (i.e. % thanks per post)
– Increased bi-directional conversations
– Increased question marks
– Increased use of weak language
– No other language differences
69. Results
• Influential -> Another role
– Reduced betweenness, page rank
– Increased clustering
– Increased bi-directional conversations (for those who became joining conversationalists)
– Increased question marks
– Increased use of weak language
– No other changes in language use
71. Summary
• Gaining influence is about:
– Being active, on topic, credible
• Maintaining influence:
– Varied, active, useful, interested
• Losing influence:
– Clustered, insular, doubtful
• Giving up altogether:
– Lack of recognition?
• Next steps: the impact on behaviour