Prediction of Reaction towards Textual Posts in Social Networks
Mohamed Mahmoud (elgeish@stanford.edu)
Abstract
Posting on social networks could be a
gratifying or a terrifying experience de-
pending on the reaction the post and its
author —by association— receive from the
readers. To better understand what makes
a post popular, this project inquires into
the factors that determine the number of
likes, comments, and shares a textual post
gets on LinkedIn; and finds a predictor
function that can estimate those quanti-
tative social gestures.
Keywords: Linear Regression; LinkedIn;
Machine Learning; Popularity; Social Net-
works
1 Introduction
Social media have been playing a prominent role
in our daily lives in the past decade; they connect
us with our families, friends, and colleagues in un-
precedented ways. Sharing on social networks be-
came a primary means of communication —a basic
human need— that has the potential of reaching
a large group of recipients around the globe. Your
posts on social networks are a reflection of who you
think you are (self-perception), or what you want
others to see; and how readers react to them, is
a reflection of how they perceive you. The lack of
immediacy in such social interactions allows the au-
thor to ruminate on what to write, when sharing a
textual post, in hopes of maximizing the fulfillment
of a desideratum (akin to an essayist or a reporter
seeking a certain goal). On the other hand, readers
reciprocate by showing appreciation of the effort
exerted in the form of virtual social gestures (akin
to fan letters), which may help the post travel fur-
ther, and propagate through many social networks.
Let’s take LinkedIn as an example of a social
network where members are allowed to like, comment
on, and re-share posts; the more likes a LinkedIn
post gets, the more fulfilled the author feels about
it; such reaction vouches for the author’s ethos, and
supports the author’s claim to fame. Conversely,
the author’s reputation and self-esteem might suf-
fer when the post is uncelebrated. In order to alle-
viate the social pressure associated with sharing on
social media, this work proposes a system that pre-
dicts quantitative reaction towards textual posts in
social networks over a specific time window. This
time series analysis can help authors gauge the pop-
ularity (as a proxy for quality) of an update before
posting it; they can refine the post if the scores are
unsatisfactory, and examine how various versions
of the same post score. In addition, the system can
be used by social networks to predict which posts
are more appealing to readers for ranking purposes
[1]. This work will focus on LinkedIn as the social
network of choice.
2 Input-Output Behavior
The input to the proposed system is a textual post
shared by a member of a social network, and the
output is a prediction of the quantitative reaction
(number of likes, comments, and shares) the post
shall receive within a certain time window; we’ll
use a fixed window of one day. Take, for example,
the post in the figure below:
Figure 1: an example of a textual post
The output corresponding to this input post is
a vector of (predicted number of likes, predicted
number of comments, and predicted number of
shares) — one for each day. Here’s another exam-
ple that should score much higher than the former:
Figure 2: an example of a popular textual post
3 Model
Disclaimer: The analysis, exploration, and prepa-
ration of data; feature engineering, and extrac-
tion; use, and fine-tuning of machine learning al-
gorithms, along with code developed for train-
ing, validating, and testing the model; and prac-
tices adopted for this project are driven by my
own personal experience, and not connected to
any LinkedIn product. The data discussed here
were anonymized —by removing personally identi-
fiable information— in accordance with LinkedIn’s
strict policies regarding data privacy. In order
to protect LinkedIn’s intellectual property, some
of the features mentioned below are redacted.
The goal of the proposed system is to obtain
a predictor function f that maps new input x to
output y ∈ R^3, corresponding to (predicted number
of likes, predicted number of comments, and pre-
dicted number of shares) for the input post and age
in days (time window).
This is a regression problem, which is solved
by using the following framework (using supervised
learning, which is given the training data to pro-
duce the predictor function):
Figure 3: diagram of a learning framework
The training data Dtrain is a set of examples,
which are basically input-output pairs: the inputs
are the posts (including age in days), and outputs
are the corresponding social gestures.
3.1 Data Preparation
The data came from original textual posts on
LinkedIn; an original post is authored by its poster,
and not a re-share of another post. In order to learn
a predictor, three datasets were gathered for train-
ing, validation, and test; the datasets came from
historical LinkedIn data using ETL (Extract,
Transform, Load); one of the challenges faced dur-
ing this project is performing ETL at the scale of
LinkedIn: joining multiple datasets that contain
billions of records; each dataset contains certain
aspects of the record that serves as an example for
training. Member IDs, along with personally iden-
tifiable information, were removed once the meta-
data were generated. The combined tally of exam-
ples in the training and validation sets is around
3.8 million — after removing outliers. In order to
maintain good data hygiene, the test set was gath-
ered after training was completed from a date range
that doesn’t overlap with that of the training and
validation datasets, and it amounts roughly to 0.25
million examples (out of 4.05 million examples in
total). The data are split into 75% for training, 19%
for validation, and 6% for testing.
3.1.1 Outliers
Outliers were determined by plotting [2] the dis-
tribution of the label values; here’s an example of
the distribution of label values for likes using a log
scale:
Figure 4: distribution of label values for likes
(count vs. log(label value))
After exploring the distribution of data at dif-
ferent bucket widths, a cutoff line was chosen to
remove outliers that are unlikely to occur (based
on the gathered data). This process was repeated
for comments and shares as well.
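The cutoff-based filtering described above can be sketched as follows (a minimal sketch; the field names and the cutoff value here are illustrative, not the project’s actual parameters):

```python
import math

def remove_outliers(examples, get_label, cutoff):
    """Drop examples whose label value exceeds a cutoff chosen
    by inspecting the log-scale distribution of label values."""
    return [ex for ex in examples if math.log1p(get_label(ex)) <= cutoff]

# Hypothetical usage: keep posts whose log(1 + likes) is at most 8.
posts = [{"likes": 3}, {"likes": 12}, {"likes": 250000}]
kept = remove_outliers(posts, lambda ex: ex["likes"], cutoff=8.0)
```

The cutoff itself comes from eyeballing the plotted distribution, not from a closed-form rule.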
3.1.2 Missing Data
Some of the fields in the collected records were
missing at random; for example, fields that were
left blank by the members of the social network.
Whenever possible, a replacement value was cal-
culated for the missing field. For example, for a
missing timezone field, an approximation was cal-
culated using the member’s country. A more in-
teresting example for a missing field is one that’s
real-valued, which was replaced by the mean value
of that field in the observed examples, which is an
acceptable solution to replace data missing at ran-
dom (MAR).
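Mean imputation for a real-valued field can be sketched as follows (the record shape and field name are hypothetical):

```python
def impute_missing(records, field):
    """Replace a missing real-valued field with the mean of the
    observed values, an acceptable fix for data missing at random."""
    observed = [r[field] for r in records if r[field] is not None]
    mean = sum(observed) / len(observed)
    for r in records:
        if r[field] is None:
            r[field] = mean
    return records

# Hypothetical records with a missing real-valued field.
rows = [{"connections": 100}, {"connections": None}, {"connections": 300}]
impute_missing(rows, "connections")
```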
3.1.3 Raw Data
A labeled example is basically a tuple of the tex-
tual post along with its metadata (input); and its
label is the number of likes, comments, and shares
generated for the post (output). A typical record
in a dataset looks like the following:
((Text & Metadata), (Likes, Comments, Shares))
Metadata include data about the post like age, vis-
ibility (e.g. public), author metadata (e.g. network
size), etc.; such metadata is essential to better rep-
resent the inputs (in correlation to the output) in
the context of the problem at hand; in a social net-
work, the text of a post per se is insufficient to
predict its popularity; we need to consider other
factors like metadata about the post and its au-
thor; for example, the size of the author’s network
is expected to play a major role in predicting pop-
ularity.
After the raw data were extracted, the records
were serialized and stored into binary files to save
space. Those binary files served as input files for
the feature extractor.
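The serialization step can be sketched with Python’s pickle module (cPickle in Python 2, which the project used); the record values below are made up:

```python
import os
import pickle
import tempfile

def save_records(records, path):
    # Serialize extracted records into a compact binary file.
    with open(path, "wb") as f:
        pickle.dump(records, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_records(path):
    # Deserialize records for downstream feature extraction.
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical record: ((text, metadata), (likes, comments, shares)).
record = (("Hello LinkedIn!", {"age_days": 1}), (10, 2, 1))
path = os.path.join(tempfile.gettempdir(), "records.bin")
save_records([record], path)
restored = load_records(path)
```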
3.2 Scoring
Each input is distilled into a feature vector
φ(x) = [φ1(x), . . . , φd(x)] ∈ R^d,
which represents the input and is computed
by a feature extractor. Correspondingly, a
weight vector wi = [wi1, . . . , wid] ∈ R^d, i ∈
{likes, comments, shares}, specifies the weight of
each feature for each component of the prediction
vector y ∈ R^3.
Given a feature vector φ(x) and a weight vec-
tor wi, the respective prediction score component
yi ∈ R is their inner product:
yi = wi · φ(x)
The score vector represents a snapshot of the
post’s popularity at the given age in days; to get
the time series, the age is incremented by the de-
sired time unit.
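With sparse feature vectors, the inner product yi = wi · φ(x) can be sketched as a dictionary dot product (the feature names and weights below are illustrative):

```python
def score(weights, features):
    """Prediction score yi = wi . phi(x): the inner product of a
    weight vector and a sparse feature vector, both as dicts."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())

# Hypothetical sparse feature vector and weight vector for likes.
phi = {"log(post length)": 4.0, "contains a question": 1.0}
w_likes = {"log(post length)": 0.5, "contains a question": 2.0}
y_likes = score(w_likes, phi)  # 0.5*4.0 + 2.0*1.0 = 4.0
```

Absent features contribute nothing, which keeps scoring cheap even when the full template space is huge.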
3.3 Feature Extraction
Based on domain knowledge, a feature vector φ(x)
is picked to represent an input x and contribute to
the prediction vector y.
The examples were loaded from the binary files
where they were stored, then transformed into tu-
ples of (feature vector, label vector). The feature
vector is the union of raw features and features de-
rived from the raw data. The feature extractor
stored the tuples into binary files using Python’s
cPickle module. Caching the feature vectors and
their respective labels speeds up the training pro-
cess instead of re-extracting the features from the
examples every time predicted scores are calcu-
lated.
The features that contribute to the popularity
of textual posts can be divided into the following:
• Textual features (pertaining to the post’s
text)
• Post metadata features (pertaining to the ac-
tion of posting)
• Author features (pertaining to the author)
In each category, there are real-valued and indi-
cator features that are listed below as feature tem-
plates (to be filled out by the training data).
3.3.1 Textual Features
• log(post length) ∈ [0.0, 3.2): boolean feature
• log(post length) ∈ [3.2, 6.4): boolean feature
• log(post length) ∈ [6.4, ∞): boolean feature
• contains a URL: boolean feature
• contains a question: boolean feature
• contains an e-mail address: boolean feature
• contains a hashtag: boolean feature
• contains a smiley emoticon: boolean feature
• contains a frowny emoticon: boolean feature
• post length: real-valued feature
• log(post length): real-valued feature
• ratio of non-alphanumeric characters: real-
valued feature
• word count: real-valued feature
• stemming: real-valued features, one for each
stem in the text and its count
• unigrams: real-valued features, one for each
word in the text and its count
• bigrams: real-valued features, one for each bi-
gram (two adjacent words) in the text and its
count
• trigrams: real-valued features, one for each
trigram (three adjacent words) in the text
and its count
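A few of the textual feature templates above can be sketched as follows (a simplified extractor; the tokenization, bucket naming, and feature keys are illustrative, not the project’s actual code):

```python
import math
import re
from collections import Counter

def textual_features(text):
    """Sketch of textual feature templates: post-length buckets,
    boolean indicators, and n-gram counts."""
    features = {}
    log_len = math.log(len(text)) if text else 0.0
    # Bucketized log(post length), per the boundaries 0, 3.2, 6.4.
    for lo, hi in [(0.0, 3.2), (3.2, 6.4), (6.4, float("inf"))]:
        features["log(post length) in [%s, %s)" % (lo, hi)] = float(lo <= log_len < hi)
    features["contains a URL"] = float("http" in text.lower())
    features["contains a question"] = float("?" in text)
    features["post length"] = float(len(text))
    words = re.findall(r"\w+", text.lower())
    features["word count"] = float(len(words))
    for w, c in Counter(words).items():                   # unigrams
        features["unigram:" + w] = float(c)
    for bg, c in Counter(zip(words, words[1:])).items():  # bigrams
        features["bigram:" + " ".join(bg)] = float(c)
    return features

feats = textual_features("Machine learning is fun, isn't it?")
```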
3.3.1.1 Bucketization of Post Length
In order to figure out the relationships between
the proposed features and their respective weights,
interactive exploration of the data was performed
as a part of the feature engineering process; plot-
ting various features vs. the number of social ges-
tures uncovered some insights. For example, the
log(post length) can be bucketized into three buck-
ets:
Figure 5: bubble chart of log(post length) and num-
ber of likes excluding outliers
The bucket boundaries are 0, 3.2, and 6.4; un-
surprisingly, counts of comments and shares fol-
lowed suit due to the correlation between the three
social gestures:
Figure 6: bubble chart of log(post length) and num-
ber of comments excluding outliers
Figure 7: bubble chart of log(post length) and num-
ber of shares excluding outliers
3.3.1.2 Stemming
The Snowball stemmer from nltk [3], which is
language-specific, was used whenever the post’s lan-
guage was supported; otherwise, the Porter stem-
mer was used. The Snowball stemmer has a much
better understanding of the language model —
including stopwords exclusion— and it supports
the following languages: Danish, Dutch, English,
Finnish, French, German, Hungarian, Italian, Nor-
wegian, Portuguese, Romanian, Russian, Spanish,
and Swedish; many more languages were found in
the examples.
3.3.2 Post Metadata Features
• language of post: indicator feature
• day of month: indicator feature
• day of week: indicator feature
• hour of day: indicator feature
• day of month and hour: indicator feature
• day of week and hour: indicator feature
• sharing visibility: indicator feature
• post is in member interface locale: boolean
feature
• post is in member default locale: boolean fea-
ture
• post is in member locale: boolean feature
• post age in days: real-valued feature
• log(post age in days): real-valued feature
• post age in minutes: real-valued feature
• log(post age in minutes): real-valued feature
• mentions count: real-valued feature
3.3.2.1 Language Identification
Language identification was performed using a
language identifier software library developed by
LinkedIn.
3.3.2.2 Locality of Timestamps
Timestamps were adjusted to represent local
time according to the post’s timezone; the predictor
should calculate the same score for two posts shared
at the same local time — all other factors being
equal. For example, if a member who lives in Cali-
fornia shared a post at 10 AM PST, it should have
the same score as a post shared at 10 AM EST by
an identical member who lives in New York; time-
based features have to be trained with respect to
groups of values that contribute to the score in the
same fashion. This is important because of the role
the time of day when a post was published plays in
predicting its popularity; we here assume that the
majority of the post’s target readership lives in the
same timezone as the poster [4], [5].
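The adjustment can be sketched with Python’s zoneinfo module (the project predates it; this only illustrates the idea, and the timestamp below is arbitrary):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def local_hour(epoch_seconds, tz_name):
    """Convert a post's UTC timestamp to the author's local time so
    time-of-day features are comparable across timezones."""
    utc = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return utc.astimezone(ZoneInfo(tz_name)).hour

# The same instant reads as different local hours on the two coasts,
# so the hour-of-day feature must be computed post-adjustment.
hour_la = local_hour(1700589600, "America/Los_Angeles")
hour_ny = local_hour(1700589600, "America/New_York")
```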
3.3.3 Author Features
• default locale: indicator feature
• interface locale: indicator feature
• country: indicator feature
• industry: indicator feature
• timezone: indicator feature
• connections visibility: indicator feature
• feed visibility: indicator feature
• picture visibility: indicator feature
• is a LinkedIn influencer: boolean feature
• interface locale is default: boolean feature
• connections count: real-valued feature
• log(connections count): real-valued feature
• followers count: real-valued feature
• log(followers count): real-valued feature
• average likes count: real-valued feature
• average comments count: real-valued feature
• average shares count: real-valued feature
• a set of proprietary features (redacted)
Figure 8: chart of average likes per member (ex-
cluding outliers) and number of likes (excluding
outliers); unsurprisingly, there is a correlation be-
tween the two dimensions
4 Approach
4.1 Baseline
A rule-based system was chosen to predict the num-
ber of likes, comments, and shares for a given input
by searching for certain keywords in the text, and
factoring in the size of the author’s network in a
formula for each prediction, for example:
likes = α(network size) + Σ_{word ∈ text} weight(word)
The coefficients and weights in such a system can
be guessed based on heuristics or domain exper-
tise. A baseline chosen with α = 0.011 and
weight(’I’) = 1 yielded a large test error; see the
results section for more details.
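The baseline formula above can be sketched directly (the default coefficient and keyword weight mirror the values quoted in the text; everything else is illustrative):

```python
def baseline_likes(text, network_size, alpha=0.011, weights=None):
    """Rule-based baseline: a network-size term plus a weight for
    each matched keyword in the post's text."""
    if weights is None:
        weights = {"I": 1.0}
    return alpha * network_size + sum(weights.get(w, 0.0)
                                      for w in text.split())

# 0.011 * 1000 + weight("I") is approximately 12.
likes = baseline_likes("I love machine learning", network_size=1000)
```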
4.2 Oracle
An oracle can see the future, and tell us exactly
how many likes, comments, and shares a post has
at the end of a future time window. So, in our case,
it’s basically a time machine.
4.3 Linear Regression
Linear regression can be used to predict the num-
ber of social gestures by learning the weights vec-
tors that contribute to the scores. The objective
is to minimize the average loss determined by the
squared loss function:
Losssquared(x, yi, wi) = (wi · φ(x) − yi)2
4.4 Stochastic Gradient Descent
The choice of stochastic gradient descent (SGD)
for linear regression was an obvious one as the datasets
used in this project are large, a fact that influenced
the tuning of hyperparameters and the algorithm
as well.
4.4.1 Hyperparameters
Because of the large number of example, the fol-
lowing formula was used for the learning rate:
η = min(0.001, number of updates)
It starts with a value that’s relatively large –yet
small enough to keep the weights from overflowing–
then get smaller as the number of updates increases
(as the convergence rate increases).
Termination of the algorithm was determined
by either reaching diminishing improvements (ϵ =
0.0001) of the combined training and validation
errors (|TEt+1 − TEt| + |VEt+1 − VEt| < ϵ), or
exhausting the maximum number of iterations al-
lowed; however, the latter can be incremented by 1
if a significant improvement in the validation error
has been observed (ε = 0.1). The same condition
was used to save a snapshot of the training pro-
gram in case it got aborted. When the program
was done, the weights vector was pickled (serial-
ized) and saved to disk for the test program to load
and evaluate. The examples used for training were
equally divided into 100 files; the training dataset had
80 files, and the validation dataset had 20 files. The
order of the training files to process at the start of
each iteration was picked at random, and the con-
tent of each file was then shuffled as well in hopes
of increasing the convergence rate.
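The training loop described above can be sketched as follows (a minimal single-output sketch with sparse weights, the decaying learning rate, squared-loss updates, and per-epoch shuffling; the termination check is simplified to the training error alone):

```python
import random

def sgd(examples, num_epochs=2000, eta_max=0.001, eps=0.0001):
    """SGD for linear regression with the squared loss.
    examples: list of (sparse feature dict, label) pairs."""
    w, updates, prev_error = {}, 0, float("inf")
    for _ in range(num_epochs):
        random.shuffle(examples)  # re-order examples each epoch
        for phi, y in examples:
            updates += 1
            eta = min(eta_max, 1.0 / updates)  # decaying learning rate
            pred = sum(w.get(f, 0.0) * v for f, v in phi.items())
            # Gradient of (w . phi - y)^2 is 2(w . phi - y) * phi.
            for f, v in phi.items():
                w[f] = w.get(f, 0.0) - eta * 2.0 * (pred - y) * v
        error = sum((sum(w.get(f, 0.0) * v for f, v in phi.items()) - y) ** 2
                    for phi, y in examples) / len(examples)
        if abs(prev_error - error) < eps:  # diminishing improvements
            break
        prev_error = error
    return w

# Toy data generated by y = 2x; the learned weight should approach 2.
train = [({"x": 1.0}, 2.0), ({"x": 2.0}, 4.0)]
w = sgd(train)
```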
4.4.2 Feature Normalization
One of the issues found while training was overflow
of weights; the feature values varied widely,
and had to be normalized; the following formula
was used to rescale the values:
φ(x) = (φ(x) − min(φ(x))) / (max(φ(x)) − min(φ(x)))
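Min-max rescaling per feature can be sketched as follows (applied column-wise over the observed values of one feature):

```python
def min_max_rescale(values):
    """Rescale one feature's observed values to [0, 1] to keep SGD
    weights from overflowing when features have different scales."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature carries no signal
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_rescale([10.0, 20.0, 30.0])  # [0.0, 0.5, 1.0]
```

Note that min and max must come from the training data only; the same bounds are then reused at prediction time.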
4.5 Gradient Descent Using VW
Vowpal Wabbit [6] makes use of parallel threads,
feature hashing, and cache files to speed up the gra-
dient descent algorithm. Running VW with a single
pass, and with 100 passes generated similar results;
when run with multiple passes, VW held out 10%
of the examples for validation and reported the dev
loss instead of the training loss. VW was run using
the squared loss function, and feature normaliza-
tion.
4.5.1 Hyperparameters
VW was also run with the adaptive option, which
sets an individual learning rate for each feature
and improves learning when the feature vector is
large [7]. The initial learning rate was set to the
default value (η0 = 0.5).
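A VW run along the lines described above can be sketched as follows (the input and output file names are assumptions; the flags are standard VW options for the squared loss, feature normalization, adaptive learning rates, and multiple cached passes):

```shell
# Train with the squared loss, normalized features, per-feature
# adaptive learning rates, and 100 passes over a cached dataset.
vw train.vw --loss_function squared --normalized --adaptive \
   --learning_rate 0.5 --passes 100 --cache_file train.cache -f model.vw
```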
5 Results
Please note that the fitted coefficients were
redacted to protect LinkedIn’s intellectual prop-
erty.
                  Baseline   SGD       VW (1 Pass)  VW (100 Passes)
Likes
  Iterations      N/A        101       1            100
  Features Count  2          3016665   590650105    53157724900
  Training RMSE   N/A        3.930     3.683        N/A
  Dev RMSE        N/A        4.222     N/A          3.384
  Test RMSE       8.352      2.440     N/A          N/A
Comments
  Iterations      N/A        25        1            100
  Features Count  2          3016878   590650105    5315772490
  Training RMSE   N/A        2.374     2.144        N/A
  Dev RMSE        N/A        2.497     N/A          2.120
  Test RMSE       2.576      1.068     N/A          N/A
Shares
  Iterations      N/A        2         1            100
  Features Count  2          3017243   590650105    53157724900
  Training RMSE   N/A        0.446     0.474        N/A
  Dev RMSE        N/A        0.484     N/A          0.447
  Test RMSE       63.878     0.424     N/A          N/A
Table 1: comparison of various approaches
RMSE is the root-mean-square error (since the
predictors used the squared loss function); in the
table above, it’s rounded to the third decimal place.
VW produced a model file of hashed features
and their respective weights; parsing that exces-
sively large vector into a format that the test har-
ness understands —to calculate Test RMSE— was
prohibitive; Dev RMSE is a good proxy for it (in
the scope of this project).
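The RMSE figures above follow directly from the squared loss; the computation can be sketched as (the sample predictions and labels are illustrative):

```python
import math

def rmse(predictions, labels):
    """Root-mean-square error: the square root of the mean of the
    squared residuals, matching the squared loss used in training."""
    n = len(labels)
    return math.sqrt(sum((p - y) ** 2
                         for p, y in zip(predictions, labels)) / n)

error = rmse([3.0, 5.0], [1.0, 5.0])  # sqrt((4 + 0) / 2) = sqrt(2)
```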
6 Literature Review
Another project that set out to predict Facebook
likes [5] compared two approaches: linear regres-
sion and nearest neighbor, and found that the for-
mer is more effective in predicting likes than the
latter. It’s also worth mentioning that the number
of examples used in that project, 49216, is two or-
ders of magnitude smaller than the one used here,
which indicates that the test data may not have
enough statistical coverage for a social network of
more than one billion daily unique users [8].
7 Potential Improvements
There are a few more improvements that I would
have liked to explore; for example, adding more
textual features like part-of-speech tagging, and
lemmatization.
Another possible addition to the feature vector
is metadata features about who liked, commented
on, and shared a post; the observed labels repre-
sent a time series of underlying sequences of
values that aren’t IID (Independent and Identically
Distributed); for example, the probability that a
post gets n more likes at time tj depends on who
liked it at all times before tj because likes propa-
gate through the news feed (when member x likes a
public post y, it shows up as news for x’s network,
and they can like it, comment on it, and/or share
it from their news feeds); this is known as serial
coupling [9]; augmenting the feature vector with
metadata features about members who reacted to
the post can improve the accuracy of predicting its
popularity [10], but it will make the cardinality of
the feature vector orders of magnitude larger than
what is currently used. Exploring other loss func-
tions, ensemble learning, more non-linear features,
and feature interactions (e.g., the cross product of
metadata features) might yield more accurate pre-
dictions.
8 Acknowledgments
I’d like to thank Percy Liang and the CS221 TAs
for the guidance they gave me throughout the quarter.
I’d also like to thank LinkedIn (special thanks to
Guy Lebanon and Bee-Chung Chen) for providing
the data that made this project possible.
References
[1] D. Agarwal, B.-C. Chen, Q. He, Z. Hua, G. Lebanon, Y. Ma, P. Shivaswamy, H.-P. Tseng, J. Yang,
and L. Zhang, “Personalizing linkedin feed”, in Proceedings of the 21th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, ser. KDD ’15, Sydney, NSW, Australia: ACM,
2015, pp. 1651–1660, isbn: 978-1-4503-3664-2. doi: 10.1145/2783258.2788614. [Online]. Avail-
able: http://doi.acm.org/10.1145/2783258.2788614.
[2] H. Wickham, Ggplot2: Elegant graphics for data analysis. Springer New York, 2009, isbn: 978-0-387-
98140-6. [Online]. Available: http://had.co.nz/ggplot2/book.
[3] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python, 1st. O’Reilly Media, Inc.,
2009, isbn: 0596516495, 9780596516499.
[4] Z. Ellison and S. Hildick-Smith, “Blowing up the twittersphere: Predicting the optimal time to tweet”,
Stanford, Stanford, CA, Tech. Rep., 2014. [Online]. Available: http://cs229.stanford.edu/
proj2014/Seth%20Hildick-Smith,%20Zach%20Ellison,%20Blowing%20Up%20The%
20Twittersphere-%20Predicting%20the%20Optimal%20Time%20to%20Tweet.pdf.
[5] K. Chen, B. Huang, and B. Lee, “Facebook like predictor within your friends”, Northwestern Uni-
versity, Evanston, IL, Tech. Rep., 2015. [Online]. Available: http://kbbz.github.io/files/
Final%20report.pdf.
[6] J. Langford, L. Li, and A. Strehl, Vowpal Wabbit, 2007.
[7] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic
optimization”, J. Mach. Learn. Res., vol. 12, pp. 2121–2159, Jul. 2011, issn: 1532-4435. [Online].
Available: http://dl.acm.org/citation.cfm?id=1953048.2021068.
[8] M. Zuckerberg, 2015.
[9] L. Cao, “Non-iidness learning in behavioral and social data”, The Computer Journal, 2013. doi:
10.1093/comjnl/bxt084. eprint: http://comjnl.oxfordjournals.org/content/
early/2013/08/22/comjnl.bxt084.full.pdf+html. [Online]. Available: http://comjnl.
oxfordjournals.org/content/early/2013/08/22/comjnl.bxt084.abstract.
[10] M. Dundar, B. Krishnapuram, J. Bi, and R. B. Rao, “Learning classifiers when the training data is not
iid”, in Proceedings of the 20th International Joint Conference on Artifical Intelligence, ser. IJCAI’07,
Hyderabad, India: Morgan Kaufmann Publishers Inc., 2007, pp. 756–761. [Online]. Available: http:
//dl.acm.org/citation.cfm?id=1625275.1625397.
Appendix A Learning Rate Plots
Figure A9: line chart of learning rate using VW (single pass); RMSE vs. log2(examples count) for likes, comments, and shares
Figure A10: line chart of learning rate using VW (100 passes); RMSE vs. log2(examples count) for likes, comments, and shares