Laboratorio Master BI&BDA (Modulo Web Data Analytics) : Reddit fashion insights

Fashion insights
Angela Ciliberti
Michele di Padova
Francesco Morazzoni
Navid Nobani

Reddit fashion insights: scope & phases
SCOPE
Analyzing fashion-related comments on Reddit to answer the following questions:
1. How many people talk and read about fashion on Reddit?
2. Are there any influencers?
3. Which are the most popular fashion-related topics and brands?
4. What is the sentiment towards a given topic/brand, and how does it evolve over time?
PROJECT PHASES
[Diagram of the project phases; recoverable labels: word2vec, reach & influencers]

How Reddit works
Key features
• Subreddit: a community that groups Reddit users interested in a given topic.
• Post/comment score: users express appreciation or disapproval of a post or comment by upvoting or downvoting it. Each upvote is worth +1 and each downvote -1. The score is used as a proxy of engagement.
• User karma: net sum of the upvotes and downvotes received by all the posts and comments written by the user.

Scraping
Where to scrape from
After searching for «fashion», subreddits were selected by relevance and by number of subscribers:
1) Male Fashion Advice: 1.4 M
2) Streetwear: 0.8 M
3) Frugal male fashion: 0.7 M
4) Female fashion advice: 0.6 M
What to scrape
The scraped data covers the following dimensions:
Post-related
• Post_Id
• Post_Title
• Post_Author
• Post_Timestamp
• Post_Points
Comment-related
• Comm_Id
• Post_Id
• Comm_Body
• Comm_Author
• Comm_Timestamp
• Comm_Points
Tools
Libraries: PRAW, datetime
Results
• 2 CSV files per subreddit (one for posts, one for comments)
• Only comments on the 1000 most popular posts per subreddit (due to the API limit)
• 660 K comments
• Total CSV size: about 250 MB
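As an illustration of the comment schema listed above, here is a minimal sketch of how one scraped comment could be turned into a CSV row. The `Comment` dataclass is a hypothetical stand-in for whatever object the scraper returns (it is not PRAW's own class), and the assumption that timestamps arrive as Unix epoch seconds is ours, not something the slides state.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime, timezone

# Column names taken from the "Comment-related" list on the slide.
COMMENT_COLUMNS = ["Comm_Id", "Post_Id", "Comm_Body", "Comm_Author",
                   "Comm_Timestamp", "Comm_Points"]


@dataclass
class Comment:
    """Hypothetical container for one scraped comment."""
    comm_id: str
    post_id: str
    body: str
    author: str
    created_utc: float  # assumed: Unix epoch seconds
    score: int


def comment_to_row(c: Comment) -> dict:
    """Map a comment to the CSV columns, converting the timestamp to UTC text."""
    ts = datetime.fromtimestamp(c.created_utc, tz=timezone.utc)
    return {
        "Comm_Id": c.comm_id,
        "Post_Id": c.post_id,
        "Comm_Body": c.body,
        "Comm_Author": c.author,
        "Comm_Timestamp": ts.strftime("%Y-%m-%d %H:%M:%S"),
        "Comm_Points": c.score,
    }


def comments_to_csv(comments) -> str:
    """Serialize comments to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COMMENT_COLUMNS)
    writer.writeheader()
    for c in comments:
        writer.writerow(comment_to_row(c))
    return buffer.getvalue()
```

Writing one such file for posts and one for comments per subreddit would reproduce the "2 CSV files per subreddit" layout described in the results.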

Dataset overview: comments and users 1/2
• Low % of comments written by inactive users (closed accounts)
• Subscribers to FrugalFemaleFashion write on average more comments than subscribers of the other subreddits (4.8 vs 3.2 comments per user), and their comments are on average longer (256 characters per comment vs 139.7)
• MaleFashionAdvice appears dated (its 1000 most popular posts gather comments mainly from 2013-2017)
• Streetwear and FrugalFemaleFashion have mostly comments written in 2017-2018
[Chart: Comments length (char); bar values 161.4, 134.6, 256.0, 98.5]

Dataset overview: comments and users 2/2
• Karma scores can be used to identify the most engaging users, i.e. those receiving the highest number of upvotes to their comments. This is a preliminary step for the identification of influencers.
• The top 10 users by karma are much more “productive” in terms of number of comments
• Comments written by the top 10 users receive about twice the score of those written by other users
[Chart: Average # of comments per user]
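The karma ranking described above can be sketched as follows. This is an assumption-laden simplification: karma is approximated here as the sum of the scores of the scraped comments only, and the function names and the `(author, score)` input shape are hypothetical.

```python
from collections import defaultdict


def karma_by_user(comments):
    """Sum comment scores per author; comments is an iterable of (author, score)."""
    karma = defaultdict(int)
    for author, score in comments:
        karma[author] += score
    return dict(karma)


def top_users(comments, n=10):
    """Return the n authors with the highest summed score (ties broken by name)."""
    karma = karma_by_user(comments)
    return sorted(karma, key=lambda user: (-karma[user], user))[:n]
```

For example, `top_users(comments, 10)` would give the candidate list from which the comparison between top users and other users could be computed.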

Data cleaning
Steps
1. Delete comments having:
• Missing id
• Missing text
• Missing timestamp
2. Delete comments shorter than 15 characters
3. Delete comments not in English
4. Remove links
5. Remove unwanted characters: '\n', '\r', '*', '$', '&', '[', ']', '(', ')', '’'
6. Convert all text to lowercase
7. Remove stopwords (not done for sentiment analysis)
Libraries
NLTK, LANGID, RE, OS
Example
Before:
“He looks terrible... what are you people smoking? There's more than enough elegant and stylish apparel for people his age... he should rock a light blue three piece, gold pocketwatch and a white fedora or sth, but not this”
After:
“looks terrible... people smoking? theres enough elegant stylish apparel people age... rock light blue three piece, gold pocketwatch white fedora sth”
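The cleaning steps listed above can be sketched with the standard library alone. This is not the authors' code: the language filter (step 3) is left out because it needs an external detector, the URL pattern is our own, and `STOPWORDS` is a tiny hypothetical list standing in for the NLTK English stopword list.

```python
import re

# Assumed URL pattern: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

# Line breaks become spaces; the other listed characters are dropped.
CHAR_TABLE = str.maketrans({
    "\n": " ", "\r": " ", "*": "", "$": "", "&": "",
    "[": "", "]": "", "(": "", ")": "", "’": "",
})

# Hypothetical mini stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "are", "but", "for", "he", "his", "is", "not",
             "or", "should", "than", "the", "there", "this", "what", "you"}


def clean_comment(text, remove_stopwords=True, min_length=15):
    """Apply steps 1-2 and 4-7; return None when the comment is discarded."""
    if not text or len(text) < min_length:
        return None
    text = URL_PATTERN.sub(" ", text)           # step 4: remove links
    text = text.translate(CHAR_TABLE).lower()   # steps 5 and 6
    tokens = text.split()
    if remove_stopwords:                        # step 7, skipped for sentiment
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)
```

Calling `clean_comment(text, remove_stopwords=False)` gives the variant kept for sentiment analysis, where negations such as "not" must survive.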

Sentiment analysis
Sentiment analysis was performed on the preprocessed text, but without stopword removal, as this could have strongly decreased the accuracy of the outcome: some negation words are in the NLTK stopwords list, so a phrase such as «this is not good» would lose the «not» and the sentiment would be wrongly assigned.
Using the TextBlob library, the polarity of each comment was evaluated.
• Is the subreddit community more engaged by positive, neutral or negative comments? In other words, is a positive comment more likely to have a higher score than a negative comment?
• Does this vary depending on the subreddit?
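The reason for keeping stopwords before sentiment scoring can be shown with a toy example. The scorer below is a deliberately simple, hypothetical lexicon with next-word negation; it is a stand-in used only to illustrate the effect and is not how TextBlob computes polarity.

```python
# Hypothetical mini lexicons, for illustration only.
POSITIVE = {"good", "great", "stylish", "elegant"}
NEGATIVE = {"bad", "terrible", "ugly"}
NEGATIONS = {"not", "no", "never"}


def toy_polarity(text):
    """Count positive minus negative words, flipping the word right after a negation."""
    score = 0
    negate = False
    for token in text.lower().split():
        if token in NEGATIONS:
            negate = True
            continue
        value = (token in POSITIVE) - (token in NEGATIVE)
        if value:
            score += -value if negate else value
        negate = False  # a negation only affects the next token
    return score
```

With this scorer, "this is not good" is negative, while the stopword-stripped version "good" is positive, which is the misassignment the slide warns about.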

Topic – LDA 1/5
• Latent Dirichlet Allocation (LDA) model for discovering the abstract “topics” that occur in our collection of comments.
LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.
The model was applied to the comments in order to extract six topics.
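For reference, the generative process of standard LDA can be written out as below. The symbols ($\alpha$, $\beta$, $\theta_d$, $z_{d,n}$, $w_{d,n}$, $K$) are the usual textbook notation and do not appear in the slides; here each document $d$ is a comment and $K = 6$.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{topic mixture of document } d \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d)
  && \text{topic of the } n\text{-th word, } z_{d,n} \in \{1,\dots,K\} \\
w_{d,n} \mid z_{d,n} &\sim \mathrm{Categorical}(\beta_{z_{d,n}})
  && \text{word drawn from that topic's word distribution}
\end{align*}
```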

Topic – LDA 2/5
The analysis shows that there is no prevalent topic.

Topic – LDA 3/5
The first three topics are close to each other; some of the main words are: shoes, store, dress, outfit

Topic – LDA 4/5
Another group is represented by the fourth and the fifth topic. Some of the main words are: money, wallet, company, people.

Topic – LDA 5/5
The last topic is the farthest from the others, and the main words are: man, shit, fuck

Topic – Clustering 1/3
• To investigate further the topics discussed in the Reddit comments, a new workflow was developed: Doc2Vec, Clustering, LDA
1. Doc2Vec: a document-to-vector model was applied in order to compute the cosine similarity matrix.
2. Clustering: the comments were clustered into six groups using the k-means algorithm fitted on the similarity matrix. k = 6 was chosen by looking at the silhouette score.
3. LDA: an LDA model was applied to each cluster in order to give a title to each cluster.
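The cosine similarity matrix mentioned in the workflow can be sketched in plain Python as follows. The input is assumed to be a list of document vectors (for instance those produced by a Doc2Vec model); the function names are hypothetical, and returning 0.0 for a zero vector is our own convention.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities; entry [i][j] compares vectors i and j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]
```

The resulting square matrix is the kind of input on which a k-means model could then be fitted, one row per comment.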

The comments are not perfectly separated, which causes topics to overlap across clusters. After an LDA analysis, the clusters can be named as follows:
• Cluster 0: shoes, bought, cheap
• Cluster 1: shoes, socks, people
• Cluster 2: people, good, price
• Cluster 3: price, shoes, people
• Cluster 4: sale, time, shoes
• Cluster 5: socks, price, buy

A sentiment analysis was performed for each comment, then an average sentiment score was assigned to each cluster. This analysis shows that the clusters do not differ from a sentiment point of view either: the average sentiment is close to zero everywhere.

Word Embedding: Word2Vec Model
We decided to use the Gensim package for word embedding. Right at the beginning we faced two problems:
1. Model tuning: gensim.models.Word2Vec has more than 20 hyperparameters
2. Model evaluation: there is no score/metric to compare the performance of different models
Too many parameters
Simple solution: using the default values of the function. We didn’t use this solution!
No comparison metric
Simple solution: comparing models based on the similar words they find (by cosine similarity) for specific words. We didn’t use this solution either!

Word Embedding: HRRC for Model Evaluation
Following the research done at Cornell University (Schnabel et al., 2015), we decided to develop our own “intrinsic evaluation” method (HRRC: Human Rate-Rank Comparison) using the WordSim-353 dataset (Finkelstein et al., 2002). The WordSim-353 dataset contains 353 pairs of words, each with the average similarity score (0-10) given by 16 people.
How it works
$\mathrm{HR} = \text{Human Rate} / 10$
$\mathrm{MR} = 1 - \dfrac{\text{Model Rank} - 1}{N}$, where $N$ is the size term in the denominator (11553 in the example below; its label is not legible in the source)
$\delta = \mathrm{HR} - \mathrm{MR}$
Example for the pair (Smart, Student), with human rate 4.62:
$R2 = \dfrac{4.62}{10} = 0.462$
$R1\_1 = 1 - \dfrac{501 - 1}{11553} = 0.956$
$R1\_2 = 1 - \dfrac{1728 - 1}{11553} = 0.850$
delta 1 = 0.462 - 0.956 = -0.494
delta 2 = 0.462 - 0.850 = -0.388
There were 138 pairs of words which existed both in our data and in the WS353 dataset. This process was repeated for all 138 pairs. To summarize these deltas as a single value, we calculated the median of the sum of squared deltas:
$\mathrm{HRRC} = \mathrm{median}(\text{deltas})$
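The HRRC calculation can be sketched as below. This follows one reading of the slide and is not the authors' code: each word pair is assumed to contribute two deltas (one per rank direction, as in the Smart/Student example), the per-pair value is taken to be the sum of the two squared deltas, and the final score is the median over pairs. All function and parameter names are hypothetical.

```python
from statistics import median


def human_rate(score, scale=10):
    """HR: human similarity score rescaled to [0, 1]."""
    return score / scale


def model_rate(rank, size):
    """MR: 1 - (rank - 1) / size, so rank 1 maps to 1.0."""
    return 1 - (rank - 1) / size


def pair_deltas(score, rank_ab, rank_ba, size):
    """Return (delta 1, delta 2) = HR - MR for the two rank directions."""
    hr = human_rate(score)
    return hr - model_rate(rank_ab, size), hr - model_rate(rank_ba, size)


def hrrc(pairs, size):
    """Median over pairs of the sum of squared deltas (assumed aggregation).

    pairs: iterable of (human score, rank_ab, rank_ba).
    """
    sums = []
    for score, rank_ab, rank_ba in pairs:
        d1, d2 = pair_deltas(score, rank_ab, rank_ba, size)
        sums.append(d1 ** 2 + d2 ** 2)
    return median(sums)
```

Calling `pair_deltas(4.62, 501, 1728, 11553)` reproduces the two deltas of the worked example up to rounding in the third decimal.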

word2vec Model Tuning
After developing HRRC, we used AWS EC2 (t2.medium instance) to perform a grid search over the following hyperparameters:
• Minimum length of a comment to be considered in the model (from 30 to 45 characters)
• Gensim Word2Vec iter parameter (from 5 to 30)
• Word2Vec algorithm (CBOW and Skip-Gram)
• Size of the output vector (from 100 to 1400)
Grid search: 672 models, 25.3 hours
[Figures: Best parameters; 3-D scatter plot of all HRRC values]
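A grid of this kind can be enumerated with `itertools.product`. The slides give only the ranges, so the step sizes below are an assumption, chosen because they happen to yield exactly 672 combinations (4 x 6 x 2 x 14). The scoring callable is a placeholder for training a model and computing its HRRC, and treating a lower HRRC as better is also our assumption.

```python
from itertools import product

# Ranges from the slide; step sizes are assumed.
MIN_LENGTHS = range(30, 46, 5)         # 30, 35, 40, 45
ITERATIONS = range(5, 31, 5)           # 5, 10, ..., 30
ALGORITHMS = ("CBOW", "Skip-Gram")
VECTOR_SIZES = range(100, 1401, 100)   # 100, 200, ..., 1400


def build_grid():
    """Return every hyperparameter combination as a dict."""
    return [
        {"min_length": m, "iter": i, "algorithm": a, "size": s}
        for m, i, a, s in product(MIN_LENGTHS, ITERATIONS,
                                  ALGORITHMS, VECTOR_SIZES)
    ]


def best_configuration(grid, score):
    """Pick the configuration with the lowest score (lower assumed better)."""
    return min(grid, key=score)
```

In recent Gensim releases the `iter` and `size` arguments of `Word2Vec` are named `epochs` and `vector_size`, and `sg=1` selects Skip-Gram, so the dictionary keys would need mapping accordingly.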

Model Visualization - 1
[Figures: Hierarchical clustering (800 comments); t-SNE for various algorithms and model iterations (Skip-Gram, CBOW)]

Model Visualization - 2
[Figure: t-SNE, grid search; ~16 hours, 87 models]

Model Visualization - 3
[Figure: t-SNE visualization of the final Word2Vec model, 2D and 3D]

NER – Named Entity Recognition
NER is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

NER – spaCy
What is spaCy?
An open-source library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products.
https://spacy.io/
Features
• Fastest syntactic parser in the world
• Named entity recognition
• Non-destructive tokenization
• Support for 20+ languages
• Pre-trained statistical models and word vectors
• Easy deep learning integration
• …
[Figure: Architecture]
