Fashion insights
Angela Ciliberti
Michele di Padova
Francesco Morazzoni
Navid Nobani
Reddit fashion insights: scope & phases
SCOPE
Analyzing fashion-related comments on Reddit to answer the following questions:
1. How many people talk and read about fashion on Reddit?
2. Are there any influencers?
3. Which are the most popular fashion-related topics and brands?
4. What is the sentiment towards a given topic or brand, and how does it evolve over time?
PROJECT PHASES
[Diagram of the project phases; surviving labels: word2vec, reach & influencers]
How Reddit works
Key features
• Subreddit: a community that groups Reddit users interested in a given topic.
• Post/comment score: users express appreciation or disapproval of a post or comment by upvoting or downvoting it. Each upvote is worth +1 and each downvote -1. The score is used here as a proxy for engagement.
• User karma: the net sum of upvotes and downvotes across all the posts and comments produced by the user.
Scraping
Where to scrape from
After searching for «fashion», subreddits were selected according to their relevance and number of subscribers:
1) Male Fashion Advice: 1.4 M
2) Streetwear: 0.8 M
3) Frugal male fashion: 0.7 M
4) Female fashion advice: 0.6 M
What to scrape
The scraped data covers the following fields:
Post-related
• Post_Id
• Post_Title
• Post_Author
• Post_Timestamp
• Post_Points
Comment-related
• Comm_Id
• Post_id
• Comm_Body
• Comm_Author
• Comm_Timestamp
• Comm_Points
Tools
Libraries: PRAW, datetime
Results
• 2 CSV files per subreddit (one for posts and one for comments)
• Only comments on the 1000 most popular posts per subreddit (due to the API limit)
• 660 K comments
• Total CSV size: about 250 MB
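The scraping step can be sketched as follows. This is a minimal illustration, not the authors' script: the function names, the credential placeholders and the output path are hypothetical, and only the comment CSV is shown. It relies on PRAW's `subreddit(...).top(...)`, `comments.replace_more(...)` and `comments.list()`, and maps each comment to the comment-related fields listed on the slide.

```python
import csv
from datetime import datetime, timezone


def comment_row(post_id, comment):
    """Map a PRAW-like comment object to the comment-related fields."""
    author = comment.author  # None when the account has been deleted
    return {
        "Comm_Id": comment.id,
        "Post_Id": post_id,
        "Comm_Body": comment.body,
        "Comm_Author": author.name if author is not None else None,
        "Comm_Timestamp": datetime.fromtimestamp(
            comment.created_utc, tz=timezone.utc
        ).isoformat(),
        "Comm_Points": comment.score,
    }


def scrape_comments(subreddit_name, out_path, limit=1000):
    """Write the comments of the top `limit` posts of one subreddit to a CSV file."""
    import praw  # third-party; imported lazily so comment_row works without it

    # Placeholder credentials (hypothetical): use those of your own Reddit app.
    reddit = praw.Reddit(
        client_id="YOUR_ID",
        client_secret="YOUR_SECRET",
        user_agent="fashion-insights-sketch",
    )
    fields = ["Comm_Id", "Post_Id", "Comm_Body", "Comm_Author",
              "Comm_Timestamp", "Comm_Points"]
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        for post in reddit.subreddit(subreddit_name).top(limit=limit):
            post.comments.replace_more(limit=0)  # drop "load more comments" stubs
            for comment in post.comments.list():
                writer.writerow(comment_row(post.id, comment))
```

A call such as `scrape_comments("streetwear", "streetwear_comments.csv")` would produce one of the two CSV files per subreddit; the posts file could be written the same way from the post attributes.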
Dataset overview: comments and users 1/2
• Low % of comments written by inactive users (closed accounts)
• Subscribers to FrugalFemaleFashion write on average more comments than subscribers of the other subreddits (4.8 vs 3.2 comments per user), and their comments are on average longer (256 characters per comment vs 139.7)
• MaleFashionAdvice seems to be obsolete (its 1000 most popular posts gather comments mainly from 2013-2017)
• Streetwear and FrugalFemaleFashion have mostly comments written in 2017-2018
[Chart: Comments length (char); bar labels: 161.4, 134.6, 256.0, 98.5]
Dataset overview: comments and users 2/2
• Karma scores can be used to identify the most engaging users, i.e. those receiving the highest number of upvotes on their comments. This is a preliminary step towards identifying influencers.
• The top 10 users by karma are much more “productive” in terms of number of comments
• Comments written by the top 10 users receive about twice the score of those written by other users
[Chart: Average # of comments per user; bar labels: 826, 3, 3, 35, 267, 287, 270]
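The ranking of users by karma can be illustrated with a short aggregation. This is a sketch under an assumption: karma is approximated by the sum of the scores of a user's comments within the scraped data, which may differ from the karma shown on Reddit profiles. The function names are hypothetical.

```python
from collections import defaultdict


def user_stats(comments):
    """Aggregate (author, score) pairs into per-user comment counts and total score."""
    stats = defaultdict(lambda: {"n_comments": 0, "total_score": 0})
    for author, score in comments:
        if author is None:  # deleted account: cannot be attributed to a user
            continue
        stats[author]["n_comments"] += 1
        stats[author]["total_score"] += score
    return dict(stats)


def top_users(stats, n=10):
    """Return the n user names with the highest total comment score."""
    return sorted(stats, key=lambda u: stats[u]["total_score"], reverse=True)[:n]
```

Comparing `n_comments` and the mean score per comment between `top_users(stats)` and everyone else gives the two comparisons made on the slide.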
Data cleaning
Steps
1. Delete comments having:
• Missing id
• Missing text
• Missing timestamp
2. Delete comments shorter than 15 characters
3. Delete comments not in English
4. Remove links
5. Remove special characters: '\n', '\r', '*', '$', '&', '[', ']', '(', ')', '’'
6. Convert all text to lowercase
7. Remove stopwords (not done for sentiment analysis)
Libraries
NLTK, LANGID, RE, OS
Example
Before: “He looks terrible... what are you people smoking? There's more than enough elegant and stylish apparel for people his age... he should rock a light blue three piece, gold pocketwatch and a white fedora or sth, but not this”
After: “looks terrible... people smoking? theres enough elegant stylish apparel people age... rock light blue three piece, gold pocketwatch white fedora sth”
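The cleaning steps can be sketched with the standard library alone. Several things here are assumptions: the regular expressions, the function name, and the tiny `STOPWORDS` set, which only stands in for the NLTK English stopword list. The language filter (step 3, done with LANGID in the project) is left out.

```python
import re

# Tiny stand-in for the NLTK English stopword list (assumption, for illustration).
STOPWORDS = {"he", "what", "are", "you", "more", "than", "for", "his",
             "should", "a", "and", "or", "but", "not", "this", "is", "the"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
LINE_BREAK_RE = re.compile(r"[\n\r]")
SPECIAL_RE = re.compile(r"[*$&\[\]()'’]")


def clean_comment(text, remove_stopwords=True, min_chars=15):
    """Return the cleaned comment, or None if it should be dropped."""
    if not text or len(text) < min_chars:   # steps 1-2: missing or too short
        return None
    text = URL_RE.sub(" ", text)            # step 4: remove links
    text = LINE_BREAK_RE.sub(" ", text)     # step 5: line breaks become spaces
    text = SPECIAL_RE.sub("", text)         # step 5: drop the other characters
    tokens = text.lower().split()           # step 6: lowercase
    if remove_stopwords:                    # step 7: skipped for sentiment analysis
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)
```

Because the apostrophe is stripped before stopword removal, "There's" becomes "theres" and survives the filter, which matches the example on the slide.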
Sentiment analysis
Sentiment analysis was run on the preprocessed text, but without stopword removal, since removing stopwords could have strongly reduced the accuracy of the outcome: some negation words are in the NLTK stopword list, so a phrase such as «this is not good» would lose the «not» and be assigned the wrong sentiment.
The polarity of each comment was evaluated with the TextBlob library.
• Is the subreddit community more engaged by positive, neutral or negative comments? In other words, is a positive comment more likely to have a higher score than a negative comment?
• Does this vary depending on the subreddit?
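One way to approach the two questions above is to bucket comments by polarity and compare the mean score per bucket. The sketch below is illustrative: the neutral band `eps` and the function names are assumptions, and only `TextBlob(text).sentiment.polarity` comes from the TextBlob library.

```python
from collections import defaultdict


def polarity_of(text):
    """Polarity in [-1, 1] from TextBlob (third-party, imported lazily)."""
    from textblob import TextBlob
    return TextBlob(text).sentiment.polarity


def label(polarity, eps=0.05):
    """Bucket a polarity value; the neutral band `eps` is an arbitrary assumption."""
    if polarity > eps:
        return "positive"
    if polarity < -eps:
        return "negative"
    return "neutral"


def mean_score_by_label(rows):
    """rows: iterable of (polarity, comment_score). Returns the mean score per bucket."""
    totals = defaultdict(lambda: [0, 0])  # label -> [sum of scores, count]
    for polarity, score in rows:
        bucket = totals[label(polarity)]
        bucket[0] += score
        bucket[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}
```

Running `mean_score_by_label` once per subreddit, on pairs built from `polarity_of(comment_body)` and the comment score, would show whether the pattern varies across communities.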
Topic – LDA 1/5
• A Latent Dirichlet Allocation (LDA) model was used to discover the abstract “topics” that occur in our collection of comments.
LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.
The model was applied to the comments to extract six topics.
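The generative process behind this description can be written out in the usual (smoothed) LDA notation. The symbols are standard in the LDA literature but do not appear on the slides, so they are assumptions here: $\alpha$ and $\eta$ are Dirichlet hyperparameters, $d$ indexes comments, $n$ indexes word positions, and $k$ indexes the six topics.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{topic mixture of comment } d \\
\beta_k &\sim \mathrm{Dirichlet}(\eta)
  && \text{word distribution of topic } k,\; k = 1, \dots, 6 \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d)
  && \text{topic assigned to the } n\text{-th word of comment } d \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Categorical}(\beta_{z_{d,n}})
  && \text{observed word}
\end{align*}
```

Fitting the model means inferring $\theta_d$ and $\beta_k$ from the observed words $w_{d,n}$ only; the topics' top words are read off the largest entries of each $\beta_k$.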
Topic – LDA 2/5
The analysis shows that there is no prevalent topic.
Topic – LDA 3/5
The first three topics are close to each other; some of their main words are: shoes, store, dress, outfit.
Topic – LDA 4/5
Another group is formed by the fourth and fifth topics. Some of their main words are: money, wallet, company, people.
Topic – LDA 5/5
The last topic is the farthest from the others; its main words are: man, shit, fuck.
Topic – Clustering 1/3
• To better investigate the topics treated in the Reddit comments, a new workflow was developed:
Doc2Vec → Clustering → LDA
• Doc2Vec: a document-to-vector model was applied in order to compute the cosine similarity matrix.
• Clustering: the comments were clustered into six groups using the k-means algorithm fitted on the similarity matrix. The value k = 6 was chosen by looking at the silhouette score.
• LDA: an LDA model was applied to each cluster in order to give a title to each cluster.
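Two building blocks of this workflow, the cosine similarity matrix and the silhouette score used to pick k, can be sketched in plain Python. This is an illustration under assumptions: the document vectors are taken as given (Doc2Vec training and k-means itself are not shown), vectors are nonzero, at least two clusters exist, and the silhouette uses 1 - cosine similarity as the distance, which may differ from the authors' setup.

```python
import math


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between document vectors (lists of floats)."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv)
         for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]


def silhouette_score(similarity, labels):
    """Mean silhouette coefficient, using 1 - cosine similarity as the distance."""
    n = len(labels)
    clusters = sorted(set(labels))
    total = 0.0
    for i in range(n):
        # Mean distance from point i to the members of each cluster (excluding i).
        mean_dist = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(1.0 - similarity[i][j] for j in members) / len(members)
        if labels[i] not in mean_dist:
            continue  # singleton cluster: its silhouette is 0 by convention
        a = mean_dist[labels[i]]                                     # cohesion
        b = min(d for c, d in mean_dist.items() if c != labels[i])   # separation
        denom = max(a, b)
        total += (b - a) / denom if denom else 0.0
    return total / n
```

To choose k, one would run the clustering for several candidate values, compute `silhouette_score` for each labelling, and keep the value with the highest score.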
Topic – Clustering 2/3
The comments are not perfectly separated, which causes the topics of the clusters to overlap. After an LDA analysis, the clusters can be named as follows:
• Cluster 0: shoes, bought, cheap
• Cluster 1: shoes, socks, people
• Cluster 2: people, good, price
• Cluster 3: price, shoes, people
• Cluster 4: sale, time, shoes
• Cluster 5: socks, price, buy
Topic – Clustering 3/3
A sentiment analysis was performed on each comment, and an average sentiment score was then assigned to each cluster. This analysis shows that the clusters do not differ in terms of sentiment either: the average sentiment is close to zero everywhere.
Word Embedding: Word2Vec Model
We decided to use the Gensim package for word embedding. Right at the beginning we faced two problems:
1. Model tuning: gensim.models.Word2Vec has more than 20 hyperparameters
2. Model evaluation: there is no score/metric to compare the performance of different models
Too many parameters
Simple solution: using the default values of the function.
We didn’t use this solution!
No comparison metric
Simple solution: comparing models based on the similar words they find (by cosine similarity) for specific words.
We didn’t use this solution either!
Word Embedding: HRRC for Model Evaluation
Following the research done at Cornell University (Schnabel et al., 2015), we decided to develop our own “intrinsic evaluation” method, HRRC (Human Rate-Rank Comparison), using the WordSim-353 dataset (Finkelstein et al., 2002). The WordSim dataset contains 353 pairs of words, each with the average of the similarity scores (0-10) given by 16 people.
How it works
$HR = \frac{\text{Human Rate}}{10}$
$MR = 1 - \frac{\text{Model Rank} - 1}{\text{Vocabulary Size}}$
$\text{delta} = HR - MR$
Example for the pair (Smart, Student), with a human rate of 4.62, model ranks of 501 and 1728, and a vocabulary size of 11553:
R2: $\frac{4.62}{10} = 0.462$
R1_1: $1 - \frac{501 - 1}{11553} = 0.956$
R1_2: $1 - \frac{1728 - 1}{11553} = 0.850$
delta 1 = 0.462 - 0.956 = -0.494
delta 2 = 0.462 - 0.850 = -0.388
There were 138 pairs of words that existed both in our data and in the WS353 dataset. This process was repeated for all 138 pairs. To summarize these deltas as a single value, we calculated the median of the sum of squared deltas:
$HRRC = \mathrm{median}(\text{delta}^2)$
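The HRRC computation can be sketched as below. This follows one reading of the slide, stated here as an assumption: each word pair yields two deltas (one per model rank), their squares are summed per pair, and the median is taken over pairs. The function names are hypothetical.

```python
from statistics import median


def model_rate(rank, vocab_size):
    """MR = 1 - (rank - 1) / vocab_size; rank 1 (most similar) gives 1.0."""
    return 1 - (rank - 1) / vocab_size


def pair_deltas(human_rate, rank_1, rank_2, vocab_size):
    """Deltas HR - MR for the two model ranks of one word pair."""
    hr = human_rate / 10
    return hr - model_rate(rank_1, vocab_size), hr - model_rate(rank_2, vocab_size)


def hrrc(pairs, vocab_size):
    """Median over word pairs of the sum of squared deltas.

    pairs: iterable of (human_rate, rank_1, rank_2).
    """
    sums = []
    for human_rate, rank_1, rank_2 in pairs:
        d1, d2 = pair_deltas(human_rate, rank_1, rank_2, vocab_size)
        sums.append(d1 ** 2 + d2 ** 2)
    return median(sums)


# The (Smart, Student) example from the slide.
d1, d2 = pair_deltas(4.62, 501, 1728, 11553)
print(round(d1, 3), round(d2, 3))
```

The printed deltas agree with the slide's values up to rounding, since the slide rounds the model rates before subtracting.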
word2vec Model Tuning
After developing HRRC, we used AWS EC2 (a t2.medium instance) to perform a grid search over the following hyperparameters:
• Minimum length of a comment to be considered in the model (from 30 to 45 characters)
• Gensim Word2Vec iter parameter (from 5 to 30)
• Word2Vec algorithm (CBOW and Skip-Gram)
• Size of the output vector (from 100 to 1400)
672 models, 25.3 hours
[Figures: best parameters; 3-D scatter plot of all HRRC values]
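The grid search can be sketched as an enumeration of parameter combinations scored by a callable. The ranges come from the slide, but the step sizes are assumptions, chosen so that the grid has 672 combinations like the number of models reported. The scoring function is a placeholder for "train a Word2Vec model and compute its HRRC", and treating a lower score as better is also an assumption.

```python
from itertools import product

# Ranges are from the slide; step sizes are assumptions (4 * 6 * 2 * 14 = 672).
GRID = {
    "min_comment_chars": range(30, 46, 5),   # 30, 35, 40, 45
    "epochs": range(5, 31, 5),               # gensim 3.x `iter` (`epochs` in 4.x)
    "sg": (0, 1),                            # 0 = CBOW, 1 = Skip-Gram
    "vector_size": range(100, 1401, 100),    # gensim 3.x `size` (`vector_size` in 4.x)
}


def grid_points(grid):
    """Yield one dict per combination of parameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, score_fn):
    """Return (best_params, best_score), assuming a lower score is better."""
    scored = ((params, score_fn(params)) for params in grid_points(grid))
    return min(scored, key=lambda item: item[1])
```

In practice `score_fn` would filter the comments by `min_comment_chars`, train `gensim.models.Word2Vec` with the remaining parameters, and return the model's HRRC.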
Model Visualization - 1
Hierarchical clustering (800 comments)
Model Visualization - 2
t-SNE grid search: various algorithms (Skip-Gram, CBOW) and model iterations
~16 hours, 87 models
Model Visualization - 3
t-SNE visualization of the final Word2Vec model (2D and 3D)
NER – Named Entity Recognition
NER is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
NER – spaCy
What is spaCy?
An open-source library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products.
https://spacy.io/
Features
• Fastest syntactic parser in the world
• Named entity recognition
• Non-destructive tokenization
• Support for 20+ languages
• Pre-trained statistical models and word vectors
• Easy deep learning integration
• …
[Diagram: Architecture]
NER – results
[Figure: ORG entities]
NER – results
[Figure: GPE entities]
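Entity counts per label, such as those behind the ORG and GPE result slides, could be obtained along these lines. This is a sketch: the model name `en_core_web_sm` is an assumption (the slides do not say which spaCy model was used), the function names are hypothetical, and lowercasing entity strings before counting is a design choice of this example.

```python
from collections import Counter


def extract_entities(texts, model_name="en_core_web_sm"):
    """Yield (entity_text, entity_label) pairs using spaCy (imported lazily)."""
    import spacy
    nlp = spacy.load(model_name)  # the model must be downloaded beforehand
    for doc in nlp.pipe(texts):
        for ent in doc.ents:
            yield ent.text, ent.label_


def top_entities(pairs, label, n=10):
    """Most frequent entity strings carrying the given label (e.g. "ORG" or "GPE")."""
    counts = Counter(text.lower() for text, ent_label in pairs if ent_label == label)
    return counts.most_common(n)
```

For example, `top_entities(list(extract_entities(comments)), "ORG")` would list the organizations (typically brands, in this corpus) mentioned most often.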

Laboratorio Master BI&BDA (Modulo Web Data Analytics) : Reddit fashion insights

  • 1. Fashion 1 insights Angela Ciliberti Michele di Padova Francesco Morazzoni Navid Nobani
  • 2. Reddit fashion insights: scope & phases Analyzing fashion-related comments in reddit, answering the following questions: 1. How many people talk and read about fashion in Reddit? 2. Are there any influencers? 3. Which are the most popular fashion related topics and brands? 4. Which is the sentiment with respect to a certain topic/brand and how does it evolve over time? SCOPE PROJECT PHASES word2vec reach & influencers
  • 3. • Subreddit: communities in which Reddit users are grouped if they are interested in the related topic. • Post/comment score: users can express their appreciation/disregard towards a certain post or comment, by upvoting or downvoting it. Each upvote is worth +1 , while each downvote -1. Proxy of engagement. • User karma: sum of upvotes and downvotes related to all the posts and comments produced by the user. How Reddit works Key features
  • 4. Scraping After launching «fashion» as search key, subreddits were selected according to their relevance and the largest number of subscribers: 1) Male Fashion Advice: 1.4 M 2) Streetwear: 0.8 M 3) Frugal male fashion: 0.7 M 4) Female fashion advice: 0.6 M Tools Where to scrape from The data to be scraped refers to the following dimensions: What to scrape Post-related • Post_Id • Post_Title • Post_Author • Post_Timestamp • Post_Points Comment-related • Comm_Id • Post_id • Comm_Body • Comm_Author • Comm_Timestamp • Comm_Points Results • 2 csv per subreddit (1 about Posts and 1 about Comments) • Only comments related to the top 1000 most popular posts per subreddit (due to API limit) • 660 K comments • Total csv size: about 250 MB Libraries PRAW datetime
  • 5. • Low % of comments written by inactive users (closed accounts) • Subscribers to FrugalFemaleFashion write on average more comments than subscribers of the other subreddit (4.8 vs 3.2 comments per user) and their comments are on average longer (256 characters per comment vs 139.7) Dataset overview: comments and users 1/2 • MaleFashionAdvice seems to be obsolete (the most popular 1000 posts gather comments mainly from 2013-2017) • Streetswear and FrugalFemaleFashion have mostly comments written in 2017-2018 161,4 134,6 256,0 98,5 Comments length (char)
  • 6. Dataset overview: comments and users 2/2 • Karma scores can be used to identify the most engaging users, i.e. those receiving the highest number of upvotes on their comments. This is a preliminary step towards identifying influencers. • The top 10 users by karma are much more “productive” in terms of number of comments • Comments written by the top 10 users receive about twice the score of those written by other users [Chart: average # of comments per user; bar values 826, 3, 3, 35, 267, 287, 270]
  • 7. Data cleaning. Steps: 1. Delete comments with: • Missing id • Missing text • Missing timestamp 2. Delete comments shorter than 15 characters 3. Delete comments not in English 4. Remove links 5. Remove strange characters: '\n', '\r', '*', '$', '&', '[', ']', '(', ')', «’» 6. Convert all text to lowercase 7. Remove stopwords (not done for sentiment analysis). Libraries: NLTK, langid, re, os. Example, before cleaning: “He looks terrible... what are you people smoking? There's more than enough elegant and stylish apparel for people his age... he should rock a light blue three piece, gold pocketwatch and a white fedora or sth, but not this” After cleaning: “looks terrible... people smoking? theres enough elegant stylish apparel people age... rock light blue three piece, gold pocketwatch white fedora sth”
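The cleaning steps above can be sketched with the standard library alone. This is an illustration under stated assumptions: the stopword set is a tiny hypothetical stand-in for the NLTK list, the English-language filter (done with langid in the slides) is omitted, line breaks are replaced by a space rather than deleted so that words do not merge, and all function names are made up for this example.

```python
import re

# Tiny illustrative stopword set; the slides use the NLTK English list.
STOPWORDS = {"he", "what", "are", "you", "this", "is", "not", "a", "the",
             "and", "for", "his", "should", "or", "but", "more", "than"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
LINE_BREAK_RE = re.compile(r"[\n\r]")
STRANGE_RE = re.compile(r"[*$&\[\]()'’]")


def keep_comment(record, min_length=15):
    """Steps 1-2: require id, text and timestamp, and a minimum length."""
    if not (record.get("id") and record.get("text") and record.get("timestamp")):
        return False
    return len(record["text"]) >= min_length


def clean_comment(text, remove_stopwords=True):
    """Steps 4-7: strip links and odd characters, lowercase, drop stopwords."""
    text = URL_RE.sub(" ", text)
    text = LINE_BREAK_RE.sub(" ", text)
    text = STRANGE_RE.sub("", text)
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [tok for tok in tokens if tok not in STOPWORDS]
    return " ".join(tokens)
```

Passing `remove_stopwords=False` gives the variant used for sentiment analysis, where negations such as «not» have to be kept.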
  • 8. Sentiment analysis. Sentiment analysis was run on the preprocessed text but without stopword removal, because removing stopwords could strongly reduce the accuracy of the outcome: some negation words are in the NLTK stopword list, so a phrase such as «this is not good» would lose the «not» and be assigned the wrong sentiment. The polarity of each comment was evaluated with the TextBlob library. Questions: • Is the subreddit community more engaged by positive, neutral or negative comments? In other words, is a positive comment more likely to have a higher score than a negative comment? • Does this vary depending on the subreddit?
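To make the negation problem concrete, here is a toy lexicon-based scorer with simple negation handling. It is not TextBlob and does not reproduce its polarity values; the word lists and the function name are hypothetical and chosen only to show how dropping «not» flips the sign.

```python
POSITIVE = {"good", "great", "stylish"}
NEGATIVE = {"bad", "terrible", "ugly"}
NEGATORS = {"not", "never", "no"}
# Illustrative stopword set that, like the NLTK list, contains "not".
STOPWORDS = {"this", "is", "not", "a", "the"}


def toy_polarity(tokens):
    """Sum +1/-1 per sentiment word, flipping the sign after a negator."""
    score = 0
    negate = False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        value = (tok in POSITIVE) - (tok in NEGATIVE)
        if value:
            score += -value if negate else value
            negate = False
    return score


tokens = "this is not good".split()
with_stopwords = toy_polarity(tokens)
without_stopwords = toy_polarity([t for t in tokens if t not in STOPWORDS])
```

With the full sentence the toy score is negative; once stopwords are removed only «good» survives and the score turns positive, which is the error the slide describes.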
  • 9. Topic – LDA 1/5 • A Latent Dirichlet Allocation (LDA) model was used to discover the abstract “topics” that occur in our comment collection. LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. The model was applied to the comments to extract six topics.
  • 10. Topic – LDA 2/5 The analysis shows that no single topic prevails
  • 11. Topic – LDA 3/5 The first three topics are close to each other; some of their main words are: shoes, store, dress, outfit
  • 12. Topic – LDA 4/5 Another group is formed by the fourth and fifth topics. Some of their main words are: money, wallet, company, people.
  • 13. Topic – LDA 5/5 The last topic is the farthest from the others, and its main words are: man, shit, fuck
  • 14. Topic – Clustering 1/3 • To investigate the topics treated in the Reddit comments further, a new workflow was developed: Doc2Vec, then clustering, then LDA. Doc2Vec: a document-to-vector model was applied in order to compute the cosine similarity matrix. Clustering: the comments were clustered into six groups using the k-means algorithm fitted on the similarity matrix; k = 6 was chosen by looking at the silhouette score. LDA: an LDA model was applied to each cluster in order to give each cluster a title.
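The two numeric ingredients of this workflow, the cosine similarity matrix and the silhouette score used to pick k, can be sketched in plain Python. This is an illustration only: the input vectors stand in for Doc2Vec embeddings, cosine distance is taken as 1 minus similarity (an assumption), the k-means step itself is not shown, and all names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """Pairwise cosine similarities between all document vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


def silhouette_score(sim, labels):
    """Mean silhouette over all points, with distance = 1 - similarity.

    Requires at least two distinct cluster labels.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton contributes 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(1 - sim[i][j] for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(sum(1 - sim[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)
```

In practice one would compute this score for the labels produced by k-means at several values of k and keep the k with the highest score.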
  • 15. Topic – Clustering 2/3 The comments are not perfectly separated, which causes topics to overlap across clusters. After an LDA analysis, the clusters can be named as follows: • Cluster 0: shoes, bought, cheap • Cluster 1: shoes, socks, people • Cluster 2: people, good, price • Cluster 3: price, shoes, people • Cluster 4: sale, time, shoes • Cluster 5: socks, price, buy
  • 16. Topic – Clustering 3/3 A sentiment analysis was performed on each comment, then an average sentiment score was assigned to each cluster. This analysis shows that the clusters do not differ from a sentiment point of view either: the average sentiment is close to zero everywhere.
  • 17. Word Embedding: Word2Vec Model. We decided to use the Gensim package for word embedding. Right at the beginning we faced two problems: 1. Model tuning (too many parameters): gensim.models.Word2Vec has more than 20 hyperparameters. Simple solution: using the function's default values. We didn't use this solution! 2. Model evaluation (no comparison metric): there is no score or metric to compare the performance of different models. Simple solution: comparing models based on the similar words they find (by cosine similarity) for specific words. We didn't use this solution either!
  • 18. Word Embedding: HRRC for Model Evaluation. Following research done at Cornell University (Schnabel et al., 2015), we decided to develop our own “intrinsic evaluation” method (HRRC: Human Rate-Rank Comparison) using the WordSim-353 dataset (Finkelstein et al., 2002). The WordSim dataset contains 353 pairs of words, each with the average similarity score (0-10) given by 16 people. How it works: HR = Human Rate / 10; MR = 1 − (Model Rank − 1) / Vocabulary Size; delta = HR − MR. Example for the pair (smart, student), human rate 4.62: R2 = 4.62 / 10 = 0.462; R1_1 = 1 − (501 − 1) / 11553 = 0.956; R1_2 = 1 − (1728 − 1) / 11553 = 0.850; delta 1 = 0.462 − 0.956 = −0.494; delta 2 = 0.462 − 0.850 = −0.388. There were 138 pairs of words that existed both in our data and in the WS353 dataset, and this process was repeated for all 138 pairs. To summarize these deltas as a single value, we calculated the median of the sum of squared deltas: HRRC = median(deltas)
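The HRRC arithmetic can be sketched as follows. The formulas for HR, MR and delta follow the slide; two things are assumptions made for this example: that each word pair yields two model ranks (one per direction of the pair, matching R1_1 and R1_2), and that "median of sum of squared deltas" means summing the two squared deltas per pair and taking the median over pairs. Function names are hypothetical.

```python
from statistics import median


def human_rate(score_0_to_10):
    """HR: human similarity score rescaled to the 0-1 range."""
    return score_0_to_10 / 10


def model_rate(rank, vocab_size):
    """MR: 1 for the top-ranked neighbour, decreasing with the rank."""
    return 1 - (rank - 1) / vocab_size


def hrrc(pairs, vocab_size):
    """Median over word pairs of the summed squared deltas.

    `pairs` holds (human_score, rank_1, rank_2) tuples, one per word pair.
    """
    summed = []
    for human_score, rank_1, rank_2 in pairs:
        hr = human_rate(human_score)
        delta_1 = hr - model_rate(rank_1, vocab_size)
        delta_2 = hr - model_rate(rank_2, vocab_size)
        summed.append(delta_1 ** 2 + delta_2 ** 2)
    return median(summed)


# The (smart, student) example from the slide.
example = hrrc([(4.62, 501, 1728)], vocab_size=11553)
```

Under this reading a lower HRRC means the model's similarity ranking is closer to the human judgements.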
  • 19. word2vec Model Tuning. Having developed HRRC, we used AWS EC2 (t2.medium instance) to perform a grid search over the following hyperparameters: • Minimum length of a comment to be considered in the model (from 30 to 45 characters) • Gensim Word2Vec iter parameter (from 5 to 30) • Word2Vec algorithm (CBOW and Skip-Gram) • Size of the output vector (from 100 to 1400). Grid search total: 672 models, 25.3 hours. [Figures: best parameters; 3-D scatter plot of all HRRC values]
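A grid search of this kind can be sketched with `itertools.product`. The grid keys and values below are hypothetical and do not reproduce the 672 configurations of the slides; `evaluate` is a placeholder for training a Word2Vec model with the given parameters and returning its HRRC, and treating lower HRRC as better is an assumption.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Evaluate every parameter combination and sort by score, lowest first."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((evaluate(params), params))
    results.sort(key=lambda item: item[0])
    return results


# Hypothetical grid over the four dimensions named on the slide.
grid = {
    "min_comment_length": [30, 45],
    "iter": [5, 30],
    "sg": [0, 1],  # 0 = CBOW, 1 = Skip-Gram
    "size": [100, 300],
}


def dummy_evaluate(params):
    """Stand-in for 'train a model and compute HRRC' (not a real metric)."""
    return abs(params["size"] - 300) + params["iter"]


ranking = grid_search(grid, dummy_evaluate)
best_score, best_params = ranking[0]
```

The number of models grows as the product of the list lengths, which is why even a modest grid can take many hours on a small instance.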
  • 20. Model Visualization - 1 [Figures: hierarchical clustering (800 comments); t-SNE for various algorithms and model iterations, Skip-Gram and CBOW]
  • 21. Model Visualization - 2 [Figure: t-SNE grid search, 87 models, ~16 hours]
  • 23. Model Visualization - 3: t-SNE visualization of the final Word2Vec model [Figures: 2D and 3D]
  • 24. NER – Named Entity Recognition. NER is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
  • 25. NER – spaCy. What is spaCy? An open-source library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products. https://spacy.io/ Features: • Fastest syntactic parser in the world • Named entity recognition • Non-destructive tokenization • Support for 20+ languages • Pre-trained statistical models and word vectors • Easy deep learning integration • … [Figure: architecture]