Deep Text Mining of Instagram Data Without Strong Supervision
WI 2018 Santiago | International Conference on Web Intelligence
Kim Hammar, Shatha Jaradat, Nima Dokoohaki, and Mihhail Matskin
KTH Royal Institute of Technology
kimham@kth.se
December 4, 2018
Key enabler for Deep Learning: Data growth
[Chart: annual size of the global datasphere, 2009–2026, rising past 150 zettabytes per year. Source: IDC]
But what about Labeled Data?
[Figure: a feed-forward neural network mapping inputs xi,j and biases bi to a prediction ŷ]
Supervised learning: iteratively minimize the loss function L(ŷ, y), where ŷ is the prediction and y is the ground truth.
Labeled Training Data is Still a Bottleneck
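As a concrete illustration of the loop, a toy logistic-regression stand-in for the network (illustrative data and learning rate), trained by gradient descent on L(ŷ, y):

```python
import numpy as np

# Toy labeled data: 4 examples with 3 features each, binary ground truth y.
X = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.], [0., 0., 1.]])
y = np.array([1., 1., 0., 0.])
w, b = np.zeros(3), 0.0

for step in range(1000):
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))                        # prediction
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # L(y_hat, y)
    grad = (y_hat - y) / len(y)            # gradient of the loss w.r.t. the logits
    w -= 0.5 * (X.T @ grad)                # gradient descent step on the weights
    b -= 0.5 * grad.sum()                  # gradient descent step on the bias
print(round(loss, 3))  # the loss shrinks as the model fits the labeled data
```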
Research Problem: Clothing Prediction on Instagram
[Figure: an Instagram post is fed to a text model and an image model, which together produce a clothing prediction vector, e.g. dress = 0, coat = 1, ..., skirt = 0]
This Paper: Text Classification Without Labeled Data
[Figure: text mining analytics built from unlabeled posts post1, ..., postn: mentions of brand “foo” over time (04.2017–03.2018), word embeddings, neural networks, trend detection, and user recommendations]
Example Instagram Post
Challenge: Noisy Text and No Labels
A case study of a corpus with 143 fashion accounts, 200K posts, and 9M comments.
Challenge 1: Noisy Text with a Long-Tail Distribution
[Figure: log-log plot of the frequency of text per post (comments, and words counted over comments + caption + tags); both follow a long-tail distribution, with spikes at posts that have 0 comments or 0 words]
Text Statistic      Fraction of corpus size   Average/post
Emojis              0.15                      48.63
Hashtags            0.03                      9.14
User-handles        0.06                      18.62
Google-OOV words    0.46                      145.02
Aspell-OOV words    0.47                      147.61
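As a rough illustration, statistics of this kind can be computed with a short script (a sketch: the emoji character range is approximate, and `vocab` stands in for the Google/Aspell word lists):

```python
import re

EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # rough emoji range (assumption)
HASHTAG = re.compile(r"#\w+")
HANDLE = re.compile(r"@\w+")

def post_stats(text, vocab):
    """Count emojis, hashtags, user-handles, and OOV words in one post's text
    (caption + tags + comments); vocab stands in for a reference word list."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    return {
        "emojis": len(EMOJI.findall(text)),
        "hashtags": len(HASHTAG.findall(text)),
        "user_handles": len(HANDLE.findall(text)),
        "oov_words": sum(1 for t in tokens if t not in vocab),
    }

print(post_stats("Happy Monday! The #baaag is goals 😍 @user",
                 vocab={"happy", "monday", "the", "is", "goals"}))
```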
Challenge 2: Labeled Training Data Is Expensive and Scarce
[Figure: raw Instagram text → human annotations]
Alternative Sources of Supervision That Are Cheap but Weak
Strong supervision: manual annotation by an expert.
Weak supervision: a signal that does not have full coverage or perfect accuracy.
[Figure: sources of weak supervision (domain heuristics, databases, APIs, crowdworkers) feeding a combiner that approximates strong supervision]
Weak Supervision in the Fashion Domain
Open APIs (e.g. Clarifai, Deepomatic, Google Cloud Vision, used later as labeling functions).
Pre-trained clothing classification models: DeepDetect1.
A text mining system based on a fashion ontology and word embeddings:
[Figure: an Instagram post p ∈ P, consisting of a caption (“Happy Monday! Here is my outfit of the day #streetstyle #me #canada #goals #chic #denim”), tags (Zalando, user1, user2), and noisy comments (“I love the bag! Is it Gucci? #goals @username”, “I #want the #baaag”, “Wow! The #jeans You are suclh an inspirationn, can you follow me back?”), is matched against a fashion ontology O (brands, items, patterns, materials, styles), ProBase, and word embeddings V. Edit distance, tfidf(wi, p, P), and a term score t ∈ {caption, comment, user-tag, hashtag} are linearly combined into ranked noisy labels r, e.g. Items: (bag, 0.63), (jeans, 0.3), (top, 0.1); Brands: (Gucci, 0.8), (Zalando, 0.3); Material: (Denim, 1.0)]
1 https://github.com/jolibrain/deepdetect
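The linear combination can be sketched as follows (a hypothetical illustration: the weights, the cosine and edit-similarity measures, and the helper names are assumptions, not the paper's exact formulation):

```python
import difflib
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv + 1e-9)

def rank_terms(post_tokens, corpus, ontology, embeddings, weights=(0.5, 0.3, 0.2)):
    """Rank ontology terms for one post by a linear combination of tf-idf,
    embedding similarity, and edit similarity (illustrative weights)."""
    a, b, c = weights
    scores = {}
    for term in ontology:
        # tf-idf of the term within this post, against the corpus of posts
        tf = post_tokens.count(term) / max(len(post_tokens), 1)
        df = sum(1 for p in corpus if term in p)
        idf = math.log(len(corpus) / (1 + df))
        # best embedding similarity between the term and any token in the post
        sim = max((cosine(embeddings[term], embeddings[t])
                   for t in post_tokens if t in embeddings),
                  default=0.0) if term in embeddings else 0.0
        # edit similarity catches misspellings such as "baaag" vs "bag"
        edit = max((difflib.SequenceMatcher(None, term, t).ratio()
                    for t in post_tokens), default=0.0)
        scores[term] = a * tf * idf + b * sim + c * edit
    return sorted(scores.items(), key=lambda kv: -kv[1])
```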
How To Combine Several Sources Of Weak Supervision?
The simplest way to combine many weak signals: majority vote.
Recent research on combining weak signals: data programming2.
2 Alexander J. Ratner et al. “Data Programming: Creating Large Training Sets, Quickly”. In: Advances in Neural Information Processing Systems 29. Ed. by D. D. Lee et al. Curran Associates, Inc., 2016, pp. 3567–3575. URL: http://papers.nips.cc/paper/6523-data-programming-creating-large-training-sets-quickly.pdf
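As a baseline, majority voting over a matrix of weak labels takes a few lines (a minimal sketch; the {−1, 0, 1} encoding follows the data programming convention, where 0 means the labeling function abstained):

```python
import numpy as np

def majority_vote(L):
    """Combine weak labels by majority vote. L is an (n_examples, n_lfs)
    matrix over {-1, 0, 1}: -1 = negative vote, 1 = positive vote,
    0 = the labeling function abstained."""
    totals = L.sum(axis=1)
    return np.where(totals > 0, 1, np.where(totals < 0, -1, 0))  # ties -> 0

L = np.array([[1, 1, 0, -1],
              [-1, -1, 1, 0]])
print(majority_vote(L))  # [ 1 -1]
```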
Model Weak Supervision With a Generative Model
[Figure: unlabeled data is passed through labeling functions λ1 ... λn, producing a matrix of weak labels; a generative model πα,β(Λ, Y) combines them into a single vector of probabilistic labels]
Model weak supervision as labeling functions λi: λi(unlabeled data) → label.
Learn a generative model πα,β(Λ, Y) over the labeling process:
Based on conflicts between labeling functions, assign each function an estimated accuracy αi.
Based on each labeling function's empirical coverage, assign it a coverage βi.
Given α and β for each labeling function, the generative model can combine the weak labels into a single probabilistic label (see the sketch below):
High-accuracy functions get more weight.
A lot of disagreement → a low-probability label.
All labeling functions agree → a high-probability label.
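A crude, self-contained approximation of this idea (data programming proper fits α and β by maximum likelihood over the label matrix; this sketch merely estimates accuracies from agreement with the current consensus and re-weights):

```python
import numpy as np

def weighted_combine(L, n_iter=10):
    """Approximate the generative model pi_{alpha,beta}(Lambda, Y):
    alternately estimate each labeling function's accuracy alpha_i from
    its agreement with the current combined label, then recombine with
    log-odds weights. L is (n_examples, n_lfs) over {-1, 0, 1}."""
    y = np.sign(L.sum(axis=1))                 # start from majority vote
    for _ in range(n_iter):
        votes = L != 0                         # coverage mask (beta_i ~ votes.mean(0))
        agree = (L == y[:, None]) & votes      # where each LF matched the consensus
        alpha = np.clip((agree.sum(0) + 1) / (votes.sum(0) + 2), 0.05, 0.95)
        w = np.log(alpha / (1 - alpha))        # accuracy -> log-odds weight
        margin = (L * w).sum(axis=1)
        y = np.sign(margin)
    return 1 / (1 + np.exp(-margin))           # probabilistic label P(y = 1)
```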
Data Programming Intuition
[Figure: several low-accuracy labeling functions vote “it is not a coat”, while fewer high-accuracy labeling functions vote “it is a coat”]
Probabilistic label: 0.6 probability that it is a coat.
Majority vote: 1.0 probability that it is not a coat.
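The numbers behind such a flip can be reproduced with assumed accuracies (illustrative values chosen to match the slide's 0.6; log-odds weighting as in the sketch above):

```python
import math

# One high-accuracy function votes "coat" (+1); two low-accuracy
# functions vote "not coat" (-1). Accuracies are assumed for illustration.
votes = [(+1, 0.80), (-1, 0.62), (-1, 0.62)]   # (vote, estimated accuracy)

margin = sum(v * math.log(a / (1 - a)) for v, a in votes)
p_coat = 1 / (1 + math.exp(-margin))
print(round(p_coat, 2))  # 0.6 -- while an unweighted majority vote says "not a coat"
```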
Extension of Data Programming to Multi-Label Classification
Problem: data programming is only defined for binary classification in the original paper.
To make it work in the multi-class setting: model a labeling function as λi → ki ∈ {0, . . . , N} instead of λi → ki ∈ {−1, 0, 1}.
Idea 1 for multi-label: model a labeling function as λi → ki = (v0, . . . , vn) with vj ∈ {−1, 0, 1}.
Idea 2 for multi-label: learn a separate generative model for each class, and let each labeling function give a binary output per class: λi,j → ki,j ∈ {−1, 0, 1} (see the sketch below).
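A minimal sketch of Idea 2 (majority vote stands in for the per-class binary combiner to keep the example self-contained; matrix shapes and class names are illustrative):

```python
import numpy as np

def combine_binary(L_c):
    """Any binary weak-label combiner works here, e.g. the weighted_combine
    sketch above; majority vote keeps this example self-contained."""
    return (L_c.sum(axis=1) > 0).astype(float)

def multi_label_combine(L_per_class):
    """Idea 2: one binary generative model per class. L_per_class maps each
    class name to an (n_examples, n_lfs) matrix over {-1, 0, 1}, where entry
    (i, j) is labeling function j's binary vote for that class."""
    return {c: combine_binary(L_c) for c, L_c in L_per_class.items()}

L_per_class = {
    "coat":  np.array([[1, 1, -1], [-1, -1, 0]]),
    "jeans": np.array([[-1, 0, 1], [1, 1, 1]]),
}
print(multi_label_combine(L_per_class))
# {'coat': array([1., 0.]), 'jeans': array([0., 1.])}
```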
Trained Generative Models: Labeling Functions' Accuracies Differ Between Classes
[Figure: predicted per-class accuracy (roughly 0.4–1.0) in the generative models for each labeling function (Clarifai, Deepomatic, DeepDetect, Google Cloud Vision, SemCluster, KeywordSyntactic, KeywordSemantic) across the classes: accessories, bags, blouses, coats, dresses, jackets, jeans, cardigans, shoes, skirts, tights, tops, trousers]
Figure: Multiple generative models can capture a different accuracy per labeling function for each class.
Putting Everything Together
1. Apply weak supervision to the unlabeled data (open APIs, pre-trained models, domain heuristics, etc.).
2. Combine the labels using majority voting or generative modelling (data programming).
3. Use the combined labels to train a discriminative model with supervised machine learning (see the sketch below).
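Step 3 can use a noise-aware objective: binary cross-entropy accepts the generative model's per-class probabilities directly as soft targets. A sketch in PyTorch (assuming a multi-label model that outputs one logit per class):

```python
import torch
import torch.nn.functional as F

def noise_aware_step(model, optimizer, x, soft_labels):
    """One training step of the discriminative model on probabilistic labels.
    soft_labels is an (n, n_classes) tensor of per-class probabilities from
    the label combiner; a 0.6-probability "coat" thus contributes a
    correspondingly softened loss rather than a hard 0/1 target."""
    optimizer.zero_grad()
    logits = model(x)
    loss = F.binary_cross_entropy_with_logits(logits, soft_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```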
Pipeline for Weakly Supervised Classification on Instagram
Problem: a multi-class, multi-label classification problem with 13 output classes (dresses, coats, blouses, jeans, ...)
[Figure: an example post (“Here is my outfit of the day #streetstyle #coat #parka #chic #winter”) is processed by labeling functions λi (SemCluster, KeyWordSyntactic, KeyWordSemantic, DeepDetect, ...), whose votes vi (e.g. jacket,jeans; jeans,coat; jeans,shoes; nil; coat,jeans; coat; coat) are combined by the generative model πα,β(Λ, Y) into labels (dress = 0, coat = 1, ..., skirt = 0) that train the discriminative model d, a CNN for text classification]
Figure: A pipeline for weakly supervised text classification of Instagram posts.
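A minimal sketch of such a discriminative model d (a Kim-style CNN over token embeddings with sigmoid outputs for the 13 classes; vocabulary size, embedding dimension, and filter sizes are assumptions, not the paper's exact hyperparameters):

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Minimal CNN for multi-label text classification: embed tokens,
    convolve with several filter widths, max-pool over time, classify."""
    def __init__(self, vocab_size=50_000, emb_dim=100, n_classes=13,
                 n_filters=100, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes])
        self.out = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, token_ids):                  # (batch, seq_len)
        e = self.embed(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        pooled = [c(e).relu().max(dim=2).values for c in self.convs]
        return self.out(torch.cat(pooled, dim=1))  # one logit per class

model = TextCNN()
logits = model(torch.randint(1, 50_000, (8, 40)))  # 8 posts, 40 tokens each
probs = torch.sigmoid(logits)                      # multi-label probabilities
```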
Data Programming Beats Majority Voting
Results
Data programming gives a 6-point F1 improvement over majority vote3, achieving an F1 score of 0.61 (on par with human performance).
Model                 Accuracy       Precision      Recall         Micro-F1       Macro-F1       Hamming Loss
CNN-DataProgramming   0.797 ± 0.01   0.566 ± 0.05   0.678 ± 0.04   0.616 ± 0.02   0.535 ± 0.01   0.195 ± 0.02
CNN-MajorityVote      0.739 ± 0.02   0.470 ± 0.06   0.686 ± 0.05   0.555 ± 0.03   0.465 ± 0.05   0.261 ± 0.03
DomainExpert          0.807          0.704          0.529          0.604          0.534          0.184
Main cause of error: data sparsity (a clothing item cannot be extracted from the text if it is never mentioned there).
3 A smaller dataset, hand-labeled by experts, was used for evaluation.
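The table's metrics can be computed with scikit-learn over binary indicator matrices (a sketch; whether the table's "Accuracy" column is subset accuracy or a per-label average is not specified here, so it is omitted):

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss, precision_score, recall_score

# y_true, y_pred: (n_examples, n_classes) binary indicator matrices
# (13 classes in the paper; 3 shown here for brevity).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

print("precision:", precision_score(y_true, y_pred, average="micro"))
print("recall:   ", recall_score(y_true, y_pred, average="micro"))
print("micro-F1: ", f1_score(y_true, y_pred, average="micro"))
print("macro-F1: ", f1_score(y_true, y_pred, average="macro"))
print("hamming:  ", hamming_loss(y_true, y_pred))
```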
Conclusion
Instagram text is just as noisy as Twitter text, has a long-tail distribution, and is multi-lingual.
In shifting data domains where accurately labeled data is rare, such as social media, weak supervision is a viable alternative.
Combining weak labels with generative modelling beats majority voting.
To extend data programming to the multi-label scenario, a collection of generative models can be used to incorporate per-class accuracy.
Thank you
All code and most of the data is open source:
https://github.com/shatha2014/FashionRec
Questions?