Deep Text Mining of Instagram Data Without Strong Supervision
WI 2018 Santiago | International Conference on Web Intelligence
Kim Hammar, Shatha Jaradat, Nima Dokoohaki, and Mihhail Matskin
KTH Royal Institute of Technology
kimham@kth.se
December 4, 2018
Key enabler for Deep Learning: Data growth
[Chart: annual size of the global datasphere, 2009–2026, rising past 150 zettabytes per year. Source: IDC]
But what about Labeled Data?
[Figure: a feed-forward neural network mapping inputs xi,j and biases bi to a prediction ŷ]
Supervised learning: iteratively minimize the loss function L(ŷ, y), where ŷ is the prediction and y is the ground truth.
Labeled Training Data is Still a Bottleneck
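As a concrete illustration of the loop, a toy logistic-regression stand-in for the network (illustrative data and learning rate), trained by gradient descent on L(ŷ, y):

```python
import numpy as np

# Toy labeled data: 4 examples with 3 features each, binary ground truth y.
X = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.], [0., 0., 1.]])
y = np.array([1., 1., 0., 0.])
w, b = np.zeros(3), 0.0

for step in range(1000):
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))                        # prediction
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # L(y_hat, y)
    grad = (y_hat - y) / len(y)            # gradient of the loss w.r.t. the logits
    w -= 0.5 * (X.T @ grad)                # gradient descent step on the weights
    b -= 0.5 * grad.sum()                  # gradient descent step on the bias
print(round(loss, 3))  # the loss shrinks as the model fits the labeled data
```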
Research Problem: Clothing Prediction on Instagram
[Figure: an Instagram post is fed to a text model and an image model, which together produce a clothing prediction vector, e.g. dress = 0, coat = 1, ..., skirt = 0]
This Paper: Text Classification Without Labeled Data
[Figure: text mining analytics built from unlabeled posts post1, ..., postn: mentions of brand “foo” over time (04.2017–03.2018), word embeddings, neural networks, trend detection, and user recommendations]
Example Instagram Post
Challenge: Noisy Text and No Labels
A case study of a corpus with 143 fashion accounts, 200K posts, and 9M comments.
Challenge 1: Noisy Text with a Long-Tail Distribution
[Figure: log-log plot of the frequency of text per post (comments, and words counted over comments + caption + tags); both follow a long-tail distribution, with spikes at posts that have 0 comments or 0 words]
Text Statistic      Fraction of corpus size   Average/post
Emojis              0.15                      48.63
Hashtags            0.03                      9.14
User-handles        0.06                      18.62
Google-OOV words    0.46                      145.02
Aspell-OOV words    0.47                      147.61
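As a rough illustration, statistics of this kind can be computed with a short script (a sketch: the emoji character range is approximate, and `vocab` stands in for the Google/Aspell word lists):

```python
import re

EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # rough emoji range (assumption)
HASHTAG = re.compile(r"#\w+")
HANDLE = re.compile(r"@\w+")

def post_stats(text, vocab):
    """Count emojis, hashtags, user-handles, and OOV words in one post's text
    (caption + tags + comments); vocab stands in for a reference word list."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    return {
        "emojis": len(EMOJI.findall(text)),
        "hashtags": len(HASHTAG.findall(text)),
        "user_handles": len(HANDLE.findall(text)),
        "oov_words": sum(1 for t in tokens if t not in vocab),
    }

print(post_stats("Happy Monday! The #baaag is goals 😍 @user",
                 vocab={"happy", "monday", "the", "is", "goals"}))
```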
Challenge 2: Labeled Training Data Is Expensive and Scarce
[Figure: raw Instagram text → human annotations]
Alternative Sources of Supervision That Are Cheap but Weak
Strong supervision: manual annotation by an expert.
Weak supervision: a signal that does not have full coverage or perfect accuracy.
[Figure: sources of weak supervision (domain heuristics, databases, APIs, crowdworkers) feeding a combiner that approximates strong supervision]
Weak Supervision in the Fashion Domain
Open APIs (e.g. Clarifai, Deepomatic, Google Cloud Vision, used later as labeling functions).
Pre-trained clothing classification models: DeepDetect1.
A text mining system based on a fashion ontology and word embeddings:
[Figure: an Instagram post p ∈ P, consisting of a caption (“Happy Monday! Here is my outfit of the day #streetstyle #me #canada #goals #chic #denim”), tags (Zalando, user1, user2), and noisy comments (“I love the bag! Is it Gucci? #goals @username”, “I #want the #baaag”, “Wow! The #jeans You are suclh an inspirationn, can you follow me back?”), is matched against a fashion ontology O (brands, items, patterns, materials, styles), ProBase, and word embeddings V. Edit distance, tfidf(wi, p, P), and a term score t ∈ {caption, comment, user-tag, hashtag} are linearly combined into ranked noisy labels r, e.g. Items: (bag, 0.63), (jeans, 0.3), (top, 0.1); Brands: (Gucci, 0.8), (Zalando, 0.3); Material: (Denim, 1.0)]
1 https://github.com/jolibrain/deepdetect
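The linear combination can be sketched as follows (a hypothetical illustration: the weights, the cosine and edit-similarity measures, and the helper names are assumptions, not the paper's exact formulation):

```python
import difflib
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv + 1e-9)

def rank_terms(post_tokens, corpus, ontology, embeddings, weights=(0.5, 0.3, 0.2)):
    """Rank ontology terms for one post by a linear combination of tf-idf,
    embedding similarity, and edit similarity (illustrative weights)."""
    a, b, c = weights
    scores = {}
    for term in ontology:
        # tf-idf of the term within this post, against the corpus of posts
        tf = post_tokens.count(term) / max(len(post_tokens), 1)
        df = sum(1 for p in corpus if term in p)
        idf = math.log(len(corpus) / (1 + df))
        # best embedding similarity between the term and any token in the post
        sim = max((cosine(embeddings[term], embeddings[t])
                   for t in post_tokens if t in embeddings),
                  default=0.0) if term in embeddings else 0.0
        # edit similarity catches misspellings such as "baaag" vs "bag"
        edit = max((difflib.SequenceMatcher(None, term, t).ratio()
                    for t in post_tokens), default=0.0)
        scores[term] = a * tf * idf + b * sim + c * edit
    return sorted(scores.items(), key=lambda kv: -kv[1])
```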
How To Combine Several Sources Of Weak Supervision?
The simplest way to combine many weak signals: majority vote.
Recent research on combining weak signals: data programming2.
2 Alexander J. Ratner et al. “Data Programming: Creating Large Training Sets, Quickly”. In: Advances in Neural Information Processing Systems 29. Ed. by D. D. Lee et al. Curran Associates, Inc., 2016, pp. 3567–3575. URL: http://papers.nips.cc/paper/6523-data-programming-creating-large-training-sets-quickly.pdf
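As a baseline, majority voting over a matrix of weak labels takes a few lines (a minimal sketch; the {−1, 0, 1} encoding follows the data programming convention, where 0 means the labeling function abstained):

```python
import numpy as np

def majority_vote(L):
    """Combine weak labels by majority vote. L is an (n_examples, n_lfs)
    matrix over {-1, 0, 1}: -1 = negative vote, 1 = positive vote,
    0 = the labeling function abstained."""
    totals = L.sum(axis=1)
    return np.where(totals > 0, 1, np.where(totals < 0, -1, 0))  # ties -> 0

L = np.array([[1, 1, 0, -1],
              [-1, -1, 1, 0]])
print(majority_vote(L))  # [ 1 -1]
```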
Model Weak Supervision With a Generative Model
[Figure: unlabeled data is passed through labeling functions λ1 ... λn, producing a matrix of weak labels; a generative model πα,β(Λ, Y) combines them into a single vector of probabilistic labels]
Model weak supervision as labeling functions λi: λi(unlabeled data) → label.
Learn a generative model πα,β(Λ, Y) over the labeling process:
Based on conflicts between labeling functions, assign each function an estimated accuracy αi.
Based on each labeling function's empirical coverage, assign it a coverage βi.
Given α and β for each labeling function, the generative model can combine the weak labels into a single probabilistic label (see the sketch below):
High-accuracy functions get more weight.
A lot of disagreement → a low-probability label.
All labeling functions agree → a high-probability label.
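A crude, self-contained approximation of this idea (data programming proper fits α and β by maximum likelihood over the label matrix; this sketch merely estimates accuracies from agreement with the current consensus and re-weights):

```python
import numpy as np

def weighted_combine(L, n_iter=10):
    """Approximate the generative model pi_{alpha,beta}(Lambda, Y):
    alternately estimate each labeling function's accuracy alpha_i from
    its agreement with the current combined label, then recombine with
    log-odds weights. L is (n_examples, n_lfs) over {-1, 0, 1}."""
    y = np.sign(L.sum(axis=1))                 # start from majority vote
    for _ in range(n_iter):
        votes = L != 0                         # coverage mask (beta_i ~ votes.mean(0))
        agree = (L == y[:, None]) & votes      # where each LF matched the consensus
        alpha = np.clip((agree.sum(0) + 1) / (votes.sum(0) + 2), 0.05, 0.95)
        w = np.log(alpha / (1 - alpha))        # accuracy -> log-odds weight
        margin = (L * w).sum(axis=1)
        y = np.sign(margin)
    return 1 / (1 + np.exp(-margin))           # probabilistic label P(y = 1)
```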
Data Programming Intuition
[Figure: several low-accuracy labeling functions vote “it is not a coat”, while fewer high-accuracy labeling functions vote “it is a coat”]
Probabilistic label: 0.6 probability that it is a coat.
Majority vote: 1.0 probability that it is not a coat.
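The numbers behind such a flip can be reproduced with assumed accuracies (illustrative values chosen to match the slide's 0.6; log-odds weighting as in the sketch above):

```python
import math

# One high-accuracy function votes "coat" (+1); two low-accuracy
# functions vote "not coat" (-1). Accuracies are assumed for illustration.
votes = [(+1, 0.80), (-1, 0.62), (-1, 0.62)]   # (vote, estimated accuracy)

margin = sum(v * math.log(a / (1 - a)) for v, a in votes)
p_coat = 1 / (1 + math.exp(-margin))
print(round(p_coat, 2))  # 0.6 -- while an unweighted majority vote says "not a coat"
```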
Extension of Data Programming to Multi-Label Classification
Problem: data programming is only defined for binary classification in the original paper.
To make it work in the multi-class setting: model a labeling function as λi → ki ∈ {0, . . . , N} instead of λi → ki ∈ {−1, 0, 1}.
Idea 1 for multi-label: model a labeling function as λi → ki = (v0, . . . , vn) with vj ∈ {−1, 0, 1}.
Idea 2 for multi-label: learn a separate generative model for each class, and let each labeling function give a binary output per class: λi,j → ki,j ∈ {−1, 0, 1} (see the sketch below).
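A minimal sketch of Idea 2 (majority vote stands in for the per-class binary combiner to keep the example self-contained; matrix shapes and class names are illustrative):

```python
import numpy as np

def combine_binary(L_c):
    """Any binary weak-label combiner works here, e.g. the weighted_combine
    sketch above; majority vote keeps this example self-contained."""
    return (L_c.sum(axis=1) > 0).astype(float)

def multi_label_combine(L_per_class):
    """Idea 2: one binary generative model per class. L_per_class maps each
    class name to an (n_examples, n_lfs) matrix over {-1, 0, 1}, where entry
    (i, j) is labeling function j's binary vote for that class."""
    return {c: combine_binary(L_c) for c, L_c in L_per_class.items()}

L_per_class = {
    "coat":  np.array([[1, 1, -1], [-1, -1, 0]]),
    "jeans": np.array([[-1, 0, 1], [1, 1, 1]]),
}
print(multi_label_combine(L_per_class))
# {'coat': array([1., 0.]), 'jeans': array([0., 1.])}
```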
Trained Generative Models: Labeling Functions' Accuracies Differ Between Classes
[Figure: predicted per-class accuracy (roughly 0.4–1.0) in the generative models for each labeling function (Clarifai, Deepomatic, DeepDetect, Google Cloud Vision, SemCluster, KeywordSyntactic, KeywordSemantic) across the classes: accessories, bags, blouses, coats, dresses, jackets, jeans, cardigans, shoes, skirts, tights, tops, trousers]
Figure: Multiple generative models can capture a different accuracy per labeling function for each class.
Putting Everything Together
1. Apply weak supervision to the unlabeled data (open APIs, pre-trained models, domain heuristics, etc.).
2. Combine the labels using majority voting or generative modelling (data programming).
3. Use the combined labels to train a discriminative model with supervised machine learning (see the sketch below).
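Step 3 can use a noise-aware objective: binary cross-entropy accepts the generative model's per-class probabilities directly as soft targets. A sketch in PyTorch (assuming a multi-label model that outputs one logit per class):

```python
import torch
import torch.nn.functional as F

def noise_aware_step(model, optimizer, x, soft_labels):
    """One training step of the discriminative model on probabilistic labels.
    soft_labels is an (n, n_classes) tensor of per-class probabilities from
    the label combiner; a 0.6-probability "coat" thus contributes a
    correspondingly softened loss rather than a hard 0/1 target."""
    optimizer.zero_grad()
    logits = model(x)
    loss = F.binary_cross_entropy_with_logits(logits, soft_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```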
Pipeline for Weakly Supervised Classification on Instagram
Problem: a multi-class, multi-label classification problem with 13 output classes (dresses, coats, blouses, jeans, ...)
[Figure: an example post (“Here is my outfit of the day #streetstyle #coat #parka #chic #winter”) is processed by labeling functions λi (SemCluster, KeyWordSyntactic, KeyWordSemantic, DeepDetect, ...), whose votes vi (e.g. jacket,jeans; jeans,coat; jeans,shoes; nil; coat,jeans; coat; coat) are combined by the generative model πα,β(Λ, Y) into labels (dress = 0, coat = 1, ..., skirt = 0) that train the discriminative model d, a CNN for text classification]
Figure: A pipeline for weakly supervised text classification of Instagram posts.
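A minimal sketch of such a discriminative model d (a Kim-style CNN over token embeddings with sigmoid outputs for the 13 classes; vocabulary size, embedding dimension, and filter sizes are assumptions, not the paper's exact hyperparameters):

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Minimal CNN for multi-label text classification: embed tokens,
    convolve with several filter widths, max-pool over time, classify."""
    def __init__(self, vocab_size=50_000, emb_dim=100, n_classes=13,
                 n_filters=100, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes])
        self.out = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, token_ids):                  # (batch, seq_len)
        e = self.embed(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        pooled = [c(e).relu().max(dim=2).values for c in self.convs]
        return self.out(torch.cat(pooled, dim=1))  # one logit per class

model = TextCNN()
logits = model(torch.randint(1, 50_000, (8, 40)))  # 8 posts, 40 tokens each
probs = torch.sigmoid(logits)                      # multi-label probabilities
```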
Data Programming Beats Majority Voting
Results
Data programming gives a 6-point F1 improvement over majority vote3, achieving an F1 score of 0.61 (on par with human performance).
Model                 Accuracy       Precision      Recall         Micro-F1       Macro-F1       Hamming Loss
CNN-DataProgramming   0.797 ± 0.01   0.566 ± 0.05   0.678 ± 0.04   0.616 ± 0.02   0.535 ± 0.01   0.195 ± 0.02
CNN-MajorityVote      0.739 ± 0.02   0.470 ± 0.06   0.686 ± 0.05   0.555 ± 0.03   0.465 ± 0.05   0.261 ± 0.03
DomainExpert          0.807          0.704          0.529          0.604          0.534          0.184
Main cause of error: data sparsity (a clothing item cannot be extracted from the text if it is never mentioned there).
3 A smaller dataset, hand-labeled by experts, was used for evaluation.
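The table's metrics can be computed with scikit-learn over binary indicator matrices (a sketch; whether the table's "Accuracy" column is subset accuracy or a per-label average is not specified here, so it is omitted):

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss, precision_score, recall_score

# y_true, y_pred: (n_examples, n_classes) binary indicator matrices
# (13 classes in the paper; 3 shown here for brevity).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

print("precision:", precision_score(y_true, y_pred, average="micro"))
print("recall:   ", recall_score(y_true, y_pred, average="micro"))
print("micro-F1: ", f1_score(y_true, y_pred, average="micro"))
print("macro-F1: ", f1_score(y_true, y_pred, average="macro"))
print("hamming:  ", hamming_loss(y_true, y_pred))
```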
Conclusion
Instagram text is just as noisy as Twitter text, has a long-tail distribution, and is multi-lingual.
In shifting data domains where accurately labeled data is rare, such as social media, weak supervision is a viable alternative.
Combining weak labels with generative modelling beats majority voting.
To extend data programming to the multi-label scenario, a collection of generative models can be used to incorporate per-class accuracy.
Thank you
All code and most of the data is open source:
https://github.com/shatha2014/FashionRec
Questions?