SlideShare a Scribd company logo
1 of 25
Download to read offline
Deep Text Mining of Instagram Data Without Strong
Supervision
WI 2018 Santiago | International Conference on Web intelligence
Kim Hammar, Shatha Jaradat, Nima Dokoohaki, and Mihhail Matskin
KTH Royal Institute of Technology
kimham@kth.se
December 4, 2018
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 1 / 19
Key enabler for Deep Learning: Data growth
2009 2012 2015 2017 2020 2023 2026
0
50
100
150
Year
Zettabytes
Annual Size of the Global Datasphere. Source: IDC
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 2 / 19
Key enabler for Deep Learning: Data growth
2009 2012 2015 2017 2020 2023 2026
0
50
100
150
Year
Zettabytes
Annual Size of the Global Datasphere. Source: IDC
But what about Labeled Data?
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 2 / 19
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
Supervised learning: Iteratively Minimize The Loss Function: L(ˆy, y)
Prediction
Ground truth
Labeled Training Data is Still a Bottleneck
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 3 / 19
Research Problem: Clothing Prediction on Instagram
b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
Text Model






dress = 0
coat = 1
...
skirt = 0






Image Model Clothing Prediction
Instagram Post
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 4 / 19
This Paper: Text Classification Without Labeled Data
post1
post2
post3
postn
04.2017
05.2017
06.2017
07.2017
08.2017
09.2017
10.2017
11.2017
12.2017
01.2018
02.2018
03.2018
0
10
20
30
Mentions
Mention of brand “foo” over time
Text Mining Analytics





w1,1 . . . w1,n
... ... ...
wn,1 . . . wn,n





b0
x0,1
x0,2
x0,3
b1
x1,1
x1,2
x1,3
ˆy
Word EmbeddingsNeural Networks Trends detection User recommendations
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 5 / 19
Example Instagram Post
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 6 / 19
Challenge: Noisy Text and No Labels
A case study of a corpora with 143 fashion accounts, 200K posts, 9M comments
Challenge 1: Noisy Text with a Long-Tail Distribution
100
101
102
103
104
105
Log count
100
101
102
103
104
Logfrequency
Posts with
0 comments
Posts with
0 words
(comments+caption+tags)
Log-Log plot over the frequency of text per post
Comments
Words
Text Statistic Fraction of corpora size Average/post
Emojis 0.15 48.63
Hashtags 0.03 9.14
User-handles 0.06 18.62
Google-OOV words 0.46 145.02
Aspell-OOV words 0.47 147.61
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 7 / 19
Challenge: Noisy Text and No Labels
A case study of a corpora with 143 fashion accounts, 200K posts, 9M comments
Challenge 1: Noisy Text with a Long-Tail Distribution
100
101
102
103
104
105
Log count
100
101
102
103
104
Logfrequency
Posts with
0 comments
Posts with
0 words
(comments+caption+tags)
Log-Log plot over the frequency of text per post
Comments
Words
Text Statistic Fraction of corpora size Average/post
Emojis 0.15 48.63
Hashtags 0.03 9.14
User-handles 0.06 18.62
Google-OOV words 0.46 145.02
Aspell-OOV words 0.47 147.61
Challenge 2: Lack of Expensive Labeled Training Data
Raw Instagram Text Human Annotations
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 7 / 19
Alternative Sources of Supervision That Are Cheap but
Weak
Strong supervision:
Manual annotation by
expert
Weak supervision: A
signal that does not
have full
coverage/perfect
accuracy
Sources of Weak Supervision
Domain Heuristics
Database
APIs
Crowdworkers
Combiner Strong supervision
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 8 / 19
Weak Supervision in the Fashion Domain
Open APIs:
1
https://github.com/jolibrain/deepdetect
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 9 / 19
Weak Supervision in the Fashion Domain
Open APIs:
Pre-trained Clothing Classificiation Models:
DeepDetect1
1
https://github.com/jolibrain/deepdetect
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 9 / 19
Weak Supervision in the Fashion Domain
Open APIs:
Pre-trained Clothing Classificiation Models:
DeepDetect1
Text mining system based on a fashion ontology and word embeddings:
Happy Monday! Here is my outfit of
the day #streetstyle #me #canada #goals
#chic #denim
Caption
Zalando user1 user2
Tags
I love the bag! Is it Gucci?
#goals @username
I #want the #baaag
Wow! The #jeans You are suclh
an inspirationn, can you follow me back?
Comments
Ontology O
Brands
Items
Patterns
Materials
Styles
Instagram Post p ∈ P
ProBase
Word Rankings




w1,1 . . . w1,n
...
...
...
wn,1 . . . wn,n




Word Embeddings V
Edit-distance
tfidf (wi , p, P)
term-score t ∈
{caption, comment,
user-tag, hashtag}
Linear
Combination
Items: (bag, 0.63),
(jeans, 0.3), (top, 0.1)
Brands:
(Gucci, 0.8), (Zalando, 0.3)
Material: (Denim, 1.0)
...
Ranked Noisy Labels r
1
https://github.com/jolibrain/deepdetect
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 9 / 19
How To Combine Several Sources Of Weak Supervision?
Simplest way to combine many weak signals: Majority Vote
Recent research on combination of weak signals: Data
Programming2
2
Alexander J Ratner et al. “Data Programming: Creating Large Training Sets, Quickly”. In: Advances in Neural
Information Processing Systems 29. Ed. by D. D. Lee et al. Curran Associates, Inc., 2016, pp. 3567–3575. URL:
http://papers.nips.cc/paper/6523-data-programming-creating-large-training-sets-quickly.pdf.
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 10 / 19
Model Weak Supervision With Generative Model
unlabeled
data
Labeling functions
λ1 . . . λn
Weak labels



w1,1 . . . w1,n
...
...
...
wn,1 . . . wn,n




Generative Model
πα,β(Λ, Y )
Combined labels



w1
...
wn




Model weak supervision as labeling functions λi
λi (unlabeled data) → label
Learn Generative Model πα,β(Λ, Y ) over the labeling process.
Based on conflicts between labeling functions assign the functions an
estimated accuracy αi .
Based on empirical coverage of labeling functions assign the functions
a coverage βi .
Given α and β for each labeling function, it can be used to
combine labels into a single probabilistic label
Give more weight to high-accuracy functions
If there is a lot of disagreement→ low probability label
If all labeling functions agree → high probability label
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 11 / 19
Data Programming Intuition
Low accuracy labeling functions High accuracy labeling functions
“it is a coat”
“it is not a coat”
Probabilistic Label: 0.6 probability that it is a coat
Majority Vote: 1.0 probability that it is not a coat
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 12 / 19
Extension of Data Programming to Multi-Label
Classification
Problem: Data programming only defined for binary
classification in original paper
To make it work for multi-class setting: model labeling function as
λi → ki ∈ {0, . . . , N} instead of λi → ki ∈ {−1, 0, 1}.
Idea 1 for multi-label: model labeling function as
λi → ki = {v0, . . . , vn} ∧ vj ∈ {−1, 0, 1}
Idea 2 for multi-label: learn a separate generative model for each
class, and let each labeling function give binary output for each class
λi,j → ki,j ∈ {−1, 0, 1}.
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 13 / 19
Trained Generative Models: Labeling Functions’ Accuracy
Differ Between Classes
accessories
bags
blouses
coats
dresses
jackets
jeans
cardigans
shoes
skirts
tights
tops
trousers
Classes
0.4
0.6
0.8
1.0
Accuracy
Predicted accuracy in generative model
Clarifai
Deepomatic
DeepDetect
Google Cloud Vision
SemCluster
KeywordSyntactic
KeywordSemantic
Figure: Multiple generative models can capture a different accuracy for labeling
functions for different classes.
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 14 / 19
Putting Everything Together
1 Apply weak supervision to unlabeled data (open APIs, pre-trained
models, domain heuristics etc.)
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 15 / 19
Putting Everything Together
1 Apply weak supervision to unlabeled data (open APIs, pre-trained
models, domain heuristics etc.)
2 Combine labels using majority voting or generative modelling (data
programming)
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 15 / 19
Putting Everything Together
1 Apply weak supervision to unlabeled data (open APIs, pre-trained
models, domain heuristics etc.)
2 Combine labels using majority voting or generative modelling (data
programming)
3 Use the combined labels for training a discriminative model using
supevised machine learning.
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 15 / 19
Pipeline for Weakly Supervised Classification in Instagram
Problem: A Multi-class Multi-label classification problem with 13 output
classes (dresses, coats, blouses, jeans, ...)
Here
is my
out-
fit of
the day
#street-
style
#coat
#parka
#chic
#win-
ter
Labeling Functions λi
SemCluster
KeyWordSyntactic
KeyWordSemantic
DeepDetect






dress = 0
coat = 1
...
skirt = 0






Votes vi
jacket,jeans
jeans,coat
jeans,shoes
nil
coat,jeans
coat
coat
Generative Model πα,β(Λ, Y )
λ1
λ2
λ3
λ4
λ5
λ6
λ7
v1
v2
v3
v4
v5
v6
v7
v8
v9
v10
v11
v12
v13
Discriminative Model d
CNN for Text classification
Figure: A pipeline for weakly supervised text classification of Instagram posts.
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 16 / 19
Data Programming Beats Majority Voting
Results
Data programming gives 6 F1 points improvement over majority
vote3, achieving an F1 score of 0.61 (On level with human
performance)
Model Accuracy Precision Recall Micro-F1 Macro-F1 Hamming Loss
CNN-DataProgramming 0.797 ± 0.01 0.566 ± 0.05 0.678 ± 0.04 0.616 ± 0.02 0.535 ± 0.01 0.195 ± 0.02
CNN-MajorityVote 0.739 ± 0.02 0.470 ± 0.06 0.686 ± 0.05 0.555 ± 0.03 0.465 ± 0.05 0.261 ± 0.03
DomainExpert 0.807 0.704 0.529 0.604 0.534 0.184
Main cause of error: data sparsity (can not extract clothing
items from the text if it is never mentioned in the text)
3
A smaller, hand-labeled dataset by experts was used for evaluation
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 17 / 19
Conclusion
Instagram text is jus as noisy as Twitter, has a long-tail distribution,
and is multi-lingual
In shifting data domains where accurate labeled data is a rarity, like
social media, weak supervision is a viable alternative.
Combining weak labels with generative modeling beats majority voting.
To extend Data programming to the multi-label scenario, a collection
of generative models can be used to incorporate per-class accuracy.
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 18 / 19
Thank you
All code and most of the data is open source:
https://github.com/shatha2014/FashionRec
Questions?
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 19 / 19

More Related Content

Similar to Kim Hammar - Paper presentation WI 2018 Santiago

OSINT using Twitter & Python
OSINT using Twitter & PythonOSINT using Twitter & Python
OSINT using Twitter & Python37point2
 
Data Cloud - Yury Lifshits - Yahoo! Research
Data Cloud - Yury Lifshits - Yahoo! ResearchData Cloud - Yury Lifshits - Yahoo! Research
Data Cloud - Yury Lifshits - Yahoo! ResearchYury Lifshits
 
Text analytics on social media
Text analytics on social mediaText analytics on social media
Text analytics on social mediaVenkatramanan P.R.
 
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713Mathieu DESPRIEE
 
Fairness, Transparency, and Privacy in AI @LinkedIn
Fairness, Transparency, and Privacy in AI @LinkedInFairness, Transparency, and Privacy in AI @LinkedIn
Fairness, Transparency, and Privacy in AI @LinkedInC4Media
 
Big data insights part i
Big data insights   part iBig data insights   part i
Big data insights part iRaji Gogulapati
 
Big Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloBig Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloOCTO Technology
 
Minne analytics presentation 2018 12 03 final compressed
Minne analytics presentation 2018 12 03 final   compressedMinne analytics presentation 2018 12 03 final   compressed
Minne analytics presentation 2018 12 03 final compressedBonnie Holub
 
Let's Talk: fundamentals of conversational design
Let's Talk: fundamentals of conversational designLet's Talk: fundamentals of conversational design
Let's Talk: fundamentals of conversational designNikita Lukianets
 
Wall Street Mastermind Sector Spotlight - Technology (October 2023).pdf
Wall Street Mastermind Sector Spotlight - Technology (October 2023).pdfWall Street Mastermind Sector Spotlight - Technology (October 2023).pdf
Wall Street Mastermind Sector Spotlight - Technology (October 2023).pdfSamShiah1
 
Legitimate Business – Marketing/Comms #1: Digital Strategy
Legitimate Business – Marketing/Comms #1: Digital StrategyLegitimate Business – Marketing/Comms #1: Digital Strategy
Legitimate Business – Marketing/Comms #1: Digital StrategyBen Bland
 
Minne analytics presentation 2018 12 03 final compressed
Minne analytics presentation 2018 12 03 final   compressedMinne analytics presentation 2018 12 03 final   compressed
Minne analytics presentation 2018 12 03 final compressedBonnie Holub
 
Open power meetup at h2o 20180325 v3
Open power meetup at h2o 20180325 v3Open power meetup at h2o 20180325 v3
Open power meetup at h2o 20180325 v3ISSIP
 
Norway 20190312 v3
Norway 20190312 v3Norway 20190312 v3
Norway 20190312 v3ISSIP
 
Portfolio_JenniHawk
Portfolio_JenniHawkPortfolio_JenniHawk
Portfolio_JenniHawkJenniHawk
 
Sweden future of ai 20180921 v7
Sweden future of ai 20180921 v7Sweden future of ai 20180921 v7
Sweden future of ai 20180921 v7ISSIP
 
Ntegra 20180523 v10 copy.pptx
Ntegra 20180523 v10 copy.pptxNtegra 20180523 v10 copy.pptx
Ntegra 20180523 v10 copy.pptxISSIP
 
Kazakhstan digital media_at_svl 20191018 v5
Kazakhstan digital media_at_svl 20191018 v5Kazakhstan digital media_at_svl 20191018 v5
Kazakhstan digital media_at_svl 20191018 v5ISSIP
 

Similar to Kim Hammar - Paper presentation WI 2018 Santiago (20)

OSINT using Twitter & Python
OSINT using Twitter & PythonOSINT using Twitter & Python
OSINT using Twitter & Python
 
Data Cloud - Yury Lifshits - Yahoo! Research
Data Cloud - Yury Lifshits - Yahoo! ResearchData Cloud - Yury Lifshits - Yahoo! Research
Data Cloud - Yury Lifshits - Yahoo! Research
 
Text analytics on social media
Text analytics on social mediaText analytics on social media
Text analytics on social media
 
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
 
Fairness, Transparency, and Privacy in AI @LinkedIn
Fairness, Transparency, and Privacy in AI @LinkedInFairness, Transparency, and Privacy in AI @LinkedIn
Fairness, Transparency, and Privacy in AI @LinkedIn
 
Big data insights part i
Big data insights   part iBig data insights   part i
Big data insights part i
 
Big Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloBig Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao Paulo
 
Minne analytics presentation 2018 12 03 final compressed
Minne analytics presentation 2018 12 03 final   compressedMinne analytics presentation 2018 12 03 final   compressed
Minne analytics presentation 2018 12 03 final compressed
 
Social Technology
Social TechnologySocial Technology
Social Technology
 
Let's Talk: fundamentals of conversational design
Let's Talk: fundamentals of conversational designLet's Talk: fundamentals of conversational design
Let's Talk: fundamentals of conversational design
 
Wall Street Mastermind Sector Spotlight - Technology (October 2023).pdf
Wall Street Mastermind Sector Spotlight - Technology (October 2023).pdfWall Street Mastermind Sector Spotlight - Technology (October 2023).pdf
Wall Street Mastermind Sector Spotlight - Technology (October 2023).pdf
 
Legitimate Business – Marketing/Comms #1: Digital Strategy
Legitimate Business – Marketing/Comms #1: Digital StrategyLegitimate Business – Marketing/Comms #1: Digital Strategy
Legitimate Business – Marketing/Comms #1: Digital Strategy
 
Minne analytics presentation 2018 12 03 final compressed
Minne analytics presentation 2018 12 03 final   compressedMinne analytics presentation 2018 12 03 final   compressed
Minne analytics presentation 2018 12 03 final compressed
 
Open power meetup at h2o 20180325 v3
Open power meetup at h2o 20180325 v3Open power meetup at h2o 20180325 v3
Open power meetup at h2o 20180325 v3
 
Norway 20190312 v3
Norway 20190312 v3Norway 20190312 v3
Norway 20190312 v3
 
Portfolio_JenniHawk
Portfolio_JenniHawkPortfolio_JenniHawk
Portfolio_JenniHawk
 
Sweden future of ai 20180921 v7
Sweden future of ai 20180921 v7Sweden future of ai 20180921 v7
Sweden future of ai 20180921 v7
 
BD-ACA week7a
BD-ACA week7aBD-ACA week7a
BD-ACA week7a
 
Ntegra 20180523 v10 copy.pptx
Ntegra 20180523 v10 copy.pptxNtegra 20180523 v10 copy.pptx
Ntegra 20180523 v10 copy.pptx
 
Kazakhstan digital media_at_svl 20191018 v5
Kazakhstan digital media_at_svl 20191018 v5Kazakhstan digital media_at_svl 20191018 v5
Kazakhstan digital media_at_svl 20191018 v5
 

More from Kim Hammar

Automated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jecturesAutomated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jecturesKim Hammar
 
Självlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTHSjälvlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTHKim Hammar
 
Learning Automated Intrusion Response
Learning Automated Intrusion ResponseLearning Automated Intrusion Response
Learning Automated Intrusion ResponseKim Hammar
 
Intrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback ControlIntrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback ControlKim Hammar
 
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...Kim Hammar
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Kim Hammar
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Kim Hammar
 
Learning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via DecompositionLearning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via DecompositionKim Hammar
 
Digital Twins for Security Automation
Digital Twins for Security AutomationDigital Twins for Security Automation
Digital Twins for Security AutomationKim Hammar
 
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...Kim Hammar
 
Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.Kim Hammar
 
Intrusion Response through Optimal Stopping
Intrusion Response through Optimal StoppingIntrusion Response through Optimal Stopping
Intrusion Response through Optimal StoppingKim Hammar
 
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...Kim Hammar
 
Self-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber DefenseSelf-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber DefenseKim Hammar
 
Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.Kim Hammar
 
Learning Security Strategies through Game Play and Optimal Stopping
Learning Security Strategies through Game Play and Optimal StoppingLearning Security Strategies through Game Play and Optimal Stopping
Learning Security Strategies through Game Play and Optimal StoppingKim Hammar
 
Intrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal StoppingIntrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal StoppingKim Hammar
 
Intrusion Prevention through Optimal Stopping and Self-Play
Intrusion Prevention through Optimal Stopping and Self-PlayIntrusion Prevention through Optimal Stopping and Self-Play
Intrusion Prevention through Optimal Stopping and Self-PlayKim Hammar
 
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.Kim Hammar
 
Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.Kim Hammar
 

More from Kim Hammar (20)

Automated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jecturesAutomated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jectures
 
Självlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTHSjälvlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTH
 
Learning Automated Intrusion Response
Learning Automated Intrusion ResponseLearning Automated Intrusion Response
Learning Automated Intrusion Response
 
Intrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback ControlIntrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback Control
 
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
 
Learning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via DecompositionLearning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via Decomposition
 
Digital Twins for Security Automation
Digital Twins for Security AutomationDigital Twins for Security Automation
Digital Twins for Security Automation
 
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
 
Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.
 
Intrusion Response through Optimal Stopping
Intrusion Response through Optimal StoppingIntrusion Response through Optimal Stopping
Intrusion Response through Optimal Stopping
 
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
 
Self-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber DefenseSelf-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber Defense
 
Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.
 
Learning Security Strategies through Game Play and Optimal Stopping
Learning Security Strategies through Game Play and Optimal StoppingLearning Security Strategies through Game Play and Optimal Stopping
Learning Security Strategies through Game Play and Optimal Stopping
 
Intrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal StoppingIntrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal Stopping
 
Intrusion Prevention through Optimal Stopping and Self-Play
Intrusion Prevention through Optimal Stopping and Self-PlayIntrusion Prevention through Optimal Stopping and Self-Play
Intrusion Prevention through Optimal Stopping and Self-Play
 
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
 
Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.
 

Recently uploaded

Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...kumargunjan9515
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numberssuginr1
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...gajnagarg
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...HyderabadDolls
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...SOFTTECHHUB
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...gajnagarg
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制vexqp
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themeitharjee
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样wsppdmt
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...nirzagarg
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...gajnagarg
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxchadhar227
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...nirzagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...Health
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 

Recently uploaded (20)

Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 

Kim Hammar - Paper presentation WI 2018 Santiago

  • 1. Deep Text Mining of Instagram Data Without Strong Supervision WI 2018 Santiago | International Conference on Web intelligence Kim Hammar, Shatha Jaradat, Nima Dokoohaki, and Mihhail Matskin KTH Royal Institute of Technology kimham@kth.se December 4, 2018 Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 1 / 19
  • 2. Key enabler for Deep Learning: Data growth 2009 2012 2015 2017 2020 2023 2026 0 50 100 150 Year Zettabytes Annual Size of the Global Datasphere. Source: IDC Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 2 / 19
  • 3. Key enabler for Deep Learning: Data growth 2009 2012 2015 2017 2020 2023 2026 0 50 100 150 Year Zettabytes Annual Size of the Global Datasphere. Source: IDC But what about Labeled Data? Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 2 / 19
  • 4. b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy Supervised learning: Iteratively Minimize The Loss Function: L(ˆy, y) Prediction Ground truth Labeled Training Data is Still a Bottleneck Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 3 / 19
  • 5. Research Problem: Clothing Prediction on Instagram b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy Text Model       dress = 0 coat = 1 ... skirt = 0       Image Model Clothing Prediction Instagram Post Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 4 / 19
  • 6. This Paper: Text Classification Without Labeled Data post1 post2 post3 postn 04.2017 05.2017 06.2017 07.2017 08.2017 09.2017 10.2017 11.2017 12.2017 01.2018 02.2018 03.2018 0 10 20 30 Mentions Mention of brand “foo” over time Text Mining Analytics      w1,1 . . . w1,n ... ... ... wn,1 . . . wn,n      b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy Word EmbeddingsNeural Networks Trends detection User recommendations Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 5 / 19
  • 7. Example Instagram Post Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 6 / 19
  • 8. Challenge: Noisy Text and No Labels A case study of a corpora with 143 fashion accounts, 200K posts, 9M comments Challenge 1: Noisy Text with a Long-Tail Distribution 100 101 102 103 104 105 Log count 100 101 102 103 104 Logfrequency Posts with 0 comments Posts with 0 words (comments+caption+tags) Log-Log plot over the frequency of text per post Comments Words Text Statistic Fraction of corpora size Average/post Emojis 0.15 48.63 Hashtags 0.03 9.14 User-handles 0.06 18.62 Google-OOV words 0.46 145.02 Aspell-OOV words 0.47 147.61 Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 7 / 19
  • 9. Challenge: Noisy Text and No Labels A case study of a corpora with 143 fashion accounts, 200K posts, 9M comments Challenge 1: Noisy Text with a Long-Tail Distribution 100 101 102 103 104 105 Log count 100 101 102 103 104 Logfrequency Posts with 0 comments Posts with 0 words (comments+caption+tags) Log-Log plot over the frequency of text per post Comments Words Text Statistic Fraction of corpora size Average/post Emojis 0.15 48.63 Hashtags 0.03 9.14 User-handles 0.06 18.62 Google-OOV words 0.46 145.02 Aspell-OOV words 0.47 147.61 Challenge 2: Lack of Expensive Labeled Training Data Raw Instagram Text Human Annotations Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 7 / 19
  • 10. Alternative Sources of Supervision That Are Cheap but Weak Strong supervision: Manual annotation by expert Weak supervision: A signal that does not have full coverage/perfect accuracy Sources of Weak Supervision Domain Heuristics Database APIs Crowdworkers Combiner Strong supervision Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 8 / 19
  • 11. Weak Supervision in the Fashion Domain Open APIs: 1 https://github.com/jolibrain/deepdetect Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 9 / 19
  • 12. Weak Supervision in the Fashion Domain Open APIs: Pre-trained Clothing Classificiation Models: DeepDetect1 1 https://github.com/jolibrain/deepdetect Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 9 / 19
  • 13. Weak Supervision in the Fashion Domain Open APIs: Pre-trained Clothing Classificiation Models: DeepDetect1 Text mining system based on a fashion ontology and word embeddings: Happy Monday! Here is my outfit of the day #streetstyle #me #canada #goals #chic #denim Caption Zalando user1 user2 Tags I love the bag! Is it Gucci? #goals @username I #want the #baaag Wow! The #jeans You are suclh an inspirationn, can you follow me back? Comments Ontology O Brands Items Patterns Materials Styles Instagram Post p ∈ P ProBase Word Rankings     w1,1 . . . w1,n ... ... ... wn,1 . . . wn,n     Word Embeddings V Edit-distance tfidf (wi , p, P) term-score t ∈ {caption, comment, user-tag, hashtag} Linear Combination Items: (bag, 0.63), (jeans, 0.3), (top, 0.1) Brands: (Gucci, 0.8), (Zalando, 0.3) Material: (Denim, 1.0) ... Ranked Noisy Labels r 1 https://github.com/jolibrain/deepdetect Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 9 / 19
  • 14. How To Combine Several Sources Of Weak Supervision? Simplest way to combine many weak signals: Majority Vote Recent research on combination of weak signals: Data Programming2 2 Alexander J Ratner et al. “Data Programming: Creating Large Training Sets, Quickly”. In: Advances in Neural Information Processing Systems 29. Ed. by D. D. Lee et al. Curran Associates, Inc., 2016, pp. 3567–3575. URL: http://papers.nips.cc/paper/6523-data-programming-creating-large-training-sets-quickly.pdf. Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 10 / 19
  • 15. Model Weak Supervision With Generative Model unlabeled data Labeling functions λ1 . . . λn Weak labels    w1,1 . . . w1,n ... ... ... wn,1 . . . wn,n     Generative Model πα,β(Λ, Y ) Combined labels    w1 ... wn     Model weak supervision as labeling functions λi λi (unlabeled data) → label Learn Generative Model πα,β(Λ, Y ) over the labeling process. Based on conflicts between labeling functions assign the functions an estimated accuracy αi . Based on empirical coverage of labeling functions assign the functions a coverage βi . Given α and β for each labeling function, it can be used to combine labels into a single probabilistic label Give more weight to high-accuracy functions If there is a lot of disagreement→ low probability label If all labeling functions agree → high probability label Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 11 / 19
  • 16. Data Programming Intuition Low accuracy labeling functions High accuracy labeling functions “it is a coat” “it is not a coat” Probabilistic Label: 0.6 probability that it is a coat Majority Vote: 1.0 probability that it is not a coat Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 12 / 19
  • 17. Extension of Data Programming to Multi-Label Classification Problem: Data programming only defined for binary classification in original paper To make it work for multi-class setting: model labeling function as λi → ki ∈ {0, . . . , N} instead of λi → ki ∈ {−1, 0, 1}. Idea 1 for multi-label: model labeling function as λi → ki = {v0, . . . , vn} ∧ vj ∈ {−1, 0, 1} Idea 2 for multi-label: learn a separate generative model for each class, and let each labeling function give binary output for each class λi,j → ki,j ∈ {−1, 0, 1}. Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 13 / 19
  • 18. Trained Generative Models: Labeling Functions’ Accuracy Differ Between Classes accessories bags blouses coats dresses jackets jeans cardigans shoes skirts tights tops trousers Classes 0.4 0.6 0.8 1.0 Accuracy Predicted accuracy in generative model Clarifai Deepomatic DeepDetect Google Cloud Vision SemCluster KeywordSyntactic KeywordSemantic Figure: Multiple generative models can capture a different accuracy for labeling functions for different classes. Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 14 / 19
  • 19. Putting Everything Together 1 Apply weak supervision to unlabeled data (open APIs, pre-trained models, domain heuristics etc.) Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 15 / 19
  • 20. Putting Everything Together 1 Apply weak supervision to unlabeled data (open APIs, pre-trained models, domain heuristics etc.) 2 Combine labels using majority voting or generative modelling (data programming) Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 15 / 19
  • 21. Putting Everything Together 1 Apply weak supervision to unlabeled data (open APIs, pre-trained models, domain heuristics etc.) 2 Combine labels using majority voting or generative modelling (data programming) 3 Use the combined labels for training a discriminative model using supevised machine learning. Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 15 / 19
  • 22. Pipeline for Weakly Supervised Classification in Instagram Problem: A Multi-class Multi-label classification problem with 13 output classes (dresses, coats, blouses, jeans, ...) Here is my out- fit of the day #street- style #coat #parka #chic #win- ter Labeling Functions λi SemCluster KeyWordSyntactic KeyWordSemantic DeepDetect       dress = 0 coat = 1 ... skirt = 0       Votes vi jacket,jeans jeans,coat jeans,shoes nil coat,jeans coat coat Generative Model πα,β(Λ, Y ) λ1 λ2 λ3 λ4 λ5 λ6 λ7 v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 Discriminative Model d CNN for Text classification Figure: A pipeline for weakly supervised text classification of Instagram posts. Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 16 / 19
  • 23. Data Programming Beats Majority Voting Results Data programming gives 6 F1 points improvement over majority vote3, achieving an F1 score of 0.61 (On level with human performance) Model Accuracy Precision Recall Micro-F1 Macro-F1 Hamming Loss CNN-DataProgramming 0.797 ± 0.01 0.566 ± 0.05 0.678 ± 0.04 0.616 ± 0.02 0.535 ± 0.01 0.195 ± 0.02 CNN-MajorityVote 0.739 ± 0.02 0.470 ± 0.06 0.686 ± 0.05 0.555 ± 0.03 0.465 ± 0.05 0.261 ± 0.03 DomainExpert 0.807 0.704 0.529 0.604 0.534 0.184 Main cause of error: data sparsity (can not extract clothing items from the text if it is never mentioned in the text) 3 A smaller, hand-labeled dataset by experts was used for evaluation Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 17 / 19
  • 24. Conclusion Instagram text is jus as noisy as Twitter, has a long-tail distribution, and is multi-lingual In shifting data domains where accurate labeled data is a rarity, like social media, weak supervision is a viable alternative. Combining weak labels with generative modeling beats majority voting. To extend Data programming to the multi-label scenario, a collection of generative models can be used to incorporate per-class accuracy. Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 18 / 19
  • 25. Thank you All code and most of the data is open source: https://github.com/shatha2014/FashionRec Questions? Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 19 / 19