This document describes research on weakly supervised deep text mining of Instagram data without labeled training data. The researchers developed a pipeline that uses various weak supervision sources like open APIs and pre-trained models to generate noisy labels for unlabeled Instagram posts. A generative model is then used to combine the noisy labels, and the combined labels are used to train a discriminative deep learning model for text classification tasks like clothing prediction. The researchers found that their approach using data programming to combine labels outperformed simple majority voting and achieved performance close to human levels.
Kim Hammar - Paper presentation WI 2018 Santiago
1. Deep Text Mining of Instagram Data Without Strong Supervision
WI 2018 Santiago | International Conference on Web Intelligence
Kim Hammar, Shatha Jaradat, Nima Dokoohaki, and Mihhail Matskin
KTH Royal Institute of Technology
kimham@kth.se
December 4, 2018
Kim Hammar (KTH) Text Mining in Social Media December 4, 2018 1 / 19
2. Key enabler for Deep Learning: Data growth
[Figure: Annual Size of the Global Datasphere. x-axis: Year (2009 to 2026); y-axis: Zettabytes (0 to 150). Source: IDC]
But what about Labeled Data?
5. Research Problem: Clothing Prediction on Instagram
[Figure: an Instagram Post is passed to a Text Model and an Image Model (neural networks), giving a Clothing Prediction ŷ, e.g. dress = 0, coat = 1, ..., skirt = 0]
6. This Paper: Text Classification Without Labeled Data
[Figure: unlabeled posts (post1, post2, post3, ..., postn) are processed with Text Mining (Word Embeddings, Neural Networks) to support Analytics: Trends detection (e.g. a plot of mentions of brand “foo” over time, 04.2017 to 03.2018, 0 to 30 mentions) and User recommendations]
9. Challenge: Noisy Text and No Labels
A case study of a corpus with 143 fashion accounts, 200K posts, 9M comments
Challenge 1: Noisy Text with a Long-Tail Distribution
[Figure: Log-log plot of the frequency of text per post. x-axis: log count; y-axis: log frequency; series: Comments, Words. Annotated points: posts with 0 comments, and posts with 0 words (comments + caption + tags)]
Text Statistic      Fraction of corpus size   Average/post
Emojis              0.15                      48.63
Hashtags            0.03                      9.14
User-handles        0.06                      18.62
Google-OOV words    0.46                      145.02
Aspell-OOV words    0.47                      147.61
Challenge 2: Lack of Expensive Labeled Training Data
[Figure: Raw Instagram Text and Human Annotations]
10. Alternative Sources of Supervision That Are Cheap but Weak
Strong supervision: Manual annotation by expert
Weak supervision: A signal that does not have full coverage/perfect accuracy
[Figure: Sources of Weak Supervision (Domain Heuristics, Database, APIs, Crowdworkers) → Combiner → Strong supervision]
13. Weak Supervision in the Fashion Domain
Open APIs:
Pre-trained Clothing Classification Models: DeepDetect [1]
Text mining system based on a fashion ontology and word embeddings:
[Figure: an Instagram Post p ∈ P with three text fields.
Caption: "Happy Monday! Here is my outfit of the day #streetstyle #me #canada #goals #chic #denim"
Tags: Zalando, user1, user2
Comments: "I love the bag! Is it Gucci? #goals @username"; "I #want the #baaag"; "Wow! The #jeans You are suclh an inspirationn, can you follow me back?"
The post is matched against an Ontology O (Brands, Items, Patterns, Materials, Styles) using Word Embeddings V, Edit-distance, ProBase Word Rankings, tfidf(wi, p, P), and a term-score t ∈ {caption, comment, user-tag, hashtag}. A Linear Combination of these signals gives the Ranked Noisy Labels r, e.g.
Items: (bag, 0.63), (jeans, 0.3), (top, 0.1)
Brands: (Gucci, 0.8), (Zalando, 0.3)
Material: (Denim, 1.0)
...]
[1] https://github.com/jolibrain/deepdetect
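The slide names edit-distance, a per-source term score, and a linear combination as ingredients of the text mining system. The sketch below illustrates only those three ideas; it is not the authors' implementation. The item list, the source weights, and the distance threshold are hypothetical values chosen for illustration, and tf-idf, word embeddings, and ProBase rankings are left out entirely.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical ontology entries and source weights (not taken from the paper).
ITEMS = ["bag", "jeans", "top"]
SOURCE_WEIGHT = {"caption": 1.0, "hashtag": 0.8, "user-tag": 0.6, "comment": 0.4}


def rank_items(tokens_by_source, max_distance=1):
    """Score each ontology item by a weighted sum of fuzzy token matches.

    Returns (item, normalized score) pairs sorted by descending score.
    """
    scores = {item: 0.0 for item in ITEMS}
    for source, tokens in tokens_by_source.items():
        for token in tokens:
            for item in ITEMS:
                if edit_distance(token.lower(), item) <= max_distance:
                    scores[item] += SOURCE_WEIGHT[source]
    total = sum(scores.values())
    if total == 0:
        return []
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(item, score / total) for item, score in ranked if score > 0]


post = {
    "caption": ["outfit", "denim"],
    "hashtag": ["jeans"],
    "comment": ["bag", "bagg", "jeanz"],
}
ranked = rank_items(post)
```

A threshold of 1 tolerates single-character typos such as "jeanz" but misses stretched spellings such as "baaag" (distance 2 from "bag"), which is one reason a real system would need more than edit-distance alone.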
14. How To Combine Several Sources Of Weak Supervision?
Simplest way to combine many weak signals: Majority Vote
Recent research on combination of weak signals: Data Programming [2]
[2] Alexander J. Ratner et al. “Data Programming: Creating Large Training Sets, Quickly”. In: Advances in Neural Information Processing Systems 29. Ed. by D. D. Lee et al. Curran Associates, Inc., 2016, pp. 3567–3575. URL: http://papers.nips.cc/paper/6523-data-programming-creating-large-training-sets-quickly.pdf.
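As a point of reference, majority voting over binary weak signals can be sketched as below. The vote encoding (+1 positive, -1 negative, 0 abstain) and the tie handling are assumptions made for this illustration.

```python
from collections import Counter

ABSTAIN = 0  # assumed convention: 0 = abstain, +1 = positive, -1 = negative


def majority_vote(votes):
    """Return +1 or -1 for the majority of non-abstaining votes.

    Returns 0 (abstain) on a tie or when every source abstains.
    """
    counts = Counter(v for v in votes if v != ABSTAIN)
    positives, negatives = counts[1], counts[-1]
    if positives == negatives:
        return ABSTAIN
    return 1 if positives > negatives else -1


# Five hypothetical weak signals voting on "does this post show a coat?"
label = majority_vote([1, -1, -1, 0, -1])
```

Every non-abstaining source counts equally here, regardless of how reliable it is.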
15. Model Weak Supervision With Generative Model
[Figure: unlabeled data → Labeling functions λ1 ... λn → Weak labels (matrix w1,1 ... wn,n) → Generative Model πα,β(Λ, Y) → Combined labels w1 ... wn]
Model weak supervision as labeling functions λi
  λi(unlabeled data) → label
Learn a generative model πα,β(Λ, Y) over the labeling process.
  Based on conflicts between labeling functions, assign each function an estimated accuracy αi.
  Based on the empirical coverage of the labeling functions, assign each function a coverage βi.
Given α and β for each labeling function, the generative model can be used to combine labels into a single probabilistic label
  Give more weight to high-accuracy functions
  If there is a lot of disagreement → low probability label
  If all labeling functions agree → high probability label
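To make the weighting idea concrete, here is a minimal sketch of combining binary votes into a probabilistic label once per-function accuracies are known. It is a simplification: it assumes the labeling functions are conditionally independent, that the class prior is uniform, and that abstaining carries no information. The accuracies are fixed by hand, whereas data programming estimates them from agreements and conflicts without ground truth.

```python
import math


def probabilistic_label(votes, accuracies):
    """P(y = +1 | votes) under independence and a uniform prior.

    votes: one of +1, -1, or 0 (abstain) per labeling function.
    accuracies: probability in (0, 1) that a function's non-abstaining vote is correct.
    """
    log_odds = 0.0
    for vote, acc in zip(votes, accuracies):
        # A confident function (acc near 1) moves the log-odds a lot;
        # a near-random one (acc near 0.5) barely moves them.
        log_odds += vote * math.log(acc / (1.0 - acc))
    return 1.0 / (1.0 + math.exp(-log_odds))


# One accurate function says "coat"; three weak ones say "not a coat".
votes = [1, -1, -1, -1]
hand_set_accuracies = [0.9, 0.55, 0.55, 0.55]  # illustrative values
p_coat = probabilistic_label(votes, hand_set_accuracies)
```

With these illustrative numbers the result stays above 0.5, although a plain majority vote over the same four votes would say "not a coat".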
16. Data Programming Intuition
[Figure: low-accuracy labeling functions and high-accuracy labeling functions, each voting either “it is a coat” or “it is not a coat”]
Probabilistic Label: 0.6 probability that it is a coat
Majority Vote: 1.0 probability that it is not a coat
17. Extension of Data Programming to Multi-Label Classification
Problem: Data programming is only defined for binary classification in the original paper
To make it work for the multi-class setting: model each labeling function as λi → ki ∈ {0, ..., N} instead of λi → ki ∈ {−1, 0, 1}.
Idea 1 for multi-label: model each labeling function as λi → ki = {v0, ..., vn} ∧ vj ∈ {−1, 0, 1}
Idea 2 for multi-label: learn a separate generative model for each class, and let each labeling function give a binary output for each class: λi,j → ki,j ∈ {−1, 0, 1}.
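Idea 2 amounts to re-encoding each labeling function's multi-label output as one ternary vote per class, so that each class gets its own vote list and its own generative model. The sketch below shows one possible encoding. Treating "class not predicted" as a negative vote and "no output" as an abstain is an assumption of this illustration, and the class list is an arbitrary subset.

```python
CLASSES = ["dress", "coat", "jeans", "skirt"]  # illustrative subset of classes


def to_binary_votes(predicted, classes=CLASSES):
    """Map one labeling function's output to a vote in {-1, 0, 1} per class.

    predicted: a set of class names, or None when the function abstains.
    """
    if predicted is None:
        return {c: 0 for c in classes}
    return {c: (1 if c in predicted else -1) for c in classes}


def votes_per_class(outputs, classes=CLASSES):
    """Regroup the outputs of all labeling functions into one vote list per class."""
    encoded = [to_binary_votes(output, classes) for output in outputs]
    return {c: [votes[c] for votes in encoded] for c in classes}


# Three hypothetical labeling functions; the last one abstains.
outputs = [{"coat", "jeans"}, {"coat"}, None]
per_class = votes_per_class(outputs)
```

Each list in `per_class` is then a binary voting problem of the kind the original data programming formulation handles.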
18. Trained Generative Models: Labeling Functions’ Accuracy Differs Between Classes
[Figure: Predicted accuracy in generative model. x-axis: Classes (accessories, bags, blouses, coats, dresses, jackets, jeans, cardigans, shoes, skirts, tights, tops, trousers); y-axis: Accuracy (0.4 to 1.0); one series per labeling function: Clarifai, Deepomatic, DeepDetect, Google Cloud Vision, SemCluster, KeywordSyntactic, KeywordSemantic]
Figure: Multiple generative models can capture a different accuracy for labeling functions for different classes.
21. Putting Everything Together
  1. Apply weak supervision to unlabeled data (open APIs, pre-trained models, domain heuristics etc.)
  2. Combine labels using majority voting or generative modelling (data programming)
  3. Use the combined labels for training a discriminative model using supervised machine learning.
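The last step can be illustrated with a deliberately small stand-in for the discriminative model. The sketch below trains a logistic regression on probabilistic labels by minimizing cross-entropy against targets in [0, 1]. The talk uses a CNN for text classification, so this is a swapped-in simpler model, and the features, labels, and hyperparameters are made-up toy values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """Predicted probability of the positive class for one feature vector."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_soft_logreg(features, soft_labels, epochs=500, lr=0.5):
    """Batch gradient descent on cross-entropy with soft targets in [0, 1]."""
    n, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, target in zip(features, soft_labels):
            # Gradient of the cross-entropy with respect to the logit.
            err = predict_proba(x, weights, bias) - target
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy bag-of-words features: [mentions "coat", mentions "dress"] (hypothetical).
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y_soft = [0.9, 0.8, 0.1, 0.2]  # probabilistic "coat" labels from the combiner
w, b = train_soft_logreg(X, y_soft)
```

Training against soft targets lets uncertain labels pull the model less strongly than confident ones, which is lost if the combined labels are rounded to 0 or 1 first.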
22. Pipeline for Weakly Supervised Classification in Instagram
Problem: A multi-class, multi-label classification problem with 13 output classes (dresses, coats, blouses, jeans, ...)
[Figure: an Instagram post ("Here is my outfit of the day #streetstyle #coat #parka #chic #winter") is passed to the Labeling Functions λi (SemCluster, KeyWordSyntactic, KeyWordSemantic, DeepDetect, ...), which produce the Votes vi: "jacket,jeans", "jeans,coat", "jeans,shoes", "nil", "coat,jeans", "coat", "coat". The Generative Model πα,β(Λ, Y) (nodes λ1 ... λ7 and v1 ... v13) combines the votes, and the combined labels are used to train the Discriminative Model d, a CNN for Text classification, which outputs e.g. dress = 0, coat = 1, ..., skirt = 0]
Figure: A pipeline for weakly supervised text classification of Instagram posts.
23. Data Programming Beats Majority Voting
Results
Data programming gives a 6 F1 points improvement over majority vote [3], achieving an F1 score of 0.61 (on a level with human performance)
Model                 Accuracy       Precision      Recall         Micro-F1       Macro-F1       Hamming Loss
CNN-DataProgramming   0.797 ± 0.01   0.566 ± 0.05   0.678 ± 0.04   0.616 ± 0.02   0.535 ± 0.01   0.195 ± 0.02
CNN-MajorityVote      0.739 ± 0.02   0.470 ± 0.06   0.686 ± 0.05   0.555 ± 0.03   0.465 ± 0.05   0.261 ± 0.03
DomainExpert          0.807          0.704          0.529          0.604          0.534          0.184
Main cause of error: data sparsity (clothing items cannot be extracted from the text if they are never mentioned in it)
[3] A smaller dataset, hand-labeled by experts, was used for evaluation
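For readers unfamiliar with the multi-label metrics in the table, the sketch below computes micro-F1, macro-F1, and Hamming loss using their common textbook definitions. The slides do not state which exact variants were used, so treat this as an assumption. A class with no true or predicted positives is scored as F1 = 0 here, and the example matrices are made up.

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_scores(y_true, y_pred):
    """y_true, y_pred: lists of 0/1 rows, one row per post, one column per class."""
    n_posts, n_classes = len(y_true), len(y_true[0])
    tp, fp, fn = [0] * n_classes, [0] * n_classes, [0] * n_classes
    mismatches = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for c, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                tp[c] += 1
            elif p:
                fp[c] += 1
            elif t:
                fn[c] += 1
            mismatches += int(t != p)
    return {
        # Micro: pool the counts over all classes, then compute one F1.
        "micro_f1": f1_score(sum(tp), sum(fp), sum(fn)),
        # Macro: compute F1 per class, then take the unweighted mean.
        "macro_f1": sum(f1_score(tp[c], fp[c], fn[c]) for c in range(n_classes)) / n_classes,
        # Hamming loss: fraction of individual post/class decisions that are wrong.
        "hamming_loss": mismatches / (n_posts * n_classes),
    }


y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
scores = multilabel_scores(y_true, y_pred)
```

Micro-F1 is dominated by frequent classes, while macro-F1 weights every class equally, which is why the two columns in the table can differ noticeably.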
24. Conclusion
Instagram text is just as noisy as Twitter text, has a long-tail distribution, and is multilingual
In shifting data domains where accurately labeled data is a rarity, like social media, weak supervision is a viable alternative.
Combining weak labels with generative modeling beats majority voting.
To extend data programming to the multi-label scenario, a collection of generative models can be used to incorporate per-class accuracy.
25. Thank you
All code and most of the data is open source:
https://github.com/shatha2014/FashionRec
Questions?