No Training Data? No Problem!
Weak Supervision to the Rescue!
Marie Stephen Leo
Based on the Medium post of the same title
About Me
Director of Data Science @ Edelman DxI
(World’s largest Public Relations agency)
Part-time Data Science Instructor @ General Assembly
✍ Top Writer in Artificial Intelligence @ Medium
🔬 Research Interests:
📝 NLP
🔎 Neural Search
⚙ MLOps
https://medium.com/@stephen-leo
www.linkedin.com/in/marie-stephen-leo
📝 Agenda
🚧 The challenge of contemporary Machine Learning
💡 Enter Weak Supervision!
🧰 Weak Supervision Frameworks
🏗 Conclusion & Future Direction
🚧 The challenge of contemporary Machine Learning
● ML requires substantial amounts of manually labeled training data
○ ImageNet contains 14 Million manually annotated images!
● Transfer Learning improves this situation
○ But fine-tuning a pre-trained model such as BERT still requires a few hundred to a
few thousand high-quality labels!
○ For example, to build a sentiment analysis model, someone must first manually read a
few thousand comments and record the sentiment of each one!
● Labeling data is
○ 💰 Costly
○ ⏱ Time Consuming
○ 🏋 Labor Intensive
○ 💢 Prone to human errors and biases
○ 🤷 Not the priority for subject matter experts in the business
🚧 The challenge of contemporary Machine Learning
● At the same time, unlabeled data is vastly abundant!
● Most organizations hold deep domain knowledge in the
form of boolean queries, heuristic rules, or tribal
knowledge that never gets used in ML models.
● What if?
○ Can we leverage the vast stores of domain knowledge in our
organisations to solve the labeling problem?
○ Can we label all the unlabeled data programmatically?
○ ML algorithms would then learn from domain subject
matter experts rather than from overworked intern
labelers, and from vastly more data than manual labeling
could ever collect!
Enter Weak Supervision!
💡 Weak Supervision: The shift to Data Centric AI
"Data-centric AI is the discipline of systematically engineering the data used to build an AI system."
- Andrew Ng (https://datacentricai.org/)
Model Centric AI (The 2010s)
Training Data: Fixed
Model: Iterate (+/- 1% accuracy change)
Data Centric AI (2020s)
Training Data: Iterate (+/- 10% accuracy change)
Model: Fixed
💡 Weak Supervision: 💰 A Billion Dollar Industry!
💡 Weak Supervision in one picture - Enabling Data Centric AI!
Reduce the effort of manual labeling while unlocking the vast knowledge of domain subject matter
experts (SMEs) by leveraging a diverse set of weaker, often programmatic, supervision sources.
💡 Weak Supervision details
During training, Weak Supervision generally has 4 steps:
1⃣ Writing Labeling Functions (LFs)
2⃣ Combining them with a Label Model (LM)
3⃣ Training a Downstream End Model (EM)
4⃣ Iterate!
During inference, we discard everything else and
use only the EM to make predictions,
so serving is no different from normal ML.
💡 Weak Supervision details - 1⃣ Labeling Functions (LF)
● Any Python function that takes in one datapoint as input and
returns either one label as output or abstains.
● Can be anything! Keywords, heuristics, search queries, outputs of
other models (e.g. zero-shot classifiers), labels from interns, etc.
● Use the Snorkel Python library [https://github.com/snorkel-team/snorkel]
● Not expected to be perfect! The next steps will denoise them.
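To make the idea concrete, here is a minimal plain-Python sketch of three labeling functions for the sentiment example. It deliberately does not import Snorkel. The label constants, keyword lists, function names, and the `apply_lfs` helper are all assumptions made up for illustration; in practice you would wrap similar functions with the library's own tooling.

```python
import re

# Label constants (hypothetical names chosen for this sketch).
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_positive_keywords(text: str) -> int:
    """Vote POSITIVE when an obviously positive word appears, else abstain."""
    return POSITIVE if re.search(r"\b(great|love|excellent)\b", text.lower()) else ABSTAIN


def lf_negative_keywords(text: str) -> int:
    """Vote NEGATIVE when an obviously negative word appears, else abstain."""
    return NEGATIVE if re.search(r"\b(terrible|hate|refund)\b", text.lower()) else ABSTAIN


def lf_thanks(text: str) -> int:
    """Heuristic: a comment that says thanks is usually a happy one."""
    return POSITIVE if re.search(r"\bthank(s| you)\b", text.lower()) else ABSTAIN


LFS = [lf_positive_keywords, lf_negative_keywords, lf_thanks]


def apply_lfs(texts, lfs=LFS):
    """Return one row of votes per text: a len(texts) x len(lfs) label matrix."""
    return [[lf(text) for lf in lfs] for text in texts]


comments = [
    "I love this phone, thanks!",
    "Terrible battery, I want a refund",
    "It arrived on Tuesday",
]
label_matrix = apply_lfs(comments)
```

Each row of `label_matrix` holds the votes of all LFs for one comment. The third comment matches no rule, so every LF abstains on it.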
💡 Weak Supervision details - 2⃣ Label Model (LM)
● If we have n LFs, then each row gets at most n labels (LFs can
abstain if they are not sure).
● We need to aggregate the outputs of the n individual LFs so that
each row only has one label.
● Majority Vote is the simplest Label Model.
● There are better ways! We can use the agreements and
disagreements between the various LFs with some matrix math.
○ Does not need any ground truth data at all! [Data Programming
Paper] [MeTaL Paper] [Flying Squid paper] [Poster] [Talk]
○ In practice having a small labeled validation set (~100 rows)
helps to convince yourself (and your boss!) that you’re doing
the correct thing.
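As a baseline, the Majority Vote label model mentioned above can be sketched in a few lines. This is only an illustration: the `-1` abstain value, the choice to abstain on ties, and the toy label matrix are assumptions of this sketch, not something the slides prescribe.

```python
from collections import Counter

ABSTAIN = -1  # assumed abstain value for this sketch


def majority_vote(votes, abstain=ABSTAIN):
    """Collapse one row of LF votes into a single weak label.

    Abstentions are ignored; if no LF voted, or the top two labels tie,
    the row itself is left unlabeled (abstain).
    """
    counted = Counter(v for v in votes if v != abstain)
    if not counted:
        return abstain
    ranked = counted.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return abstain  # tie between the two most frequent labels
    return ranked[0][0]


# Toy label matrix: one row per datapoint, one column per LF.
label_matrix = [
    [1, -1, 1],    # two LFs agree on label 1
    [-1, 0, -1],   # a single LF votes 0
    [-1, -1, -1],  # every LF abstains
    [1, 0, -1],    # one vote each way: a tie
]
weak_labels = [majority_vote(row) for row in label_matrix]
```

Majority vote treats every LF as equally reliable. The better label models referenced above instead weight LFs by accuracies estimated from their agreements and disagreements.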
💡 Weak Supervision details - 3⃣ End Model (EM)
● The output of the Label Model (LM) is one Weak Label for each row
generated by combining all the weak LFs.
● Use these weak labels as training data to fine-tune a downstream
pre-trained model so that it generalizes beyond the weak labels.
○ Large pre-trained models such as BERT already encode a great
deal of knowledge about language.
○ Fine-tuning BERT on weak labels is sufficient for it to learn the
task, even beyond the weak labels.
● Since the LFs are programmatic labeling sources, we can run the LFs
and LM on our entire unlabeled corpus to generate many labels.
● The EM benefits from the more extensive training datasets created
and incorporates the domain knowledge of SMEs! Win - Win!
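The sketch below illustrates the End Model step on a toy scale. To stay dependency-free it swaps BERT for a tiny bag-of-words Naive Bayes classifier, so it is a stand-in and not the fine-tuning setup described above. The texts, the weak labels, and all function names are hypothetical. The point it shows: the end model trains only on rows that received a weak label, yet learns from words that no LF ever looked at.

```python
import math
from collections import Counter, defaultdict

ABSTAIN = -1  # assumed abstain value for this sketch


def train_naive_bayes(texts, weak_labels):
    """Fit a tiny bag-of-words Naive Bayes on rows that have a weak label."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in zip(texts, weak_labels):
        if label == ABSTAIN:
            continue  # rows without a weak label are not used for training
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the most likely label, using add-one smoothed log-probabilities."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


texts = [
    "love the fast delivery",
    "great screen and fast delivery",
    "terrible slow delivery",
    "hate the slow support",
    "arrived on tuesday",
]
# Weak labels as a label model might emit them (1 = positive, 0 = negative).
weak_labels = [1, 1, 0, 0, ABSTAIN]

model = train_naive_bayes(texts, weak_labels)
fast_prediction = predict(model, "fast shipping")
slow_prediction = predict(model, "slow shipping")
```

Here "fast" and "slow" are not keywords of any LF, but they co-occur with the weak labels, so the end model picks them up. A pre-trained language model is expected to generalize far more strongly than this toy classifier.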
💡 Weak Supervision details - Inference Time
Data → EM → Prediction
● But don’t throw away your LFs and LM just yet!
● You can reuse them for model retraining at a regular cadence or model
monitoring for performance degradation over time.
● Creating LFs is a one-time effort, versus relabeling every time your model drifts!
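One possible way to reuse the LFs and LM for monitoring is to re-run them on fresh production data and track how often the End Model agrees with the resulting weak labels. The sketch below assumes that setup. The function names and the 0.8 threshold are illustrative assumptions, not a recommendation from the talk.

```python
ABSTAIN = -1  # assumed abstain value for this sketch


def agreement_rate(predictions, weak_labels, abstain=ABSTAIN):
    """Share of rows where the End Model matches the weak label.

    Rows on which the label model abstained are ignored. Returns None
    when no row has a weak label to compare against.
    """
    pairs = [(p, w) for p, w in zip(predictions, weak_labels) if w != abstain]
    if not pairs:
        return None
    return sum(p == w for p, w in pairs) / len(pairs)


def needs_retraining(predictions, weak_labels, threshold=0.8):
    """Flag possible drift when agreement falls below the chosen threshold."""
    rate = agreement_rate(predictions, weak_labels)
    return rate is not None and rate < threshold


# End Model predictions vs. fresh weak labels for the same batch of rows.
em_predictions = [1, 0, 1, 1]
fresh_weak_labels = [1, 0, 0, ABSTAIN]

rate = agreement_rate(em_predictions, fresh_weak_labels)
retrain = needs_retraining(em_predictions, fresh_weak_labels)
```

A falling agreement rate does not prove the End Model is wrong, since the LFs may be the ones that have gone stale. It is a cheap signal that either the data or the rules deserve another look.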
Weak Supervision Frameworks
🧰 Weak Supervision Frameworks - 🔧 WRENCH
[WRENCH Paper] [Github]
🧰 Weak Supervision Frameworks - 🔧 WRENCH
Despite not using any labeled data for training,
weakly supervised models with appropriate LFs
can achieve performance close to that of fully
supervised models on many tasks!
[WRENCH Paper] [Github]
🧰 Weak Supervision Frameworks - Snorkel
[Data Programming (DP) Paper] [MeTaL Paper] [Github] [Poster]
A matrix completion problem that is solved with SGD [Talk]
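The SGD-based matrix completion itself is not reproduced here. As a simpler stand-in, the sketch below uses the closed-form "triplet" idea associated with the FlyingSquid paper cited earlier, to show how LF accuracies can be recovered from agreement rates alone. It assumes binary labels in {-1, +1}, no abstentions, exactly three LFs that are conditionally independent given the true label, and LFs that are better than chance. The simulated data and all names are hypothetical.

```python
import math
import random


def simulate_votes(accuracies, n_rows, seed=0):
    """Simulate LF votes in {-1, +1}; each LF is right with its own probability."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        y = rng.choice([-1, 1])  # hidden true label, never shown to the estimator
        rows.append([y if rng.random() < p else -y for p in accuracies])
    return rows


def mean_product(rows, i, j):
    """Empirical E[lf_i * lf_j]: positive when the two LFs tend to agree."""
    return sum(r[i] * r[j] for r in rows) / len(rows)


def triplet_accuracies(rows):
    """Estimate the accuracy of three LFs from pairwise agreement only.

    Under conditional independence, E[lf_i * lf_j] = a_i * a_j with
    a_i = 2 * accuracy_i - 1, so each a_i follows from three pairwise moments.
    """
    m01 = mean_product(rows, 0, 1)
    m02 = mean_product(rows, 0, 2)
    m12 = mean_product(rows, 1, 2)
    a0 = math.sqrt(abs(m01 * m02 / m12))
    a1 = math.sqrt(abs(m01 * m12 / m02))
    a2 = math.sqrt(abs(m02 * m12 / m01))
    return [(1 + a) / 2 for a in (a0, a1, a2)]


true_accuracies = [0.9, 0.8, 0.7]
votes = simulate_votes(true_accuracies, n_rows=20000)
estimated = triplet_accuracies(votes)
```

The estimator never sees the true labels, yet `estimated` should land close to `true_accuracies`. This is the intuition behind "no ground truth needed"; real label models relax the independence and no-abstain assumptions made here.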
🧰 Weak Supervision Frameworks - 📐 COSINE
[COSINE Paper] [Github] [WRENCH Implementation]
COSINE is short for COntrastive Self-training for fINE-Tuning Pretrained Language Model
● Initialization
● Sample Reweighting
● Classification Loss on high confidence samples
● Contrastive Loss on high confidence samples
● Confidence regularization on all samples
🧰 Weak Supervision Frameworks - 🔎 Heuristic LF selection
● In real-world testing, accuracy can vary a lot depending on the quality of the LFs selected.
● Our solution is to use a small hand-labeled validation dataset or iterative active learning to
choose the best LFs from an LF Zoo.
● This is a highly iterative process: start with a small number of LFs and refine them over time. The
analysis can also expose gaps in our understanding of the problem domain!
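A minimal sketch of the validation-set variant of this selection might look as follows. It assumes each LF's votes on the small hand-labeled set are already available. The coverage and accuracy thresholds, the LF names, and the toy votes are hypothetical choices for illustration.

```python
ABSTAIN = -1  # assumed abstain value for this sketch


def lf_stats(votes, gold, abstain=ABSTAIN):
    """Coverage: share of rows the LF labels. Accuracy: share of those that match gold."""
    fired = [(v, g) for v, g in zip(votes, gold) if v != abstain]
    coverage = len(fired) / len(gold)
    accuracy = sum(v == g for v, g in fired) / len(fired) if fired else 0.0
    return coverage, accuracy


def select_lfs(lf_votes, gold, min_accuracy=0.7, min_coverage=0.1):
    """Keep the LFs from the zoo that clear both thresholds on the validation set."""
    keep = []
    for name, votes in lf_votes.items():
        coverage, accuracy = lf_stats(votes, gold)
        if accuracy >= min_accuracy and coverage >= min_coverage:
            keep.append(name)
    return keep


# Hand-labeled validation rows and each candidate LF's votes on them.
gold = [1, 1, 0, 0, 1, 0]
lf_zoo = {
    "lf_keywords": [1, 1, -1, 0, -1, 0],     # fires on 4 of 6 rows, always right
    "lf_noisy": [0, 1, 1, 0, 0, 1],          # fires everywhere, mostly wrong
    "lf_silent": [-1, -1, -1, -1, -1, -1],   # never fires
}
selected = select_lfs(lf_zoo, gold)
```

Inspecting the per-LF coverage and accuracy, rather than only the final list, is what tends to surface gaps in the rules or in the understanding of the domain.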
Conclusion & Future Direction
🏗 Conclusion
● Shift to Data Centric AI
● Weak Supervision for programmatic data labeling
○ 1⃣ Writing Labeling Functions (LFs)
○ 2⃣ Combining them with Label Model (LM)
○ 3⃣ Training a Downstream End Model (EM)
○ 4⃣ Iterate!
● Weak Supervision frameworks
○ 🔧 WRENCH
○ Snorkel
○ 📐 COSINE
○ 🔎 Heuristic LF selection
🏗 Future Direction
● More research into augmenting domain knowledge LFs
with automated LFs
○ Want To Reduce Labeling Cost? GPT-3 Can Help [Paper] [Github]
○ X-Class: Text Classification with Extremely Weak Supervision [Paper]
[Github]
○ OptimSeed: Seed Word Selection for Weakly-Supervised Text
Classification with Unsupervised Error Estimation [Paper] [Github]
● The rise of UI-based tools, since Weak Supervision relies
heavily on SMEs who may not be coding experts!
○ 🌟 Open Source: Rubrix
○ 💰 Commercial: Snorkel Flow ($1 billion at work!)
📚 Resources
● Medium Post that this talk is based on: Link
● Snorkel Tutorials: Snorkel Website
● Collection of resources on Data Centric AI: Link
● Cool Icons: Flaticon
● Papers: Arxiv
● O’Reilly Book: Link
Questions?

More Related Content

What's hot

第73回 Machine Learning 15minutes ! IBM AI Foundation Modelsへの取り組み
第73回 Machine Learning 15minutes ! IBM AI Foundation Modelsへの取り組み第73回 Machine Learning 15minutes ! IBM AI Foundation Modelsへの取り組み
第73回 Machine Learning 15minutes ! IBM AI Foundation Modelsへの取り組み
Tsuyoshi Hirayama
 
Holland & Barrett: Gen AI Prompt Engineering for Tech teams
Holland & Barrett: Gen AI Prompt Engineering for Tech teamsHolland & Barrett: Gen AI Prompt Engineering for Tech teams
Holland & Barrett: Gen AI Prompt Engineering for Tech teams
Dobo Radichkov
 
Episode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer RoadmapEpisode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Anant Corporation
 
Transformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGITransformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGI
SynaptonIncorporated
 
Fine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP modelsFine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP models
OVHcloud
 
Generative AI and Student Writing.pptx
Generative AI and Student Writing.pptxGenerative AI and Student Writing.pptx
Generative AI and Student Writing.pptx
Mike Sharples
 
【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...
【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...
【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...
Deep Learning JP
 
Product Management for AI by Google PM
Product Management for AI by Google PMProduct Management for AI by Google PM
Product Management for AI by Google PM
Product School
 
Low Code Neuro-Symbolic Agents.pdf
Low Code Neuro-Symbolic Agents.pdfLow Code Neuro-Symbolic Agents.pdf
Low Code Neuro-Symbolic Agents.pdf
Denis Gagné
 
Google Cloud Machine Learning
 Google Cloud Machine Learning  Google Cloud Machine Learning
Google Cloud Machine Learning
India Quotient
 
LLaMA 2.pptx
LLaMA 2.pptxLLaMA 2.pptx
LLaMA 2.pptx
RkRahul16
 
AI-Driven Personalized Email Marketing
AI-Driven Personalized Email MarketingAI-Driven Personalized Email Marketing
AI-Driven Personalized Email Marketing
Databricks
 
Awesome Prompts Naveed
Awesome Prompts NaveedAwesome Prompts Naveed
Awesome Prompts Naveed
Naveed Ahmed Siddiqui
 
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
Po-Chuan Chen
 
Prompting is an art / Sztuka promptowania
Prompting is an art / Sztuka promptowaniaPrompting is an art / Sztuka promptowania
Prompting is an art / Sztuka promptowania
Michal Jaskolski
 
Transformers in 2021
Transformers in 2021Transformers in 2021
Transformers in 2021
Grigory Sapunov
 
Oleksandr Krakovetskyi: Generative AI: real cases and trends (UA)
Oleksandr Krakovetskyi: Generative AI: real cases and trends (UA)Oleksandr Krakovetskyi: Generative AI: real cases and trends (UA)
Oleksandr Krakovetskyi: Generative AI: real cases and trends (UA)
Lviv Startup Club
 
How to Teach and Learn with ChatGPT - BETT 2023
How to Teach and Learn with ChatGPT - BETT 2023How to Teach and Learn with ChatGPT - BETT 2023
How to Teach and Learn with ChatGPT - BETT 2023
Dominik Lukes
 
Ph.D. Defense Presentation Slides (Changhee Han) カリスの東大博論審査会(公聴会)発表スライド Patho...
Ph.D. Defense Presentation Slides (Changhee Han) カリスの東大博論審査会(公聴会)発表スライド Patho...Ph.D. Defense Presentation Slides (Changhee Han) カリスの東大博論審査会(公聴会)発表スライド Patho...
Ph.D. Defense Presentation Slides (Changhee Han) カリスの東大博論審査会(公聴会)発表スライド Patho...
カリス 東大AI博士
 
H transformer-1d paper review!!
H transformer-1d paper review!!H transformer-1d paper review!!
H transformer-1d paper review!!
taeseon ryu
 

What's hot (20)

第73回 Machine Learning 15minutes ! IBM AI Foundation Modelsへの取り組み
第73回 Machine Learning 15minutes ! IBM AI Foundation Modelsへの取り組み第73回 Machine Learning 15minutes ! IBM AI Foundation Modelsへの取り組み
第73回 Machine Learning 15minutes ! IBM AI Foundation Modelsへの取り組み
 
Holland & Barrett: Gen AI Prompt Engineering for Tech teams
Holland & Barrett: Gen AI Prompt Engineering for Tech teamsHolland & Barrett: Gen AI Prompt Engineering for Tech teams
Holland & Barrett: Gen AI Prompt Engineering for Tech teams
 
Episode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer RoadmapEpisode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
 
Transformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGITransformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGI
 
Fine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP modelsFine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP models
 
Generative AI and Student Writing.pptx
Generative AI and Student Writing.pptxGenerative AI and Student Writing.pptx
Generative AI and Student Writing.pptx
 
【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...
【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...
【DL輪読会】Deep Transformers without Shortcuts: Modifying Self-attention for Fait...
 
Product Management for AI by Google PM
Product Management for AI by Google PMProduct Management for AI by Google PM
Product Management for AI by Google PM
 
Low Code Neuro-Symbolic Agents.pdf
Low Code Neuro-Symbolic Agents.pdfLow Code Neuro-Symbolic Agents.pdf
Low Code Neuro-Symbolic Agents.pdf
 
Google Cloud Machine Learning
 Google Cloud Machine Learning  Google Cloud Machine Learning
Google Cloud Machine Learning
 
LLaMA 2.pptx
LLaMA 2.pptxLLaMA 2.pptx
LLaMA 2.pptx
 
AI-Driven Personalized Email Marketing
AI-Driven Personalized Email MarketingAI-Driven Personalized Email Marketing
AI-Driven Personalized Email Marketing
 
Awesome Prompts Naveed
Awesome Prompts NaveedAwesome Prompts Naveed
Awesome Prompts Naveed
 
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
 
Prompting is an art / Sztuka promptowania
Prompting is an art / Sztuka promptowaniaPrompting is an art / Sztuka promptowania
Prompting is an art / Sztuka promptowania
 
Transformers in 2021
Transformers in 2021Transformers in 2021
Transformers in 2021
 
Oleksandr Krakovetskyi: Generative AI: real cases and trends (UA)
Oleksandr Krakovetskyi: Generative AI: real cases and trends (UA)Oleksandr Krakovetskyi: Generative AI: real cases and trends (UA)
Oleksandr Krakovetskyi: Generative AI: real cases and trends (UA)
 
How to Teach and Learn with ChatGPT - BETT 2023
How to Teach and Learn with ChatGPT - BETT 2023How to Teach and Learn with ChatGPT - BETT 2023
How to Teach and Learn with ChatGPT - BETT 2023
 
Ph.D. Defense Presentation Slides (Changhee Han) カリスの東大博論審査会(公聴会)発表スライド Patho...
Ph.D. Defense Presentation Slides (Changhee Han) カリスの東大博論審査会(公聴会)発表スライド Patho...Ph.D. Defense Presentation Slides (Changhee Han) カリスの東大博論審査会(公聴会)発表スライド Patho...
Ph.D. Defense Presentation Slides (Changhee Han) カリスの東大博論審査会(公聴会)発表スライド Patho...
 
H transformer-1d paper review!!
H transformer-1d paper review!!H transformer-1d paper review!!
H transformer-1d paper review!!
 

Similar to Weak Supervision.pdf

Dealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP MeetupDealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Yves Peirsman
 
Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical
Dhruv Gohil
 
DataScientist Job : Between Myths and Reality.pdf
DataScientist Job : Between Myths and Reality.pdfDataScientist Job : Between Myths and Reality.pdf
DataScientist Job : Between Myths and Reality.pdf
Jedha Bootcamp
 
How to fine-tune and develop your own large language model.pptx
How to fine-tune and develop your own large language model.pptxHow to fine-tune and develop your own large language model.pptx
How to fine-tune and develop your own large language model.pptx
Knoldus Inc.
 
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
InfluxData
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or reality
Awantik Das
 
ITB_2023_Chatgpt_Box_Scott_Steinbeck.pdf
ITB_2023_Chatgpt_Box_Scott_Steinbeck.pdfITB_2023_Chatgpt_Box_Scott_Steinbeck.pdf
ITB_2023_Chatgpt_Box_Scott_Steinbeck.pdf
Ortus Solutions, Corp
 
VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1
BigML, Inc
 
Machine Learning Product Managers Meetup Event
Machine Learning Product Managers Meetup EventMachine Learning Product Managers Meetup Event
Machine Learning Product Managers Meetup Event
Benjamin Schulte
 
Best practices for structuring Machine Learning code
Best practices for structuring Machine Learning codeBest practices for structuring Machine Learning code
Best practices for structuring Machine Learning code
Erlangen Artificial Intelligence & Machine Learning Meetup
 
“An Industry Standard Performance Benchmark Suite for Machine Learning,” a Pr...
“An Industry Standard Performance Benchmark Suite for Machine Learning,” a Pr...“An Industry Standard Performance Benchmark Suite for Machine Learning,” a Pr...
“An Industry Standard Performance Benchmark Suite for Machine Learning,” a Pr...
Edge AI and Vision Alliance
 
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDistributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Databricks
 
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f..."Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...
Edge AI and Vision Alliance
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?
Ahmed Kamal
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabs
zekeLabs Technologies
 
Domain specific nlp pipelines
Domain specific nlp pipelinesDomain specific nlp pipelines
Domain specific nlp pipelines
Rajesh Muppalla
 
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Fwdays
 
DCXS best selfcare-solutions DynamicFAQ
DCXS best selfcare-solutions DynamicFAQDCXS best selfcare-solutions DynamicFAQ
DCXS best selfcare-solutions DynamicFAQ
LilianBernardin
 
BSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 SessionsBSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 Sessions
BigML, Inc
 
What drives Innovation? Innovations And Technological Solutions for the Distr...
What drives Innovation? Innovations And Technological Solutions for the Distr...What drives Innovation? Innovations And Technological Solutions for the Distr...
What drives Innovation? Innovations And Technological Solutions for the Distr...
Stefano Fago
 

Similar to Weak Supervision.pdf (20)

Dealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP MeetupDealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
 
Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical
 
DataScientist Job : Between Myths and Reality.pdf
DataScientist Job : Between Myths and Reality.pdfDataScientist Job : Between Myths and Reality.pdf
DataScientist Job : Between Myths and Reality.pdf
 
How to fine-tune and develop your own large language model.pptx
How to fine-tune and develop your own large language model.pptxHow to fine-tune and develop your own large language model.pptx
How to fine-tune and develop your own large language model.pptx
 
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or reality
 
ITB_2023_Chatgpt_Box_Scott_Steinbeck.pdf
ITB_2023_Chatgpt_Box_Scott_Steinbeck.pdfITB_2023_Chatgpt_Box_Scott_Steinbeck.pdf
ITB_2023_Chatgpt_Box_Scott_Steinbeck.pdf
 
VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1VSSML16 LR1. Summary Day 1
VSSML16 LR1. Summary Day 1
 
Machine Learning Product Managers Meetup Event
Machine Learning Product Managers Meetup EventMachine Learning Product Managers Meetup Event
Machine Learning Product Managers Meetup Event
 
Best practices for structuring Machine Learning code
Best practices for structuring Machine Learning codeBest practices for structuring Machine Learning code
Best practices for structuring Machine Learning code
 
“An Industry Standard Performance Benchmark Suite for Machine Learning,” a Pr...
“An Industry Standard Performance Benchmark Suite for Machine Learning,” a Pr...“An Industry Standard Performance Benchmark Suite for Machine Learning,” a Pr...
“An Industry Standard Performance Benchmark Suite for Machine Learning,” a Pr...
 
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDistributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
 
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f..."Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...
"Solving Vision Tasks Using Deep Learning: An Introduction," a Presentation f...
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabs
 
Domain specific nlp pipelines
Domain specific nlp pipelinesDomain specific nlp pipelines
Domain specific nlp pipelines
 
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
 
DCXS best selfcare-solutions DynamicFAQ
DCXS best selfcare-solutions DynamicFAQDCXS best selfcare-solutions DynamicFAQ
DCXS best selfcare-solutions DynamicFAQ
 
BSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 SessionsBSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 Sessions
 
What drives Innovation? Innovations And Technological Solutions for the Distr...
What drives Innovation? Innovations And Technological Solutions for the Distr...What drives Innovation? Innovations And Technological Solutions for the Distr...
What drives Innovation? Innovations And Technological Solutions for the Distr...
 

Recently uploaded

Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 

Recently uploaded (20)

Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
Weak Supervision.pdf

  • 1. No Training Data? No Problem! Weak Supervision to the Rescue! Marie Stephen Leo Based on the Medium post of the same title
  • 2. About Me Director of Data Science @ Edelman DxI (World’s largest Public Relations agency) Part time Data Science Instructor @ General Assembly ✍ Top Writer in Artificial Intelligence @ Medium 🔬 Research Interests: 📝 NLP 🔎 Neural Search ⚙ MLOps https://medium.com/@stephen-leo www.linkedin.com/in/marie-stephen-leo Marie Stephen Leo
  • 3. 📝 Agenda 🚧 The challenge of contemporary Machine Learning 💡 Enter Weak Supervision! 🧰 Weak Supervision Frameworks 🏗 Conclusion & Future Direction
  • 4. 🚧 The challenge of contemporary Machine Learning ● ML requires substantial amounts of manually labeled training data ○ ImageNet contains 14 Million manually annotated images! ● Transfer Learning improves this situation ○ But fine-tuning models such as BERT still requires a few hundred to a few thousand high-quality labels! ○ E.g., to build a sentiment analysis model, someone must first manually read a few thousand comments and record the sentiment of each one! ● Labeling data is ○ 💰 Costly ○ ⏱ Time Consuming ○ 🏋 Labor Intensive ○ 💢 Prone to human errors and biases ○ 🤷 Not the priority for subject matter experts in the business
  • 5. 🚧 The challenge of contemporary Machine Learning ● At the same time, unlabeled data is vastly abundant! ● Most organizations have an immense depth of domain knowledge in boolean queries, heuristic rules, or tribal knowledge that doesn't get used in ML models. ● What If? ○ Can we leverage the vast stores of domain knowledge in our organizations to solve the labeling problem? ○ Can we label all the unlabeled data programmatically? ○ ML algorithms would then learn from domain subject matter experts rather than from some poor intern labelers, and from vastly more data than manual labeling could ever collect!
  • 7. 💡 Weak Supervision: The shift to Data Centric AI ● "Data-centric AI is the discipline of systematically engineering the data used to build an AI system" - Andrew Ng (https://datacentricai.org/) ● Model Centric AI (the 2010s): Training Data: Fixed; Model: Iterate (+/- 1% accuracy change) ● Data Centric AI (2020s): Training Data: Iterate (+/- 10% accuracy change); Model: Fixed
  • 8. 💡 Weak Supervision: 💰 A Billion Dollar Industry!
  • 9. 💡 Weak Supervision in one picture - Enabling Data Centric AI! Reduce the efforts of manual labeling while unlocking the vast knowledge of domain subject matter experts (SMEs) by leveraging a diversity of weaker, often programmatic supervision sources.
  • 10. 💡 Weak Supervision details ● During training, Weak Supervision in general has 4 steps: 1⃣ Writing Labeling Functions (LFs) 2⃣ Combining them with a Label Model (LM) 3⃣ Training a Downstream End Model (EM) 4⃣ Iterate! ● During inference, we discard everything else and use only the EM to make predictions. Hence inference is no different from normal ML.
  • 11. 💡 Weak Supervision details - 1⃣ Labeling Functions (LF) ● Any Python function that takes in one datapoint as input and either returns one label as output or abstains. ● Can be anything! Keywords, heuristics, search queries, outputs of other models (e.g. zero-shot), labels from interns, etc. ● Use the Snorkel Python library [https://github.com/snorkel-team/snorkel] ● Not expected to be perfect! The next steps will denoise them.
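To make the idea of a labeling function concrete, here is a minimal plain-Python sketch for a sentiment task. The label constants, the three keyword rules, and the `apply_lfs` helper are illustrative assumptions rather than anything taken from the slides; Snorkel offers a `labeling_function` decorator for the same purpose, but this sketch deliberately avoids any library dependency.

```python
# Hypothetical label scheme: -1 means "abstain" (the LF has no opinion).
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_contains_great(text: str) -> int:
    # Keyword heuristic: "great" suggests a positive comment.
    return POSITIVE if "great" in text.lower() else ABSTAIN

def lf_contains_refund(text: str) -> int:
    # Keyword heuristic: asking for a refund suggests a negative comment.
    return NEGATIVE if "refund" in text.lower() else ABSTAIN

def lf_many_exclamations(text: str) -> int:
    # Noisy heuristic: two or more exclamation marks suggest enthusiasm.
    return POSITIVE if text.count("!") >= 2 else ABSTAIN

LFS = [lf_contains_great, lf_contains_refund, lf_many_exclamations]

def apply_lfs(texts):
    """Return a label matrix with one row per text and one column per LF."""
    return [[lf(text) for lf in LFS] for text in texts]

comments = [
    "Great product, works well",
    "I want a refund",
    "It arrived on Tuesday",
]
label_matrix = apply_lfs(comments)
for comment, row in zip(comments, label_matrix):
    print(row, comment)
```

Each row of `label_matrix` holds the votes of all LFs for one comment; the third comment triggers no rule, so every LF abstains on it.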
  • 12. 💡 Weak Supervision details - 2⃣ Label Model (LM) ● If we have n LFs, then each row gets at most n labels (LFs can abstain if they are not sure). ● We need to aggregate the outputs of the n individual LFs so that each row has only one label. ● Majority Vote is the simplest Label Model. ● There are better ways! We can use the agreements and disagreements between the various LFs with some matrix math. ○ Does not need any ground truth data at all! [Data Programming Paper] [MeTaL Paper] [Flying Squid paper] [Poster] [Talk] ○ In practice, having a small labeled validation set (~100 rows) helps to convince yourself (and your boss!) that you’re doing the correct thing.
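Majority Vote, the simplest Label Model mentioned above, can be sketched in a few lines of plain Python. The `-1` abstain value and the choice to abstain on ties are assumptions of this sketch, not something the slides specify; the learned label models from the cited papers weight LFs by their estimated accuracies instead of counting votes equally.

```python
from collections import Counter

ABSTAIN = -1  # assumed convention: an LF returns -1 when it has no opinion

def majority_vote(row):
    """Aggregate one row of LF outputs into a single weak label."""
    votes = [v for v in row if v != ABSTAIN]
    if not votes:
        return ABSTAIN  # no LF fired on this row
    ranked = Counter(votes).most_common()
    # Assumption of this sketch: a tie between the top labels means abstain.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

# One row per datapoint, one column per LF.
label_matrix = [
    [1, -1, 1],    # two LFs agree on label 1
    [-1, 0, -1],   # only one LF fires
    [1, 0, -1],    # the LFs disagree: tie
    [-1, -1, -1],  # every LF abstains
]
weak_labels = [majority_vote(row) for row in label_matrix]
print(weak_labels)
```

Rows that end up as abstain are typically left out of the training data for the end model.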
  • 13. 💡 Weak Supervision details - 3⃣ End Model (EM) ● The output of the Label Model (LM) is one Weak Label for each row, generated by combining all the weak LFs. ● Use these weak labels as the training data to fine-tune a downstream pre-trained model so that it generalizes beyond the weak labels. ○ Large pre-trained models such as BERT already have a tremendous understanding of language. ○ Fine-tuning BERT on weak labels is sufficient for it to learn the task, even beyond the weak labels. ● Since the LFs are programmatic labeling sources, we can run the LFs and LM on our entire unlabeled corpus to generate many labels. ● The EM benefits from the more extensive training datasets created and incorporates the domain knowledge of SMEs! Win - Win!
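The following toy sketch illustrates how an end model trained on weak labels can generalize beyond the rules that produced them. It is not BERT fine-tuning: the word-count classifier, the example texts, and the weak labels are all stand-ins invented for illustration, chosen so the example runs with the standard library only.

```python
from collections import Counter, defaultdict

ABSTAIN = -1  # assumed convention for rows the Label Model could not label

def train_end_model(texts, weak_labels):
    """Count how often each word co-occurs with each weak label."""
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, weak_labels):
        if label == ABSTAIN:
            continue  # unlabeled rows are skipped during training
        for word in set(text.lower().split()):
            word_counts[word][label] += 1
    return word_counts

def predict(word_counts, text):
    """Pick the label whose training words overlap most with the text."""
    scores = Counter()
    for word in set(text.lower().split()):
        scores.update(word_counts.get(word, {}))
    if not scores:
        return ABSTAIN  # no known word in the text
    return scores.most_common(1)[0][0]

# Hypothetical weak labels, e.g. from keyword LFs on "great" and "refund".
texts = [
    "great and reliable phone",
    "great reliable service",
    "refund please, the screen is broken",
    "broken on arrival, refund requested",
    "it arrived on tuesday",
]
weak_labels = [1, 1, 0, 0, -1]

model = train_end_model(texts, weak_labels)
# Neither text below contains an LF keyword, yet the model has picked up
# "reliable" and "broken" as signals from the weakly labeled rows.
prediction = predict(model, "reliable battery")
print(prediction, predict(model, "broken charger"))
```

A pre-trained language model plays the same role far more effectively, because it starts from a much richer representation of text than word counts.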
  • 14. 💡 Weak Supervision details - Inference Time [Figure: Data, Prediction] ● But don’t throw away your LFs and LM just yet! ● You can reuse them for model retraining at a regular cadence or for monitoring the model for performance degradation over time. ● Creating LFs is a one-time effort, versus relabeling every time your model drifts!
  • 16. 🧰 Weak Supervision Frameworks - 🔧 WRENCH [WRENCH Paper] [Github]
  • 17. 🧰 Weak Supervision Frameworks - 🔧 WRENCH Despite not using any labeled data for training, weakly supervised models with appropriate LFs can achieve performance that’s close to fully supervised models on many tasks! [WRENCH Paper] [Github]
  • 18. 🧰 Weak Supervision Frameworks - Snorkel ● A matrix completion problem that is solved with SGD [Data Programming (DP) Paper] [MeTaL Paper] [Github] [Poster] [Talk]
  • 19. 🧰 Weak Supervision Frameworks - 📐 COSINE [COSINE Paper] [Github] [WRENCH Implementation] ● COSINE is short for COntrastive Self-training for fINE-Tuning Pretrained Language Model ● Figure labels: Initialization; Sample Reweighting; Classification Loss on high confidence samples; Contrastive Loss on high confidence samples; Confidence regularization on all samples
  • 20. 🧰 Weak Supervision Frameworks - 🔎 Heuristic LF selection ● In real world testing, accuracy can vary a lot depending on the quality of the LFs selected. ● Our solution is to use a small hand labeled validation dataset or iterative active learning to choose the best LFs from an LF Zoo. ● This is a highly iterative process: start with a small number of LFs and refine them over time. The analysis could also expose gaps in our understanding of the problem domain!
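One simple way to choose LFs with a small hand labeled validation set is to score each LF by coverage and accuracy and keep those above a threshold. The sketch below assumes that approach; the thresholds, the helper names, and the toy data are illustrative assumptions and not the selection procedure used in the talk.

```python
ABSTAIN = -1  # assumed convention: an LF returns -1 when it has no opinion

def lf_stats(lf_votes, gold):
    """Return (coverage, accuracy) of one LF on a labeled validation set."""
    fired = [(v, g) for v, g in zip(lf_votes, gold) if v != ABSTAIN]
    coverage = len(fired) / len(gold)
    # Accuracy is measured only on rows where the LF did not abstain.
    accuracy = sum(v == g for v, g in fired) / len(fired) if fired else 0.0
    return coverage, accuracy

def select_lfs(label_matrix, gold, min_accuracy=0.7, min_coverage=0.1):
    """Keep the column indices of LFs that pass both thresholds."""
    kept = []
    for j in range(len(label_matrix[0])):
        column = [row[j] for row in label_matrix]
        coverage, accuracy = lf_stats(column, gold)
        if accuracy >= min_accuracy and coverage >= min_coverage:
            kept.append(j)
    return kept

# Toy validation set: 5 hand labeled rows, 3 candidate LFs from the "zoo".
gold = [1, 0, 1, 0, 1]
label_matrix = [
    [1, 1, -1],
    [0, 1, -1],
    [1, 1, -1],
    [-1, 1, -1],
    [1, 1, 0],
]
kept = select_lfs(label_matrix, gold)
print(kept)
```

Here the first LF is accurate whenever it fires, the second always votes 1 and is right only 60% of the time, and the third fires once and is wrong, so only the first is kept. Inspecting the rejected LFs is often where gaps in domain understanding show up.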
  • 21. Conclusion & Future Direction
  • 22. 🏗 Conclusion ● Shift to Data Centric AI ● Weak Supervision for programmatic data labeling ○ 1⃣ Writing Labeling Functions (LFs) ○ 2⃣ Combining them with Label Model (LM) ○ 3⃣ Training a Downstream End Model (EM) ○ 4⃣ Iterate! ● Weak Supervision frameworks ○ 🔧 WRENCH ○ Snorkel ○ 📐 COSINE ○ 🔎 Heuristic LF selection
  • 23. 🏗 Future Direction ● More research into augmenting domain knowledge LFs with automated LFs ○ Want To Reduce Labeling Cost? GPT-3 Can Help [Paper] [Github] ○ X-Class: Text Classification with Extremely Weak Supervision [Paper] [Github] ○ OptimSeed: Seed Word Selection for Weakly-Supervised Text Classification with Unsupervised Error Estimation [Paper] [Github] ● The rise of UI based tools, since Weak Supervision relies heavily on SMEs who may not be coding experts! ○ 🌟 Open Source: Rubrix ○ 💰 Commercial: Snorkel Flow ($1 Billion at work!)
  • 24. 📚 Resources ● Medium Post that this talk is based on: Link ● Snorkel Tutorials: Snorkel Website ● Collection of resources on Data Centric AI: Link ● Cool Icons: Flaticon ● Papers: Arxiv ● O’Reilly Book: Link