Handling problem of hand-labeled training data with data programming and weak supervision
Presented at Confitura 2019 by Rafał Wojdan from Sotrender
DSS 2019: Handling the problem of hand-labeled training data with data programming (Rafal Wojdan)
Preparing training data has always been a problem in machine learning, and it is becoming an even bigger one as omnipresent deep learning has an infinite hunger for data. The problem can be partially mitigated with transfer learning and semi-supervised or active learning; quite recently, however, a new player has entered the stage: weak supervision. Weak supervision, along with the data programming paradigm, allows training data to be prepared and modelled from labels coming from domain heuristics, existing ground-truth data, "weak" classifiers, and even unreliable non-expert annotators such as crowdsourcing workers. The presentation introduces and explains these new approaches.
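The data-programming idea described above can be sketched in a few lines of plain Python: instead of hand-labelling each example, small labelling functions encode domain heuristics and vote on a label, producing a label matrix that a label model can later denoise. The function names, label values, and heuristics below are illustrative inventions, not the API of any particular weak-supervision library:

```python
# Minimal sketch of labelling functions (LFs) over text examples.
# Each LF votes SPAM, HAM, or ABSTAIN; stacking the votes gives a
# label matrix with one row per example and one column per LF.

ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_contains_link(text):
    # Heuristic: messages with links are often spam.
    return SPAM if "http" in text else ABSTAIN

def lf_mentions_prize(text):
    # Heuristic: "prize" is a strong spam signal.
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_greeting(text):
    # Heuristic: very short messages tend to be benign.
    return HAM if len(text.split()) <= 3 else ABSTAIN

LFS = [lf_contains_link, lf_mentions_prize, lf_short_greeting]

def label_matrix(texts):
    """Apply every LF to every example -> one row of votes per example."""
    return [[lf(t) for lf in LFS] for t in texts]

rows = label_matrix(["Claim your PRIZE at http://x.co", "hi there"])
# first row: two SPAM votes and one abstain; second row: one HAM vote
```

Writing three such functions takes minutes, whereas hand-labelling enough examples to teach a model the same three heuristics could take days.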
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor Analysis (InfluxData)
Ezako is a startup specializing in time series analysis. Ezako helps its clients detect anomalies and label their time series data, accelerating the labeling process and analyzing vast amounts of data from a variety of sensors in real time. The company provides anomaly insights and makes data scientists' work easier. Ezako is the creator of Upalgo, a time series data management tool that uses AI to automatically detect anomalies in streaming data.
During this webinar, Ezako will dive into how high-frequency sensors can generate huge amounts of data which can become desynchronized. This can lead to data quality issues, as the data can contain errors and glitches. Ezako uses machine learning, labelling, and feedback loops to identify these errors. Discover how the company helps improve its clients’ data quality and reduce the number of validation mistakes.
Keynote WFIoT2019 - Data Graphs, Knowledge Graphs, Ontologies, Internet of Things (Amélie Gyrard)
Keynote “Trends on Data Graphs & Security for the Internet of Things”
(Extended Version) #WF-IoT World Forum Internet of Things
Workshop on #Security and #Privacy for #InternetofThings and Cyber-Physical Systems #CPS
#Security #Toolbox #Attacks and #Countermeasures #STAC
#Security #KnowledgeGraphs #Ontologies
Speaker: Dr. Ghislain Atemezing (Research & Development Director, MONDECA, Paris, France) @gatemezing
Credits: Dr. Amelie Gyrard (Kno.e.sis, Wright State University, Ohio, USA)
Buying a Ferrari for your teenager? You may want to think twice (Al Zindiq)
Data science teams have different levels of maturity, and they need to be equipped with the right tools and infrastructure to make them more agile and ready. Here, I will be discussing a combination of open-source tools and cloud managed services that can go hand in hand and grow with your data science team's needs as it matures.
Microsoft Data Science Technologies: Back Office Edition (Mark Tabladillo)
Microsoft provides several technologies in and around SQL Server which can be used for casual to serious data science. This presentation provides an authoritative overview of five major options: SQL Server Analysis Services, Excel Add-in for SSAS, Semantic Search, Microsoft Azure Machine Learning, and F#. Also included are tips on working with Python and R. These technologies have been used by the presenter in various companies and industries. This presentation will emphasize the back office story for supporting big data processing.
Internet (Intelligence) of Things (IoT) with Drupal (Prateek Jain)
Talks about some applications already in the IoT space, and the potential growth and impact IoT will have in the next few years, taking Nube as a case study.
Also talks about how to build your own end-to-end IoT solution using open hardware like Raspberry Pi, a cloud platform, and Drupal.
Software Engineering Undergraduate Course Presentations
Software Engineering Principles
University of Vale do Itajaí
Univali
Incremental Tecnologia
English version
Dave Karow, Split. Powering Progressive Delivery With Data (IT Arena)
Dave has three decades of experience in developer tools, developer communities, and evangelizing sustainable software delivery practices. He has held programming, product management, and product marketing roles at Sun Microsystems, Gupta Technologies, Remedy Software, Marimba, Keynote Systems (Dynatrace), SOASTA, and BlazeMeter. Dave’s current passion is demystifying progressive delivery, especially the ways it enables better outcomes by removing constraints and building in feedback loops.
Speech Overview:
We build pipelines to automate processes and minimize human toil. Have you applied this same approach to how you expose new features to users and measure the impact? Progressive delivery may be relatively new as a term, but the underlying practices of progressive experimentation (where features are gradually rolled out to users and statistical engines are used to detect impact during the rollout instead of after) are not.
We’ll discuss the layers, and benefits, from the foundation up.
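The gradual rollout described in the overview is commonly implemented with deterministic hash bucketing: each user lands in a stable bucket, and the feature is shown only to buckets below the current rollout percentage. The sketch below is a generic illustration of that technique, not Split's actual implementation:

```python
# Percentage rollout via stable hash bucketing. Bucketing is
# deterministic, so a user keeps the same experience as the rollout
# percentage grows, and the 20% audience is a subset of the 50% one.

import hashlib

def bucket(user_id, feature):
    """Map (feature, user) to a stable bucket in 0..99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def is_enabled(user_id, feature, rollout_percent):
    """Expose the feature only to users below the rollout percentage."""
    return bucket(user_id, feature) < rollout_percent
```

Because the bucket is keyed on both feature and user, different features slice the user base independently, which is what lets impact be measured per feature during the rollout.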
Accelerating the ML Lifecycle with an Enterprise-Grade Feature Store (Databricks)
Productionizing real-time ML models poses unique data engineering challenges for enterprises that are coming from batch-oriented analytics. Enterprise data, which has traditionally been centralized in data warehouses and optimized for BI use cases, must now be transformed into features that provide meaningful predictive signals to our ML models.
Flink vs. Spark: this is the slide deck of my talk at the 2015 Flink Forward conference in Berlin, Germany, on October 12, 2015. In this talk, we tried to compare Apache Flink and Apache Spark with a focus on real-time stream processing. Your feedback and comments are much appreciated.
Microformats allow you to extend the limited semantics of HTML, thus allowing for a richer web of data. By embedding microformats into HTML it is possible to use the semantic meaning to extract unambiguous data from the web that can then be used in other applications. Microformats focus on how people are already publishing their data online. This session focuses on the more technical aspects of microformats, how they interact with other Semantic Web technologies such as GRDDL and SPARQL, along with a look at browser plugins to detect microformats and browser integration.
Co Speaker: Cheryl Biswas
Talk Description:
How about this: a blue team talk given by red teamers. But here’s our rationale - your best defence right now is a strategic offence. The rules of the game have changed and we need to get defence up to speed.
We’ll show you what the key elements are in a good defence strategy; what you can and need to be using to full advantage. We’ll talk about the new “buzzwords” and how they apply: visibility; patterns; big data. There’s a whole lotta data to wrangle, and you aren’t seeing the whole picture if you aren’t doing things right. Threat intel is about getting the big picture as it applies to you. You’ll learn the importance of context and prioritization so that you can manipulate intel feeds to do your bidding. And then we’ll take things further and talk about hunting the adversary, using an update on proven methodologies.
We’ll show you how to understand your data, correlate threats and pinpoint attacks. Attendees will leave with a new understanding of the resources they have on hand, and how to leverage those into an Adaptive Proactive Defense Strategy.
Working Software Over Comprehensive Documentation (Andrii Dzynia)
Each of us has seen this point of the Agile Manifesto dozens of times: on the official Agile Manifesto website, in books and articles, at trainings and conferences. It sounds right, obvious, and simple, but in practice certain difficulties arise in implementing it. How do you determine which documents need to be written and which do not? How do you maintain documents with the least effort? Which documents should be dropped or replaced with simpler solutions? What should a tester, a developer, and a business analyst document in Agile projects in order to present the results of their work? I will try to answer all these questions in my talk, backing them up with examples that you can try to apply in your own projects.
In today's increasingly digitalised world, software defects are enormously expensive. In 2018, the Consortium for IT Software Quality reported that software defects cost the global economy $2.84 trillion and affected more than 4 billion people. The average annual cost of software defects to Australian businesses is A$29 billion. Thus, failure to eliminate defects in safety-critical systems could result in serious injury, loss of life, and disasters. Traditionally, software quality assurance activities like testing and code review are widely adopted to discover software defects in a software product. However, ultra-large-scale systems such as Google's can consist of more than two billion lines of code, so exhaustively reviewing and testing every single line of code isn't feasible with limited time and resources. This project aims to create technologies that enable software engineers to produce the highest-quality software systems at the lowest operational cost. To achieve this, the project will build an end-to-end explainable AI platform to (1) understand the nature of critical defects; (2) predict and locate defects; (3) explain and visualise the characteristics of defects; (4) suggest potential patches to automatically fix defects; and (5) integrate the platform as a GitHub bot plugin.
Deep Learning - Hype, Reality and Applications in Manufacturing (Adam Cook)
This is the slide deck for the introductory webinar for our "Artificial Intelligence in Manufacturing" webinar and workshop series within the SME Virtual Network.
The video for this slide deck is located here: https://www.youtube.com/watch?v=orrVqOnFqds
To learn more about the SME Virtual Network and our events, please visit the following links:
https://www.facebook.com/smevirtual/
https://www.linkedin.com/company/smevirtual/
Training and deploying ML models with Google Cloud Platform (Sotrender)
Training and deploying ML models with Google Cloud Platform
In this presentation, Maciej presented some approaches, good practices, and Google Cloud components that we use at Sotrender to effectively train and deploy our machine learning models, which are used to analyze social media data. Maciej also discussed which aspects of DevOps we focus on when developing machine learning models (MLOps), and how these ideas can be easily implemented in your company or startup using Google Cloud Platform.
Presentation by Maciej Pieńkosz from Sotrender at Data Science Summit 2020
State of the art in content creation using AI (Sotrender)
How can technological developments give creativity back to the creatives? Presenters with both digital marketing and technology perspectives use original data to showcase their arguments.
Paid communication analysis on Facebook. Reach and cost estimations report. (Sotrender)
Over the years, we've developed and delivered dozens of reports for our clients, partners, and the media. From smaller, cyclical analyses to big audits or year-end reports - we love to be challenged and squeeze out everything we can from our data. We constantly look for ways to improve our algorithms and educate the market about what data can tell them and how they can use it in everyday work or in planning their strategy.
Brand image across the internet, including social media (Sotrender)
Audience Scan report based on social media data (Sotrender)
Sotrender is happy to present the 15th edition of Fanpage Trends UK, the first report analyzing brand communication on Facebook in the UK. We analysed reach, engagement, customer service, and content on Facebook in 11 key industries. What are the biggest UK Facebook Pages? Which of them are the most successful at engaging fans and followers? Read the report and broaden your perspective on the social media landscape in the United Kingdom.
Social media insights: how to extract them, and why it doesn't always make sense (Sotrender)
Social media are a gold mine of knowledge. By analyzing data about users and their behaviour, companies can consciously adjust their products, their communication strategies, and their ads and ad targeting. But does it make sense in every case?
You can't unsee this: what we learned about Poles' tastes through... (Sotrender)
Studying content and behaviour in social media can be a great way to understand the preferences, tastes, and values of young people. The conclusions help to understand Generation Y and Generation Z: their values, expectations, and likes. They can also be applied when creating advertising and communication strategies and when designing brands. As a bonus, the presentation will give pessimists plenty of evidence that the world is inevitably sliding towards collapse.
- Of the 40 most popular Polish songs on social media in 2015, most of the Top Ten is Gang Albanii; and even setting them aside, the rest still say considerably more about vodka and smoking than about love.
- All 10 of the most popular photos on Instagram are portraits of young female celebrities: half are heavily styled, the other half posed to look natural.
- The 2 most disliked YouTube channels are Michał Wiśniewski and the Prime Minister's Chancellery.
- Twitter is ruled outright by the trio of Robert Lewandowski, President Duda, and Dawid Kwiatkowski.
- And so on.
We will show places and phenomena that even researchers haven't dreamed of, and whose scale would be hard to discover any other way. We will present the people and trends shaping your children and your future customers. As a bonus, listeners will have a unique chance, at their own risk, to get to know the life and work of Honorata Skarbek and Joanna Kuchta, as well as the profile "Ruchałbym jak dzika kuna w agreście".
How often should a fanpage post? We have prepared a data-driven analysis and tips on posting frequency. We also analyzed the difference between Brands and Media on Facebook.
How brands can benefit from the Super Bowl by using social media (Sotrender)
A study showing how small and medium businesses can use social media during the Super Bowl in order to raise brand awareness and boost user engagement.
Quantitative Data Analysis: Reliability Analysis (Cronbach Alpha), Common Method Bias (2023240532)
Quantitative Data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
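The automated data quality checks above can be sketched as declarative per-field rules applied at ingestion, so bad rows are flagged at the source instead of silently passing downstream. The field names, thresholds, and rules below are invented for illustration:

```python
# Declarative validation: each field maps to a predicate; validate()
# returns the violations instead of raising, so a pipeline can route
# bad records to a quarantine queue for root cause analysis.

RULES = {
    # Hypothetical sensor schema: plausible temperature range, id format.
    "temperature_c": lambda v: isinstance(v, (int, float)) and -90 <= v <= 60,
    "sensor_id": lambda v: isinstance(v, str) and v.startswith("s-"),
}

def validate(record):
    """Return a list of (field, offending_value) pairs; empty means clean."""
    return [(field, record.get(field))
            for field, ok in RULES.items()
            if not ok(record.get(field))]
```

Keeping the rules in data (rather than scattered through pipeline code) also makes them easy to version and audit, which supports the lineage-tracking point above.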
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
11. Transfer learning - Computer Vision
[Diagram] A ResNet-50 is trained on a huge dataset for a generic objective, and its learned data representation is kept; for the target task, photos of cars are fed through that same representation and only a new classifier is trained on top, for car brand classification.
12. Transfer learning - NLP
[Diagram] The same pattern for NLP: a model pretrained on huge data supplies the data representation; customer reviews are passed through that representation, and a new classifier is trained on top for sentiment analysis.
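The pattern shared by slides 11 and 12 can be sketched in miniature: a frozen feature extractor standing in for the pretrained network, plus a small classifier head that is the only part trained on the target task. The extractor below is a toy character-count embedding, a deliberate stand-in for ResNet-50 or a language model; all names are illustrative:

```python
# Transfer-learning sketch: frozen representation + trainable head.

from collections import Counter

def frozen_extractor(text):
    """Stand-in for a pretrained representation: fixed, never trained here."""
    counts = Counter(text.lower())
    vowels = sum(counts[c] for c in "aeiou")
    alpha = sum(v for c, v in counts.items() if c.isalpha())
    digits = sum(v for c, v in counts.items() if c.isdigit())
    # 4-dim "embedding": vowels, consonants, digits, everything else
    return (vowels, alpha - vowels, digits, len(text) - alpha - digits)

class CentroidHead:
    """Tiny classifier head trained on top of the frozen features."""
    def fit(self, texts, labels):
        sums, counts = {}, {}
        for text, label in zip(texts, labels):
            feats = frozen_extractor(text)
            acc = sums.setdefault(label, [0.0] * len(feats))
            for i, v in enumerate(feats):
                acc[i] += v
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}
        return self

    def predict(self, text):
        feats = frozen_extractor(text)
        dist = lambda c: sum((a - b) ** 2 for a, b in zip(feats, c))
        return min(self.centroids, key=lambda y: dist(self.centroids[y]))
```

Only the head's centroids are fit on target data; the extractor never changes, which is exactly why transfer learning needs far fewer labeled examples than training from scratch.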
13. Weak supervision
1. Data programming instead of hand labelling
2. Functions instead of labelling guidelines
3. Multiple sources of labels instead of one ground truth
https://arxiv.org/pdf/1711.10160.pdf
17. Labeling functions vs Hand labelling
http://ai.stanford.edu/blog/weak-supervision/
● Much faster
● More flexible
● High coverage/Scalability
BUT...
● Less accurate
● Overlapping/Correlated
● Conflicting
● Noisy
18. Magic comes in: Generative model
http://cs231n.stanford.edu/slides/2018/cs231n_2018_ds07.pdf
Goal:
● Denoising labels
● Modelling accuracies and correlations
Solution:
● Generative model: P(L, Y) = P(L | Y)P(Y)
● Algorithm: asynchronous Gibbs sampling
Result:
The generative model -> a re-weighted combination of the labeling functions.
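The "re-weighted combination" result can be illustrated concretely: labelling functions estimated to be more accurate get larger log-odds weights in the vote. In the real pipeline those accuracies come from the generative model fit by Gibbs sampling; below they are fixed illustrative numbers, just to show the re-weighting step itself:

```python
# Re-weighted vote over labelling-function outputs. The accuracy
# estimates here are hard-coded for illustration; a label model would
# learn them from the label matrix without ground truth.

import math

ABSTAIN = -1

def weighted_vote(votes, accuracies):
    """Combine one example's LF votes using log-odds accuracy weights."""
    scores = {}
    for vote, acc in zip(votes, accuracies):
        if vote == ABSTAIN:
            continue
        weight = math.log(acc / (1 - acc))  # confident LFs count for more
        scores[vote] = scores.get(vote, 0.0) + weight
    return max(scores, key=scores.get) if scores else ABSTAIN

# Two barely-better-than-chance LFs say class 0; one accurate LF says 1.
# weighted_vote([0, 0, 1], [0.55, 0.55, 0.9]) returns 1, whereas plain
# majority voting would return 0.
```

This is the intuition behind the slides that follow: when LF accuracies differ a lot, the generative model beats majority voting.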
19. Majority voting vs Generative model
https://arxiv.org/pdf/1711.10160.pdf
20. Majority voting vs Generative model
https://arxiv.org/pdf/1711.10160.pdf
Condorcet's Jury Theorem
Not enough conflicts
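Condorcet's Jury Theorem, referenced above, is why majority voting is already a strong baseline: if each voter is independently correct with probability p > 0.5, a majority of many voters is correct with much higher probability. A small seeded simulation, treating labelling functions as noisy voters, checks that intuition:

```python
# Simulate n independent voters, each correct with probability p,
# and measure how often their majority recovers the true label.

import random
from collections import Counter

def majority(votes):
    return Counter(votes).most_common(1)[0][0]

def accuracy_of_majority(p, n_voters, truth=1, trials=2000, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    hits = 0
    for _ in range(trials):
        votes = [truth if rng.random() < p else 1 - truth
                 for _ in range(n_voters)]
        hits += majority(votes) == truth
    return hits / trials

# One 60%-accurate voter stays near 0.6; 25 of them together exceed 0.8.
```

The theorem's independence assumption is exactly what labelling functions violate ("not enough conflicts": overlapping, correlated LFs), which is the opening the generative model exploits.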
21. Generative vs Discriminative - reminder
http://maximustann.github.io/mach/2015/08/11/generative-vs-discriminative-models/
24. Benefits of a weak supervision system
● Scalability
● Quick training data preparation for big, complex models
● Generalization beyond labelling functions
● Simple leverage of domain expertise
25. Wrap up - Snorkel schema
https://arxiv.org/pdf/1711.10160.pdf
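The last stage of the Snorkel-style schema is training a discriminative model on the denoised labels, so the final model generalizes beyond what the labelling functions cover. In the sketch below the denoised labels are supplied directly, and the discriminative model is a tiny Naive Bayes text classifier; both are stand-ins for the real pipeline's outputs:

```python
# Discriminative step: learn from (text, denoised label) pairs and
# predict on inputs no labelling function would have matched.

from collections import Counter, defaultdict
import math

class NaiveBayes:
    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        def log_score(label):
            total = sum(self.word_counts[label].values())
            s = math.log(self.class_counts[label])
            for word in text.lower().split():
                # Laplace smoothing so unseen words don't zero out a class
                s += math.log((self.word_counts[label][word] + 1)
                              / (total + len(self.vocab)))
            return s
        return max(self.class_counts, key=log_score)

texts = ["win a free prize now", "free prize inside",
         "meeting notes attached", "see attached agenda"]
labels = ["spam", "spam", "ham", "ham"]  # imagine these came from the label model
model = NaiveBayes().fit(texts, labels)
```

Because the classifier learns word statistics rather than the heuristics themselves, it can classify texts that match no labelling function, which is the "generalization beyond labelling functions" benefit from slide 24.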
26. Data split
Topic and Product classification examples
https://arxiv.org/pdf/1812.00417.pdf
Gains of the generative and discriminative models from training on a hand-labeled dev set
27. Majority voting vs Generative model
Topic and Product classification examples
https://arxiv.org/pdf/1812.00417.pdf
28. Benefits from non-servable features (real-time event classification)
Topic and Product classification examples
https://arxiv.org/pdf/1812.00417.pdf
Hand-labelling vs weak supervision
29. Weak supervision on images - cross-modal approach
https://arxiv.org/pdf/1903.11101.pdf
30. Weak supervision on images - cross-modal approach
https://arxiv.org/pdf/1903.11101.pdf
31. Weak supervision on images - cross-modal approach
https://arxiv.org/pdf/1903.11101.pdf
32. Weak supervision on images with domain-specific primitives (DSP)
https://dawn.cs.stanford.edu/2017/09/14/coral/
33. Weak supervision on images with domain-specific primitives (DSP) - functions
https://dawn.cs.stanford.edu/2017/09/14/coral/
34. Weak supervision on images with domain-specific primitives (DSP)
https://dawn.cs.stanford.edu/2017/09/14/coral/