Daria Baidakova
Data labeling on a large scale: the missing pillar of AI
Labeled data: the missing pillar of AI
The three pillars of AI:
Algorithms: TensorFlow, PyTorch, CatBoost, etc.
Hardware: AWS, MS Azure, Google Cloud, Yandex Cloud, etc.
Data: ???
Intro
Daria Baidakova, Director of Educational Programs at Toloka
Responsible for consulting and supporting Toloka requesters in integrating crowdsourcing methodology into AI projects. She also manages crowdsourcing courses at top data analysis schools (Yandex School of Data Analysis, Y-Data, etc.) and organizes tutorials and hackathons for crowdsourcing specialists.
Co-author of four hands-on tutorials on efficient crowdsourcing (at WSDM'20, CVPR'20, SIGMOD'20, WWW'21) and a co-organizer of the crowd science workshop at NeurIPS'2020.
10 years of experience in real industry
Car Sharing
Self Driving Car
Health
Cloud
Search Engine
Voice Assistant
Browser
Weather
Ad Tech
Taxi Food Delivery
E-commerce
Personalized
Stories Feed
Mail
Storage Auto.ru
Jobs
Realty
Movies
Tickets
Music
Maps
Navi
Transport
Auto
Infrastructure for AI worldwide
Infrastructure for Search
Infrastructure for Yandex
Infrastructure for AI industry
ML production pipeline
Sample schematic block
Control in production
Training
Retraining
Validating
Tolóka — an ancient tradition
Tolokers Requesters
Intelligent platform
Toloka — an open crowdsourcing platform
Tolokers distribution
Top countries with active tolokers*
∙ Argentina ∙ Cote d'Ivoire ∙ France ∙ India ∙ Philippines ∙ Tunisia ∙ Morocco ∙ Turkey
∙ Ukraine ∙ Brazil ∙ Russia ∙ Kenya ∙ Pakistan ∙ Venezuela ∙ Egypt ∙ Mexico
∙ Nigeria ∙ Peru ∙ Portugal ∙ Spain ∙ USA ∙ Vietnam
Top languages*
∙ English ∙ Spanish ∙ Arabic ∙ Portuguese ∙ Russian ∙ Ukrainian ∙ French
∙ German ∙ Italian ∙ Polish ∙ Latvian ∙ Bulgarian ∙ Czech ∙ Turkish ∙ Hindi
∙ Vietnamese ∙ Japanese ∙ Chinese ∙ Korean ∙ Indonesian
* New regions and languages can be quickly allocated upon customer request
9+ million tolokers available across every time zone, 24 hours a day.
Use cases
• Content collection
• Data annotation
• Business decision making
Content collection
Audio collection
• Recording task: Say «Hey, Alisa!»
• Verification task: Say «Hey, Alisa!» Was the phrase correct?
• Accept, if yes
• Reject, if no
Final result: audio recording
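A minimal sketch of the accept/reject step above. The slide shows a single yes/no check; collecting several verdicts per recording and requiring a strict majority is an assumption made here for illustration, and the function name is hypothetical.

```python
def review_recording(verdicts: list[bool]) -> str:
    """Decide on one audio recording from verifier answers.

    Each verdict answers "Was the phrase correct?".
    Assumption: a strict majority of "yes" accepts; ties reject.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    yes_votes = sum(verdicts)
    return "accept" if 2 * yes_votes > len(verdicts) else "reject"


# Three verifiers listened to the same recording.
decision = review_recording([True, True, False])
print(decision)
```

Rejecting on ties keeps doubtful recordings out of the final dataset; a requester who prefers volume over precision could flip that choice.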
Spatial crowdsourcing
Update info about an organization
Pipeline for business verification
Field task:
• Comes to point;
• Takes photos
Checks (no: task declined):
• GPS ok?
• Original photos?
• Business found?
  • Is there an address on the photo?
  • Was the whole area photographed?
Rewrite from photo: name, tel, website, working hours, etc. (decline if not possible)
Computer vision for company codes and other fields
Is the photo ok to be shown on maps?
Yes: task accepted; pay
Final result: verified business
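The chain of checks in this pipeline can be sketched as a function that declines at the first failed step. All field names below are hypothetical (the slide names the checks, not a data model), and treating a photo that is not fit for maps as a decline is an assumption.

```python
# Hypothetical keys for the three blocking checks named on the slide.
BLOCKING_CHECKS = ("gps_ok", "photos_original", "business_found")


def verify_submission(submission: dict) -> tuple[str, str]:
    """Return (status, reason) for one field submission."""
    for check in BLOCKING_CHECKS:
        if not submission.get(check, False):
            return ("declined", check)
    # Name, phone, website, working hours must be readable from the photo.
    if not submission.get("fields_from_photo"):
        return ("declined", "fields_not_readable")
    # Assumption: a photo unfit for maps also declines the task.
    if not submission.get("photo_ok_for_maps", False):
        return ("declined", "photo_not_ok_for_maps")
    return ("accepted", "pay")


good = {
    "gps_ok": True,
    "photos_original": True,
    "business_found": True,
    "fields_from_photo": {"name": "Example Cafe", "tel": "000-000"},
    "photo_ok_for_maps": True,
}
print(verify_submission(good))
print(verify_submission({**good, "gps_ok": False}))
```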
Manual information collection on the Internet
Data enrichment
Data annotation
Real-life cases (task: time, cost, volume)
Side-by-side comparison: 10 min, $2.4, 1 000 tasks
Text classification: 2 hrs, $18, 1 000 tasks
Phrase generation for a chatbot: 15 min, $1, 500 phrases on the same topic
Object classification: 15 min, $1.2, 1 000 photos
Object segmentation: 5 hrs, $3.6, 1 000 objects in 100 photos
Audio transcription: 20 min, $6, 100 recordings 25 minute long
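To compare these cases, it can help to normalize them to cost per unit and throughput. The sketch below only does arithmetic on the figures listed on the slide; the data layout and the function name are illustrative.

```python
# (task, minutes, cost in USD, number of units) as listed on the slide.
CASES = [
    ("Side-by-side comparison", 10, 2.4, 1000),
    ("Text classification", 120, 18.0, 1000),
    ("Phrase generation for a chatbot", 15, 1.0, 500),
    ("Object classification", 15, 1.2, 1000),
    ("Object segmentation", 300, 3.6, 1000),
    ("Audio transcription", 20, 6.0, 100),
]


def unit_economics(cases):
    """Map each task to its cost per unit and units completed per minute."""
    return {
        name: {
            "usd_per_unit": cost / units,
            "units_per_minute": units / minutes,
        }
        for name, minutes, cost, units in cases
    }


stats = unit_economics(CASES)
for name, s in stats.items():
    print(f"{name}: ${s['usd_per_unit']:.4f}/unit, "
          f"{s['units_per_minute']:.1f} units/min")
```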
Search relevance results evaluation
Ads relevance evaluation
Object segmentation for CV
Audio transcription
NLP-related tasks
Validation of auto-translate
Moderation of UGC
Moderation of content: human-in-the-loop
ML: needed confidence level?
• Yes: final result
• No: training and selection, then "Is the content OK?", then 2nd line
Final result: moderation verdict
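The routing decision in this human-in-the-loop scheme can be sketched as a confidence threshold. The score semantics (probability that the content is OK), the 0.9 threshold, and the function name are all assumptions for illustration; the slide does not specify them.

```python
def route(ml_score: float, threshold: float = 0.9):
    """Route one content item.

    ml_score: hypothetical model probability that the content is OK.
    Returns ("auto", verdict) when the model is confident enough,
    otherwise ("crowd", None) so people answer "Is the content OK?".
    """
    confidence = max(ml_score, 1.0 - ml_score)
    if confidence >= threshold:
        return ("auto", ml_score >= 0.5)
    return ("crowd", None)


for score in (0.97, 0.60, 0.02):
    print(score, route(score))
```

Lowering the threshold sends fewer items to people at the price of more automatic mistakes, so it is usually tuned against a labeled validation sample.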
Product decision making and market research
Design Side-by-Side
Preferences
Quality on a large scale
Smart marketplace
Tolokers: rate requesters
Requesters: control the quality of Tolokers
Platform: general rating, rating of tasks
Managing process not people
Selecting performers
► Language & region
► Age
► Gender
► Task specific skills
► Device filters
Advanced training & motivation
► Educational projects
► Performers examination
► Trained performers
► Quality based pricing
Project-level controls
► Behavior control
► Anti-robotic tools
► Hidden control tasks
► Multiple users consensus
► Verification of assignments
Aggregation of results & analysis
► System-level ML
► Multiple aggregation models
► Result-based performers selection
► Real-time insights
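Two of the controls listed above, hidden control tasks and multiple users consensus, can be sketched together in plain Python. This is an illustration under stated assumptions, not Toloka's implementation: the tuple layout, the function names, and the 0.5 accuracy cutoff are hypothetical, and simple majority vote stands in for the richer aggregation models the slide mentions.

```python
from collections import Counter, defaultdict


def control_accuracy(answers, golden):
    """Share of hidden control tasks each worker answered correctly.

    answers: iterable of (worker, task, label)
    golden:  {task: correct_label} for the hidden control tasks
    """
    hits, total = Counter(), Counter()
    for worker, task, label in answers:
        if task in golden:
            total[worker] += 1
            hits[worker] += int(label == golden[task])
    return {worker: hits[worker] / total[worker] for worker in total}


def majority_vote(answers, allowed_workers=None):
    """Most frequent label per task; ties go to the label seen first."""
    votes = defaultdict(Counter)
    for worker, task, label in answers:
        if allowed_workers is None or worker in allowed_workers:
            votes[task][label] += 1
    return {task: c.most_common(1)[0][0] for task, c in votes.items()}


answers = [
    ("w1", "t1", "cat"), ("w2", "t1", "cat"), ("w3", "t1", "dog"),
    ("w1", "t2", "dog"), ("w2", "t2", "dog"), ("w3", "t2", "cat"),
    ("w1", "g1", "cat"), ("w2", "g1", "cat"), ("w3", "g1", "dog"),
]
golden = {"g1": "cat"}  # hidden control task with a known answer

accuracy = control_accuracy(answers, golden)
trusted = {w for w, acc in accuracy.items() if acc >= 0.5}  # assumed cutoff
labels = majority_vote(answers, trusted)
print(accuracy)
print(labels)
```

Filtering by control accuracy before voting is one simple way to combine the two controls; weighting each vote by the worker's accuracy is a common alternative.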
People management as an engineering task
Deal with the Crowd as with yet another computing cluster
Require minimal effort from people
Free, powerful API
Toloka open source: examples and algorithms
https://github.com/Toloka/crowd-kit
https://github.com/Toloka/toloka-kit
How to learn crowdsourcing?
https://toloka.ai/academy
Our goal is to help the industry overcome the bottleneck that data constitutes today.
Crowd Science
Lecturers are experts in crowdsourcing:
• ML developers & Crowd Solution Architects from one of the top IT companies in Europe
We share our methodology:
• based on years of research and unique industry expertise
Practice
• Lecturers with real experience developing products and working with ML and AI tasks
• Tasks with real-life applications
• Case studies from Yandex, one of Europe’s top IT companies
• The only crowdsourcing course with practice based on real, large-scale industry use cases
https://www.coursera.org/learn/practical-crowdsourcing
Our success stories
• Coursera online course
• Courses in top Swiss universities, Tel Aviv University & Y-Data
https://toloka.ai/academy
What we offer
• Talks, lectures & hands-on tutorials
• Research grants
• Student plans for data labeling projects
• Ready-to-use materials
Other ideas? Let’s discuss!
Resources
Toloka Academy
Seminar series on crowdsourcing and beyond
Toloka Github
Coursera Course:
Practical Crowdsourcing for ML
Try Toloka
https://toloka.ai/
Real-time Demo
10 mins
Daria Baidakova
Director of Educational Programs
Thank you!
https://www.linkedin.com/in/dbaidakova/
https://toloka.ai/

Practical Crowdsourcing for ML at Scale

  • 1. Daria Baidakova Data labeling on a large scale: the missing pillar of AI
  • 2. Labeled data: the missing pillar of AI. Algorithms: TensorFlow, PyTorch, CatBoost, etc. Hardware: AWS, MS Azure, Google Cloud, Yandex Cloud, etc. Data: ???
  • 3. Intro. Daria Baidakova, Director of Educational Programs at Toloka. Responsible for consulting and supporting Toloka requesters in integrating crowdsourcing methodology in AI projects. She also manages crowdsourcing courses at top data analysis schools (Yandex School of Data Analysis, Y-Data, etc.) and organizes tutorials and hackathons for crowdsourcing specialists. Co-author of four hands-on tutorials on efficient crowdsourcing (at WSDM'20, CVPR'20, SIGMOD'20, WWW'21) and a co-organizer of the crowd science workshop at NeurIPS'2020.
  • 4. 10 years' experience in real industry: Car Sharing, Self Driving Car, Health, Cloud, Search Engine, Voice Assistant, Browser, Weather, Ad Tech, Taxi, Food Delivery, E-commerce, Personalized Stories Feed, Mail, Storage, Auto.ru, Jobs, Realty, Movies, Tickets, Music, Maps, Navi, Transport, Auto
  • 5. Infrastructure for AI worldwide: Infrastructure for Search; Infrastructure for Yandex; Infrastructure for AI industry
  • 6. ML production pipeline (sample schematic block): Control in production; Training; Retraining; Validating
  • 7. Tolóka: an ancient tradition
  • 8. Toloka, an open crowdsourcing platform: Tolokers; Requesters; Intelligent platform
  • 9. Tolokers distribution. Top countries with active tolokers*: Argentina, Cote d'Ivoire, France, India, Philippines, Tunisia, Morocco, Turkey, Ukraine, Brazil, Russia, Kenya, Pakistan, Venezuela, Egypt, Mexico, Nigeria, Peru, Portugal, Spain, USA, Vietnam. Top languages*: English, Spanish, Arabic, Portuguese, Russian, Ukrainian, French, German, Italian, Polish, Latvian, Bulgarian, Czech, Turkish, Hindi, Vietnamese, Japanese, Chinese, Korean, Indonesian. 9+ million tolokers available across every time zone, 24 hours a day. (* New regions and languages can be quickly allocated upon customer request.)
  • 14. Say «Hey, Alisa!». Check: Was the phrase correct? Accept if yes; reject if no. Final result: audio recording
  • 16. Update info about an organization
  • 18. Pipeline for business verification. Step 1: a toloker comes to the point and takes photos. Step 2, checks: GPS ok? Original photos? Business found? Is there an address on the photo? Was the whole area photographed? If any answer is no, the task is declined. Step 3: rewrite from photo (name, tel, website, working hours, etc.), decline if not possible; computer vision for company codes and other fields. Step 4: Is the photo ok to be shown on maps? If yes, the task is accepted and paid. Final result: verified business
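The verification pipeline on this slide can be read as a chain of yes/no checks in which the first failure declines the task. The following is only a minimal sketch of that reading, not Toloka's implementation: the `Submission` fields, the `CHECKS` list, and the `verify` function are hypothetical names, and treating every failed check as a decline is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    # Hypothetical fields, one per question on the slide.
    gps_ok: bool
    photos_original: bool
    business_found: bool
    address_on_photo: bool
    whole_area_photographed: bool
    fields_rewritten: bool  # name, tel, website, working hours, etc.
    ok_for_maps: bool


# Ordered checks: (question from the slide, predicate on the submission).
CHECKS = [
    ("GPS ok?", lambda s: s.gps_ok),
    ("Original photos?", lambda s: s.photos_original),
    ("Business found?", lambda s: s.business_found),
    ("Is there address on the photo?", lambda s: s.address_on_photo),
    ("Was the whole area photographed?", lambda s: s.whole_area_photographed),
    ("Rewrite from photo possible?", lambda s: s.fields_rewritten),
    ("Is the photo ok to be shown on maps?", lambda s: s.ok_for_maps),
]


def verify(submission):
    """Return ("accepted", None) or ("declined", first failed question)."""
    for question, check in CHECKS:
        if not check(submission):
            # Assumption: any failed check declines the task.
            return ("declined", question)
    return ("accepted", None)


good = Submission(True, True, True, True, True, True, True)
bad = Submission(True, True, False, True, True, True, True)
print(verify(good))
print(verify(bad))
```

Keeping the checks in an ordered list means the cheap automatic ones (GPS, photo originality) can run before the checks that need human judgment.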
  • 22. Real-life cases (task: time, cost, volume). Side-by-side comparison: 10 min, $2.4, 1 000 tasks. Text classification: 2 hrs, $18, 1 000 tasks. Phrase generation for a chatbot: 15 min, $1, 500 phrases on the same topic. Object classification: 15 min, $1.2, 1 000 photos. Object segmentation: 5 hrs, $3.6, 1 000 objects in 100 photos. Audio transcription: 20 min, $6, 100 recordings 25 minutes long
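To compare these cases on a common scale, the slide's figures can be turned into cost per item and items per minute. The sketch below only restates the slide's numbers; the `CASES` table layout and the helper names `unit_cost` and `throughput` are illustrative assumptions.

```python
# (task, minutes, cost in USD, number of items), taken from the slide.
CASES = [
    ("Side-by-side comparison", 10, 2.4, 1000),
    ("Text classification", 120, 18.0, 1000),
    ("Phrase generation for a chatbot", 15, 1.0, 500),
    ("Object classification", 15, 1.2, 1000),
    ("Object segmentation", 300, 3.6, 1000),
    ("Audio transcription", 20, 6.0, 100),
]


def unit_cost(cost_usd, items):
    """Cost of one labeled item in USD."""
    return cost_usd / items


def throughput(items, minutes):
    """Labeled items per minute of wall-clock time."""
    return items / minutes


for task, minutes, cost, items in CASES:
    print(f"{task}: ${unit_cost(cost, items):.4f} per item, "
          f"{throughput(items, minutes):.1f} items/min")
```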
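  • 31. Moderation of content: human-in-the-loop. ML makes a prediction. Needed confidence level? Yes: final result (moderation verdict). No: the content goes to tolokers (training and selection), who answer "Is the content OK?", with a 2nd line of review before the final result: moderation verdict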
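The human-in-the-loop scheme on this slide amounts to confidence-based routing: the model decides alone when it is confident enough, and people decide otherwise. Here is a minimal sketch under stated assumptions: `ml_score` is taken to be the model's probability that the content is OK, the 0.9 threshold is an arbitrary illustration, and `route` is a hypothetical helper rather than part of any Toloka API.

```python
def route(ml_score, threshold=0.9):
    """Decide who produces the moderation verdict.

    ml_score: assumed model probability that the content is OK (0..1).
    threshold: assumed minimum confidence for an automatic verdict.
    """
    # Confidence in whichever class the model prefers.
    confidence = max(ml_score, 1.0 - ml_score)
    if confidence >= threshold:
        return "auto_ok" if ml_score >= 0.5 else "auto_reject"
    # Not confident enough: ask people "Is the content OK?"
    return "crowd"


for score in (0.97, 0.02, 0.6):
    print(score, route(score))
```

Raising the threshold sends more content to people and lowers the share of automatic verdicts, which is the main cost/quality knob in this kind of setup.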
  • 32. Product decision making and market research
  • 35. Quality on a large scale
  • 36. Smart marketplace. Tolokers: rate requesters. Requesters: control the quality of tolokers. Platform: general rating, rating of tasks
  • 37. Managing process not people
  • 38. Managing process not people. Selecting performers: ► Language & region ► Age ► Gender ► Task specific skills ► Device filters
  • 39. Managing process not people. Selecting performers: ► Language & region ► Age ► Gender ► Task specific skills ► Device filters. Advanced training & motivation: ► Educational projects ► Performers examination ► Trained performers ► Quality based pricing
  • 40. Managing process not people. Selecting performers: ► Language & region ► Age ► Gender ► Task specific skills ► Device filters. Advanced training & motivation: ► Educational projects ► Performers examination ► Trained performers ► Quality based pricing. Project-level controls: ► Behavior control ► Anti-robotic tools ► Hidden control tasks ► Multiple users consensus ► Verification of assignments
  • 41. Managing process not people. Selecting performers: ► Language & region ► Age ► Gender ► Task specific skills ► Device filters. Advanced training & motivation: ► Educational projects ► Performers examination ► Trained performers ► Quality based pricing. Project-level controls: ► Behavior control ► Anti-robotic tools ► Hidden control tasks ► Multiple users consensus ► Verification of assignments. Aggregation of results & analysis: ► System-level ML ► Multiple aggregation models ► Result-based performers selection ► Real-time insights
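Two of the controls listed above, hidden control tasks and multiple users consensus, are easy to illustrate together. The sketch below is a plain-Python illustration, not the platform's own aggregation: `worker_accuracy`, `majority_vote`, the `(task, worker, label)` tuple layout, and the 0.8 trust cutoff are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def worker_accuracy(answers, golden):
    """Share of correct answers per worker on hidden control tasks.

    answers: iterable of (task, worker, label) tuples.
    golden: dict mapping control task -> known correct label.
    """
    correct, total = Counter(), Counter()
    for task, worker, label in answers:
        if task in golden:
            total[worker] += 1
            correct[worker] += int(label == golden[task])
    return {worker: correct[worker] / total[worker] for worker in total}


def majority_vote(answers, trusted=None):
    """Most frequent label per task, optionally counting only trusted workers.

    Ties are broken arbitrarily (first label seen among the most frequent).
    """
    votes = defaultdict(Counter)
    for task, worker, label in answers:
        if trusted is None or worker in trusted:
            votes[task][label] += 1
    return {task: counts.most_common(1)[0][0] for task, counts in votes.items()}


answers = [
    ("t1", "w1", "cat"), ("t1", "w2", "cat"), ("t1", "w3", "dog"),
    ("t2", "w1", "dog"), ("t2", "w2", "dog"), ("t2", "w3", "cat"),
    ("g1", "w1", "cat"), ("g1", "w2", "cat"), ("g1", "w3", "dog"),
]
golden = {"g1": "cat"}  # hidden control task with a known answer

accuracy = worker_accuracy(answers, golden)
trusted = {w for w, acc in accuracy.items() if acc >= 0.8}  # assumed cutoff
print(accuracy)
print(majority_vote(answers, trusted))
```

Majority vote is the simplest possible aggregation model; weighting votes by estimated worker skill is a common refinement.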
  • 42. People management as an engineering task: deal with the crowd as with yet another computing cluster; require minimal effort from people
  • 43. People management as an engineering task: deal with the crowd as with yet another computing cluster; require minimal effort from people; free, powerful API
  • 44. Toloka open source: examples and algorithms https://github.com/Toloka/crowd-kit https://github.com/Toloka/toloka-kit
  • 46. https://toloka.ai/academy Our goal is to help the industry overcome the bottleneck that data constitutes today.
  • 47. Crowd Science. Lecturers are experts in crowdsourcing: ML developers & Crowd Solution Architects from one of the top IT companies in Europe. We share our methodology, based on years of research and unique industry expertise
  • 48. Practice. Lecturers with real experience developing products and working with ML and AI tasks; tasks with real-life applications; case studies from Yandex, one of Europe’s top IT companies; the only crowdsourcing course with practice based on real, large-scale industry use-cases
  • 49. Our success stories: Coursera online course (https://www.coursera.org/learn/practical-crowdsourcing); courses in top Swiss universities, Tel-Aviv University & Y-Data
  • 50. What we offer: talks, lectures & hands-on tutorials; research grants; student plans for data labeling projects; ready-to-use materials. Other ideas? Let’s discuss! https://toloka.ai/academy
  • 51. Resources: Toloka Academy; seminar series on crowdsourcing and beyond; Toloka Github; Coursera course: Practical Crowdsourcing for ML; Try Toloka: https://toloka.ai/
  • 53. Thank you! Daria Baidakova, Director of Educational Programs. https://www.linkedin.com/in/dbaidakova/ https://toloka.ai/