Confidential Material – Chegg Inc. © 2005 - 2016. All Rights Reserved.© 2005 – 2017 by Chegg Inc. All Rights Reserved. 1
Natural Language Comprehension: Human Machine Collaboration.
Sanghamitra Deb, Data Scientist
Gabriela Brown, Summer Intern
Chegg
What is Chegg Tutors?
Unstructured data in Business
Dark Data at Chegg
Chats between tutors and students
Chegg Study Q&A
Bringing light to Dark Data
DeepDive and Snorkel process such documents from the public and dark web to extract evidential data, such as names, addresses, phone numbers, job types, job requirements, information about rates of service, etc.
Wikipedia extractions
Detecting Online Sex Trafficking
Professor Chris Ré
Student looking for tutors
"I need a 10 page essay written on the deforestation of the amazon rainforest. must have 7 resources."
Student intents: Fraud
Examples:
• Do my homework
• Take online quiz for me
• Do my scheduled take home exam
Universities typically have strict honor policies, stating that homework, exams, take-home assignments, etc. should be completed by the student without any external help.
A small number of students come to the platform to get their homework done or to ask someone to take their exam for them. This is a strict violation of the honor code.
Typical NLP Machine Learning Flow
High-performing machine learning models could require 100,000 labelled examples!!
Traditional Feature Engineering
Winning solution!!
Feature Engineering
Generating a training set
• Human reading and labeling
• Several hundred expert hours
• Difficult to scale with evolving business questions
The snorkel pipeline
Human Machine Collaboration
[Diagram: collaboration loop between Data Scientist, Product/Business and SME]
• Knowledge transfer: what is important to product and business
• Create filters
• Create rules
• Iterate: language, business needs and teams evolve
• Redefine filters
• Redefine rules
Replaces manual generation of labelled data
Automated Features
Creating Filters: Candidate Extraction
Creating Filters: Candidate Extraction
Observing the candidates
Humans/SMEs look at ~100-200 of the candidates and label them.
Creating Rules
Example: "I will pay someone to write my essay."
Rule: reference to the tutor + verb followed by "my"
Result: this is an honor code violation ✓
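A rule like the one above could be written as a small Python function. This is a minimal sketch rather than the code used in the talk: the function name `lf_ask_tutor_to_do_my_work`, the word lists and the regular expression are assumptions made for illustration. Only the example sentence and the 1 / 0 / -1 label values come from the slides, and mapping 1 to "honor code violation" is also an assumption.

```python
import re

# Label values as used later in the deck: 1 = Class 1, 0 = unlabelled, -1 = Class 2.
# Assumption: Class 1 stands for "honor code violation".
VIOLATION, ABSTAIN = 1, 0

# Hypothetical word lists; a real rule set would be tuned together with SMEs.
TUTOR_REFERENCES = r"(?:someone|somebody|anyone|tutor|you)"
TASK_VERBS = r"(?:write|do|take|complete|finish|solve)"

# Reference to the tutor, up to three filler words, then a verb followed by "my",
# e.g. "someone to write my".
PATTERN = re.compile(
    rf"\b{TUTOR_REFERENCES}\b(?:\s+\w+){{0,3}}?\s+{TASK_VERBS}\s+my\b",
    re.IGNORECASE,
)


def lf_ask_tutor_to_do_my_work(text: str) -> int:
    """Vote VIOLATION when the pattern matches, otherwise abstain."""
    return VIOLATION if PATTERN.search(text) else ABSTAIN


print(lf_ask_tutor_to_do_my_work("I will pay someone to write my essay."))  # 1
print(lf_ask_tutor_to_do_my_work("Can you explain photosynthesis to me?"))  # 0
```

The rule abstains instead of voting -1 when it does not match, so that other rules can still decide those candidates.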
What do the rule functions look like?
Several tens of rules create the training set.
The rules are judged against the labels provided by humans or SMEs.
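Judging a rule against the hand labels might look like the sketch below. It assumes each rule is a function that returns 1, 0 or -1, and it reports two numbers that are not named in the slides: coverage (the share of candidates where the rule votes) and accuracy (agreement with the human label when it votes). The two toy rules, the helper name `judge_rule` and the four labelled texts are invented for illustration.

```python
from typing import Callable, Dict, List


def judge_rule(rule: Callable[[str], int], texts: List[str], gold: List[int]) -> Dict[str, float]:
    """Compare one rule's votes with human labels on a small labelled sample."""
    votes = [rule(t) for t in texts]
    # Keep only the candidates where the rule did not abstain (vote != 0).
    fired = [(v, g) for v, g in zip(votes, gold) if v != 0]
    coverage = len(fired) / len(texts) if texts else 0.0
    accuracy = sum(v == g for v, g in fired) / len(fired) if fired else 0.0
    return {"coverage": coverage, "accuracy": accuracy}


# Hypothetical rules: 1 = Class 1, -1 = Class 2, 0 = abstain.
def rule_do_my(text: str) -> int:
    return 1 if "do my" in text.lower() else 0


def rule_explain(text: str) -> int:
    return -1 if "explain" in text.lower() else 0


# Toy sample standing in for the ~100-200 candidates labelled by humans/SMEs.
TEXTS = [
    "Do my homework",
    "Please explain derivatives",
    "Take online quiz for me",
    "Can you do my lab and explain it",
]
GOLD = [1, -1, 1, 1]

print(judge_rule(rule_do_my, TEXTS, GOLD))    # {'coverage': 0.5, 'accuracy': 1.0}
print(judge_rule(rule_explain, TEXTS, GOLD))  # {'coverage': 0.5, 'accuracy': 0.5}
```

A rule with low accuracy on the labelled sample is a candidate for rewriting or removal before it contributes to the training set.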
Developing the training set: one rule
[Bar chart of the training set by label, axis from 0 to 200. 1: Class 1; 0: unlabelled data; -1: Class 2]
Developing the training set: four rules
[Bar chart of the training set by label, axis from 0 to 200. 1: Class 1; 0: unlabelled data; -1: Class 2]
Developing the training set: eight rules
[Bar chart of the training set by label, axis from 0 to 200. 1: Class 1; 0: unlabelled data; -1: Class 2]
Developing the training set: one rule
[Bar chart of the training set by label, axis from 0 to 200. 1: Class 1; 0: unlabelled data; -1: Class 2]
Developing the training set: twenty rules
[Bar chart of the training set by label, axis from 0 to 200. 1: Class 1; 0: unlabelled data; -1: Class 2]
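One way to reproduce this kind of label count for a growing rule set is sketched below. It combines the rule votes for each candidate with a plain majority vote, which is a simplification chosen for illustration and not how Snorkel itself weighs the rules. The rules, the helper names and the three texts are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, List


def combine_votes(votes: Iterable[int]) -> int:
    """Majority vote over 1 / 0 / -1 votes; ties and all-abstain give 0."""
    total = sum(votes)
    return (total > 0) - (total < 0)


def label_distribution(rules: List[Callable[[str], int]], texts: List[str]) -> Counter:
    """Count how many candidates end up labelled 1, 0 or -1."""
    return Counter(combine_votes(rule(t) for rule in rules) for t in texts)


# Hypothetical rules: 1 = Class 1, -1 = Class 2, 0 = abstain.
def rule_do_my(text: str) -> int:
    return 1 if "do my" in text.lower() else 0


def rule_explain(text: str) -> int:
    return -1 if "explain" in text.lower() else 0


TEXTS = ["Do my homework", "Please explain this proof", "Is anyone available tonight?"]

print(label_distribution([rule_do_my], TEXTS))                # one rule
print(label_distribution([rule_do_my, rule_explain], TEXTS))  # two rules
```

With the single rule, two of the three toy candidates stay at 0; adding the second rule labels one more of them.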
Performance
Evaluation Metrics
Positive accuracy: 68.3%
Negative accuracy: 90.7%
Precision: 71.8%
Recall: 68.3%
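The slides do not define these metrics. Since positive accuracy and recall are both 68.3%, a plausible reading is that positive accuracy is the share of true positives that were found and negative accuracy is the share of true negatives that were correctly rejected. The sketch below computes the four numbers under that assumption; the function name and the confusion-matrix counts are made up and do not come from the talk.

```python
from typing import Dict


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluation_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Metrics from confusion-matrix counts (assumed definitions, see lead-in)."""
    return {
        "positive_accuracy": _ratio(tp, tp + fn),  # same formula as recall
        "negative_accuracy": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),
    }


# Made-up counts, only to show the call.
metrics = evaluation_metrics(tp=3, fp=1, fn=1, tn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```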
Production: Iterations
• Snorkel code runs on the opportunities sent the day before; humans check the list and update a file with real honor code violations.
• After doing unsupervised learning (topic modeling, word2vec) on the positive and negative HCVs from human-generated data, the rules are changed to improve positive accuracy.
In dynamic two-sided marketplaces, language and behavior change continuously, hence iterating every 3-4 months keeps the model fresh.
Generalization: Matching Problem for Chegg Tutors
• Feature generation for student-tutor pairs.
• Chegg Tutors is a two-sided marketplace with students and tutors being paired based on their overlapping characteristics.
• Generating features is an important part of creating this recommendation system. Snorkel helps generate key phrases associated with student-tutor pairs.
Behind training set generation: PGMs
https://arxiv.org/pdf/1605.07723.pdf
Model the rules as independent, similar to Naïve Bayes.
Consider interdependencies between the rules: similar, fix, reinforce, exclude.
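The independent, Naïve Bayes-like view can be illustrated with a short calculation. The sketch assumes that each rule has a known accuracy, that abstaining (a vote of 0) carries no information, and that both classes are equally likely beforehand. Under those assumptions each vote adds or subtracts the log-odds of its rule's accuracy. The function name and the accuracies are hypothetical, and supplying accuracies by hand is a simplification: the referenced paper estimates them from the votes themselves and can also model dependencies between rules.

```python
import math
from typing import Sequence


def posterior_positive(votes: Sequence[int], accuracies: Sequence[float]) -> float:
    """P(label = 1 | votes), treating rules as independent voters.

    votes:      1, 0 or -1 per rule (0 = abstain)
    accuracies: assumed probability that each rule is right when it votes,
                strictly between 0 and 1
    """
    # A vote of +1 adds log(a / (1 - a)), a vote of -1 subtracts it, 0 adds nothing.
    log_odds = sum(v * math.log(a / (1.0 - a)) for v, a in zip(votes, accuracies))
    return 1.0 / (1.0 + math.exp(-log_odds))


# A strong rule votes 1 and a weak rule votes -1: the strong rule wins.
print(round(posterior_positive([1, -1], [0.9, 0.6]), 3))  # 0.857
# Every rule abstains: the posterior stays at the 50/50 prior.
print(posterior_positive([0, 0], [0.9, 0.6]))  # 0.5
```

The resulting probabilities can serve as soft training labels instead of hard 1 / -1 values.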
Noisy sources of truth
Credit: https://hazyresearch.github.io/snorkel/
Generalization
Thank you
sdeb@chegg.com
@sangha_deb
Stanford NLP & Tools
Intent of Honor Code Violation
Opportunities
Other business questions:
• Are students and tutors having classes offsite?
• Are tutors committing fraud?
• Are students/tutors using offensive language?
• Do students want lessons immediately, or are they willing to wait? …
Other datasets: Chat

Natural Language Comprehension: Human Machine Collaboration.

  • 1. Confidential Material – Chegg Inc. © 2005 - 2016. All Rights Reserved.© 2005 – 2017 by Chegg Inc. All Rights Reserved. 1 Natural Language Comprehension: Human Machine Collaboration. Sanghamitra Deb, Data Scientist Gabriela Brown, Summer Intern
  • 2. Example Slide Chegg Inc. © 2005 – 2017. All Rights Reserved.2 Chegg
  • 3. Example Slide Chegg Inc. © 2005 – 2017. All Rights Reserved.3 What is Chegg Tutors?
  • 4. Example Slide Chegg Inc. © 2005 – 2017. All Rights Reserved.4 Unstructured data in Business
  • 5. Example Slide Chegg Inc. © 2005 – 2017. All Rights Reserved.5 Dark Data at Chegg Chats between tutors and students Chegg Study Q&A
  • 6. Example Slide Chegg Inc. © 2005 – 2017. All Rights Reserved.6 Bringing light to Dark Data DeepDive and snorkel processes such documents from public and dark web to extract evidential data, such as names, addresses, phone numbers, job types, job requirements, information about rates of service, etc. Wikipedia extractions Detecting Online Sex Trafficking Professor Chris Re
  • 7. Example Slide Chegg Inc. © 2005 – 2017. All Rights Reserved.7 Student looking for tutors I need a 10 page essay written on the deforestation of the amazon rainforest. must have 7 resources.
  • 8. Example Slide Chegg Inc. © 2005 – 2017. All Rights Reserved.8 Students intents: Fraud • Do my homework • Take online quiz for me • Do my scheduled take home exam Universities typically have strict honor policies, stating that your homework, exams, take home etc should be completed by the student without any external help. A small number of students come to platform to get their homework done or ask someone to take their exam for them. This is a strict violation of honor code. Examples
  • 9. Example Slide Chegg Inc. © 2005 – 2017. All Rights Reserved.9 Typical NLP Machine Learning Flow High Performing Machine Learning Models could require 100,000 labelled data !!
  • 10. Example Slide Chegg Inc. © 2005 – 2017. All Rights Reserved.10 Traditional Feature Engineering Winning solution!! Feature Engineering
  • 11. Example Slide Chegg Inc. © 2005 – 2017. All Rights Reserved.11 Generating a training set • Human reading and labeling • Several hundreds of expert hours • Difficult to scale with evolving business questions
  • 12. Example Slide Chegg Inc. © 2005 – 2017. All Rights Reserved.12 The snorkel pipeline
  • 13. Example Slide Chegg Inc. © 2005 – 2017. All Rights Reserved.13 Human Machine Collaboration Knowledge transfer What is important to product and business Language, business needs and teams evolve. Data Scientist Product/Businesss SME Iterate Knowledge transfer What is important to product and business SME Data Scientist Product/Businesss • Create Filters • Create rules • Redefine Filters • Redefine rulesReplaces manual generation of labelled data
  • 14. Automated Features
  • 15. Creating Filters: Candidate Extraction
  • 16. Creating Filters: Candidate Extraction
  • 17. Observing the candidates. Humans/SMEs look at ~100-200 of them and label them.
  • 18. Creating Rules. Example: "I will pay someone to write my essay." Rule: reference to the tutor + verb followed by "my". This is an honor code violation ✓
  • 19. What do the rule functions look like? Several tens of rules create the training set. The rules are judged against the labels provided by humans or SMEs.
  • 20. Developing the training set: one rule [Plot: training set labels; x-axis 0 to 200; y-axis 1 = Class 1, 0 = unlabelled data, -1 = Class 2]
  • 21. Developing the training set: four rules [Plot: training set labels; x-axis 0 to 200; y-axis 1 = Class 1, 0 = unlabelled data, -1 = Class 2]
  • 22. Developing the training set: eight rules [Plot: training set labels; x-axis 0 to 200; y-axis 1 = Class 1, 0 = unlabelled data, -1 = Class 2]
  • 23. Developing the training set: one rule [Plot: training set labels; x-axis 0 to 200; y-axis 1 = Class 1, 0 = unlabelled data, -1 = Class 2]
  • 24. Developing the training set: twenty rules [Plot: training set labels; x-axis 0 to 200; y-axis 1 = Class 1, 0 = unlabelled data, -1 = Class 2]
  • 25. Performance Evaluation Metrics: Positive accuracy 68.3%; Negative accuracy 90.7%; Precision 71.8%; Recall 68.3%
  • 26. Production: Iterations • The Snorkel code runs on the opportunities sent the day before; humans check the list and update a file with the real honor code violations (HCVs). • After unsupervised learning (topic modeling, word2vec) on the positive and negative HCVs from the human-generated data, the rules are changed to improve positive accuracy. In dynamic two-sided marketplaces, language and behavior change continuously, so iterating every 3-4 months keeps the model fresh.
  • 27. Generalization: Matching Problem for Chegg Tutors • Feature generation for student-tutor pairs. • Chegg Tutors is a two-sided marketplace in which students and tutors are paired based on their overlapping characteristics. • Generating features is an important part of creating this recommendation system. Snorkel helps generate key phrases associated with student-tutor pairs.
  • 28. Behind training set generation: PGMs (https://arxiv.org/pdf/1605.07723.pdf). Option 1: model the rules as independent, similar to Naïve Bayes. Option 2: consider interdependencies between the rules: similar, fix, reinforce, exclude.
  • 29. Noisy sources of truth. Credit: https://hazyresearch.github.io/snorkel/
  • 30. Generalization
  • 31. Thank you. sdeb@chegg.com @sangha_deb
  • 32. Stanford NLP & Tools. Opportunities dataset: intent of honor code violation. Other datasets: Chat. Other business questions: • Are students and tutors having classes offsite? • Are tutors committing fraud? • Are students/tutors using offensive language? • Do students want lessons immediately, or are they willing to wait? …
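
The rule-based labeling described in slides 18 to 25 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the rule names, the regular expressions, and the example messages are hypothetical and are not the rules used at Chegg. A simple majority vote also stands in for Snorkel's generative model, which instead learns how much to trust each rule. The label values follow the slides: 1 for Class 1 (here, an honor code violation), -1 for Class 2, and 0 for unlabelled data.

```python
import re

# Label convention from the slides: 1 = Class 1, 0 = unlabelled, -1 = Class 2.
VIOLATION, ABSTAIN, LEGIT = 1, 0, -1


# Hypothetical rule functions. Each one votes on a message or abstains.
def rule_verb_my(text):
    """Tutor-directed verb followed by 'my', e.g. 'write my essay'."""
    pattern = r"\b(write|do|take|finish|complete)\s+my\b"
    return VIOLATION if re.search(pattern, text.lower()) else ABSTAIN


def rule_pay_someone(text):
    """Offer of payment to a person to do the work."""
    pattern = r"\bpay\s+(someone|you)\b"
    return VIOLATION if re.search(pattern, text.lower()) else ABSTAIN


def rule_asks_for_help(text):
    """Request for explanation or review, treated as a legitimate intent."""
    pattern = r"\b(help me understand|explain|review)\b"
    return LEGIT if re.search(pattern, text.lower()) else ABSTAIN


RULES = [rule_verb_my, rule_pay_someone, rule_asks_for_help]


def apply_rules(text, rules=RULES):
    """Return one vote per rule for a single message."""
    return [rule(text) for rule in rules]


def majority_vote(votes):
    """Combine votes by their sign; ties and all-abstain stay unlabelled."""
    total = sum(votes)
    if total > 0:
        return VIOLATION
    if total < 0:
        return LEGIT
    return ABSTAIN


def label(text, rules=RULES):
    """Label one message using the combined rule votes."""
    return majority_vote(apply_rules(text, rules))


def precision_recall(predicted, gold):
    """Precision and recall for the VIOLATION class against hand labels."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p == VIOLATION and g == VIOLATION)
    fp = sum(1 for p, g in pairs if p == VIOLATION and g != VIOLATION)
    fn = sum(1 for p, g in pairs if p != VIOLATION and g == VIOLATION)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    messages = [
        "I will pay someone to write my essay.",
        "Can you explain photosynthesis?",
        "Take online quiz for me",
    ]
    for message in messages:
        print(label(message), message)
```

With only three rules, a message such as "Take online quiz for me" stays unlabelled because no rule fires. This is the coverage gap that adding more rules is meant to close, and the small set of hand-labelled candidates is what `precision_recall` would be run against when judging the rules.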