Natural Language Comprehension: Human Machine Collaboration.

Confidential Material – Chegg Inc. © 2005 – 2017. All Rights Reserved.
Sanghamitra Deb, Data Scientist
Gabriela Brown, Summer Intern

Example Slide
Chegg

Example Slide
What is Chegg Tutors?

Example Slide
Unstructured data in Business

Example Slide
Dark Data at Chegg
Chats between tutors and students
Chegg Study Q&A

Example Slide
Bringing light to Dark Data
DeepDive and Snorkel process documents from the public and dark web to
extract evidential data such as names, addresses, phone numbers, job types,
job requirements, information about rates of service, etc.
Wikipedia extractions
Detecting Online Sex Trafficking
Professor Chris Ré

Example Slide
Student looking for tutors
“I need a 10 page essay written on the deforestation of the amazon rainforest. must have 7 resources.”

Example Slide
Student intents: Fraud
• Do my homework
• Take online quiz for me
• Do my scheduled take home exam
Universities typically have strict honor policies, stating that homework,
exams, take-home assignments, etc. should be completed by the student
without any external help.
A small number of students come to the platform to get their homework done
or to ask someone to take their exam for them. This is a strict violation of
the honor code.
Examples

Example Slide
Typical NLP Machine Learning Flow
High-performing machine learning models
could require 100,000 labelled data points!!

Example Slide
Traditional Feature Engineering
Winning solution!!
Feature Engineering

Example Slide
Generating a training set
• Human reading and labeling
• Several hundred expert hours
• Difficult to scale with evolving business questions

Example Slide
The snorkel pipeline

Example Slide
Human Machine Collaboration
Participants: Data Scientist, Product/Business, SME
Knowledge transfer: what is important to product and business
Language, business needs and teams evolve.
Iterate:
• Create filters
• Create rules
• Redefine filters
• Redefine rules
Replaces manual generation of labelled data

Example Slide
Automated Features

Example Slide
Creating Filters: Candidate Extraction

Example Slide
Observing the candidates
Humans/SMEs look at ~100-200 of the candidates and label them.

Example Slide
Creating Rules
Example: “I will pay someone to write my essay.”
Rule: reference to the tutor + verb followed by “my”
This is an honor code violation ✓
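
The rule above can be sketched as a small function that either votes "honor code violation" or abstains. This is a minimal illustration in plain Python, not the Snorkel API: the function name, the label constants and both word lists are assumptions made for the example, since the slide does not list the words actually used.

```python
import re

HCV = 1       # the rule votes "honor code violation"
ABSTAIN = 0   # the rule does not apply; the candidate stays unlabelled

# Hypothetical word lists; real rules would be tuned on the labelled candidates.
TUTOR_REFERENCES = r"(?:someone|somebody|anyone|a tutor|you)"
TASK_VERBS = r"(?:write|do|take|complete|finish)"

# Reference to the tutor, then (within at most three words) a verb followed by "my".
TUTOR_VERB_MY = re.compile(
    rf"\b{TUTOR_REFERENCES}\b(?:\s+\w+){{0,3}}?\s+{TASK_VERBS}\s+my\b",
    re.IGNORECASE,
)

def rule_tutor_verb_my(text: str) -> int:
    """Return HCV when the text matches 'tutor reference + verb + my', else ABSTAIN."""
    return HCV if TUTOR_VERB_MY.search(text) else ABSTAIN
```

Applied to the slide's example, `rule_tutor_verb_my("I will pay someone to write my essay.")` returns `HCV`, while a request such as "Can you explain how photosynthesis works?" is left unlabelled.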

Example Slide
What do the rule functions look like?
Several tens of rules create the training set.
The rules are judged against the labels provided by humans or SMEs.
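
One way to judge a rule against the hand-labelled candidates is to measure how often it fires (coverage) and how often it agrees with the human label when it does (accuracy). The sketch below assumes the 1 / 0 / -1 label convention used later in the deck; `score_rule`, `mentions_pay` and the sample texts are hypothetical and not taken from the talk.

```python
from typing import Callable, Dict, List, Tuple

def score_rule(rule: Callable[[str], int],
               labelled: List[Tuple[str, int]]) -> Dict[str, float]:
    """Coverage: share of candidates the rule labels.
    Accuracy: share of those labelled candidates where it matches the human label."""
    votes = [(rule(text), gold) for text, gold in labelled]
    fired = [(vote, gold) for vote, gold in votes if vote != 0]
    coverage = len(fired) / len(votes) if votes else 0.0
    accuracy = sum(vote == gold for vote, gold in fired) / len(fired) if fired else 0.0
    return {"coverage": coverage, "accuracy": accuracy}

# Hypothetical rule: votes "violation" (1) whenever payment is mentioned.
def mentions_pay(text: str) -> int:
    return 1 if "pay" in text.lower() else 0

# Hypothetical hand-labelled sample (1 = violation, -1 = not a violation).
sample = [
    ("I will pay someone to write my essay.", 1),
    ("Take my online quiz for me", 1),
    ("How do I pay for a tutoring session?", -1),
    ("Please explain question 3 of my homework", -1),
]
scores = score_rule(mentions_pay, sample)
```

A rule with low accuracy on the labelled sample, like this deliberately crude one, is a candidate for being redefined in the next iteration.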

Example Slide
Developing the training set: one rule
[Figure: histogram of the training set after one rule. x-axis values: 1 (Class 1), 0 (unlabelled data), -1 (Class 2); y-axis ticks from 0 to 200.]

Example Slide
Developing the training set: four rules
[Figure: histogram of the training set after four rules. x-axis values: 1, 0 (unlabelled data), -1 (Class 2); y-axis ticks from 0 to 200.]

Example Slide
Developing the training set: eight rules
[Figure: histogram of the training set after eight rules. x-axis values: 1, 0 (unlabelled data), -1 (Class 2); y-axis ticks from 0 to 200.]

Example Slide
Developing the training set: one rule
[Figure: histogram of the training set. x-axis values: 1, 0 (unlabelled data), -1 (Class 2); y-axis ticks from 0 to 200.]

Example Slide
Developing the training set: twenty rules
[Figure: histogram of the training set after twenty rules. x-axis values: 1, 0 (unlabelled data), -1 (Class 2); y-axis ticks from 0 to 200.]
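
The progression from one rule to twenty can be illustrated with a toy label histogram: each rule votes 1, -1 or 0 (abstain), and a candidate stays unlabelled until some rule fires on it. The sketch below combines votes with a simple majority, which is a simplified stand-in for the generative model Snorkel fits; the rules and texts are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence

Rule = Callable[[str], int]

def majority_label(text: str, rules: Sequence[Rule]) -> int:
    """Sum the votes (+1, -1, or 0 for abstain) and return the sign of the total."""
    total = sum(rule(text) for rule in rules)
    return (total > 0) - (total < 0)

def label_counts(texts: List[str], rules: Sequence[Rule]) -> Counter:
    """Histogram of training labels with keys 1, 0 (unlabelled) and -1."""
    return Counter(majority_label(text, rules) for text in texts)

# Hypothetical keyword rules and candidates, for illustration only.
rules: List[Rule] = [
    lambda t: 1 if "do my" in t.lower() else 0,
    lambda t: 1 if "for me" in t.lower() else 0,
    lambda t: -1 if "explain" in t.lower() else 0,
    lambda t: -1 if "help me understand" in t.lower() else 0,
]
texts = [
    "Do my homework",
    "Take online quiz for me",
    "Can you explain recursion?",
    "Help me understand limits",
]
one_rule = label_counts(texts, rules[:1])   # only the first rule is active
four_rules = label_counts(texts, rules)     # all four rules are active
```

With a single rule, three of the four toy candidates remain at 0; with all four rules, none do. Conflicting votes cancel to 0 under majority voting, which is one reason to model rule accuracies instead.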

Example Slide
Performance
Evaluation Metrics
Positive accuracy 68.3%
Negative accuracy 90.7%
Precision 71.8%
Recall 68.3%
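
The slide does not define these metrics. A plausible reading, consistent with positive accuracy and recall sharing the same value, is that positive accuracy is the fraction of true violations that are caught and negative accuracy is the fraction of non-violations that are correctly passed. The sketch below computes all four from confusion-matrix counts; the counts are hypothetical and are not the data behind the slide.

```python
from typing import Dict

def evaluation_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute the four slide metrics from confusion-matrix counts.

    Assumed definitions (not stated in the slide):
      positive accuracy = TP / (TP + FN), identical to recall
      negative accuracy = TN / (TN + FP)
      precision         = TP / (TP + FP)
    """
    recall = tp / (tp + fn)
    return {
        "positive_accuracy": recall,
        "negative_accuracy": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": recall,
    }

# Hypothetical counts for illustration.
metrics = evaluation_metrics(tp=30, fp=10, tn=90, fn=20)
```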

Example Slide
Production: Iterations
• The Snorkel code runs on the opportunities sent the day before; humans check the list and update a file with
the real honor code violations.
• After unsupervised learning (topic modeling, word2vec) on the positive and negative HCVs (honor code
violations) in the human-generated data, the rules are changed to improve positive accuracy.
In dynamic two-sided marketplaces, language and behavior change continuously;
hence iterating every 3-4 months keeps the model fresh.

Example Slide
Generalization: Matching Problem for Chegg Tutors
• Feature generation for student-tutor pairs.
• Chegg Tutors is a two-sided marketplace in which students and tutors
are paired based on their overlapping characteristics.
• Generating features is an important part of creating this
recommendation system. Snorkel helps generate key phrases
associated with student-tutor pairs.

Example Slide
Behind training set generation: PGMs (probabilistic graphical models)
https://arxiv.org/pdf/1605.07723.pdf
Model the rules as independent, similar to Naïve Bayes.
Consider interdependencies between the rules: similar, fix, reinforce, exclude.
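
The independent, Naïve Bayes-like case can be sketched as follows: if each rule is assumed to be correct with a known probability whenever it fires, the votes combine additively in log-odds. This is a simplified illustration, not the procedure from the linked paper, where rule accuracies are learned from unlabelled data; the function name and the accuracy values are assumptions.

```python
import math
from typing import Sequence

def posterior_positive(votes: Sequence[int],
                       accuracies: Sequence[float],
                       prior: float = 0.5) -> float:
    """P(label = 1 | votes), treating the rules as conditionally independent.

    votes[j] is 1, -1, or 0 (abstain); accuracies[j] is the assumed probability,
    strictly between 0 and 1, that rule j is correct when it does not abstain.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for vote, acc in zip(votes, accuracies):
        if vote == 0:
            continue  # an abstaining rule carries no evidence
        log_odds += vote * math.log(acc / (1.0 - acc))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Two rules vote "Class 1", one weaker rule votes "Class 2" (hypothetical accuracies).
p = posterior_positive(votes=[1, 1, -1], accuracies=[0.8, 0.7, 0.6])
```

The result is a probabilistic training label rather than a hard 1 / -1, so confident agreements count for more than a single weak rule firing alone.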

Example Slide
Noisy sources of truth
credit:
https://hazyresearch.github.io/snorkel/

Example Slide
Generalization

Example Slide
Thank you
sdeb@chegg.com
@sangha_deb

Example Slide
Stanford NLP & Tools
Intent of Honor Code Violation
Opportunities
• Are students and tutors having classes offsite?
• Are tutors committing fraud?
• Are students/tutors using offensive language?
• Do students want lessons immediately, or are they willing to wait? …
Other business questions
Other datasets: Chat
