Big Data Day LA 2015 - Feature Engineering by Brian Kursar of Toyota

TOYOTA
Machine Learning on Apache Spark™
FEATURE ENGINEERING
Brian Kursar, Sr Data Scientist
Toyota Motor Sales IT Research and Development
Big Data Day LA 2015

Big Data
Increased Storage Capacity
Faster Processing

TOYOTA Big Data History
2015  C360 - Next Gen Insights Platform: Over 6B Records
2014  C360 - Customer Experience Analytics: Over 700M Records
2013  C360 - Toyota Social Media Intelligence Center: Over 500M Records
2012  Product Quality Analytics v2: Over 120M Records
2011  Marketing and Incentives Analytics: 70M Records
2010  Product Quality Analytics: Over 60M Records

Sentiment Analysis
Basic Sentiment Analysis is not enough

Existing Tools
Toyota Social Opinion 2013
    Jan – Feb    Feb – Mar
    +1%          +1%
    +1%          +2%
WHY?
It doesn’t give you the “Why”

Toyota Online Conversations by the Numbers (2014 Study)
40% Retailers Selling Toyota Vehicles
11% Opinions on Marketing Campaigns
10% Feedback on Dealer Sales and Service Experiences
9% Opinions on Product Styling and Features
8% People In Market for a Toyota
8% Incident Reports Involving a Toyota Vehicle
7% Feedback on Product Quality
5% Customers Advocating for the Brand
2% Completely Irrelevant

Toyota Online Conversations by the Numbers (2014 Study)
50% Noise

Problem Statement
Millions of social media posts a day and not enough resources to read them all

Technology Opportunities
Categorize and prioritize incoming social media interactions in real time using Machine Learning to provide ACTIONABLE INSIGHTS
Categories: Campaign Opinions, Customer Feedback, Product Feedback, Noise

Is this the image that executives at your company have when they hear the words “Machine Learning”? If so, help make it relatable.

First Spark MLlib Experiment
• Seat Cover Wrinkles/Cracking
• Brake Noise
• Shift Quality
• Oil Leaks
• HVAC Odor
• Dead Battery
• Rodent Wire Harness Damage
• Paint Chips
Time-box project to 12 Weeks
Classify at min 80% accuracy

How does one find training data for a noise they have never personally experienced?

Noise, Brakes
If it were as easy as using the keywords “Brake” and “Noise”, then why bother using Machine Learning?

Challenge
Where can we find categorized, specific Product Quality Concerns reported by customers to use as training data?

Mechanical Turk
The Amazon Mechanical Turk (MTurk) is a crowdsourcing Internet marketplace
that enables individuals and businesses (known as Requesters) to coordinate
the use of human intelligence to perform tasks that computers are currently
unable to do.

Mechanical Turk
Define: Define social data with the keywords “Brake” AND a Toyota product line (e.g. Prius, Camry, Avalon).
Create: Create Human Intelligence Tasks: “Is this comment indicative of a problem concerning Brake Noise?” (Y / N)
Train: Leverage results from Mechanical Turk as positive and negative training sets.

“When I’m backing up in my 2012 Prius, it sounds like something hanging up or scraping as it rotates and only happens in the morning.”
“I hear a squeak coming from the back wheel of my Prius as I pull out from my driveway in the morning.”
True positives missed by workers in labeling due to lack of experience with the problem they are looking to identify.
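
To make the keyword problem concrete, here is a minimal sketch of a naive "Brake" AND "Noise" filter applied to the two comments above. The function name `keyword_match` and the prefix-matching rule are illustrative assumptions, not part of the original talk.

```python
import re

def keyword_match(text: str) -> bool:
    """Naive filter: True only if a 'brake...' word and a 'nois...' word both appear."""
    words = re.findall(r"[a-z']+", text.lower())
    has_brake = any(w.startswith("brake") for w in words)
    has_noise = any(w.startswith("nois") for w in words)
    return has_brake and has_noise

comments = [
    "When I'm backing up in my 2012 Prius, it sounds like something hanging up "
    "or scraping as it rotates and only happens in the morning.",
    "I hear a squeak coming from the back wheel of my Prius as I pull out from "
    "my driveway in the morning.",
]

# Neither comment mentions "brake" or "noise", so both are missed.
matches = [keyword_match(c) for c in comments]
print(matches)
```

Both comments plausibly describe brake noise, yet neither passes the filter. That gap is the case for a trained classifier over plain keyword rules.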

Leverage Similar Internal Datasets as Training Data

Internal Data can help build Training set with relevant Features

Social ML Pipeline
Training Data: Hand Labeled Social, Labeled Internal Survey Responses
Natural Language Processing: Cleanse, N-Gram Extraction, Stemming, Stop Words
Featurizer: N-Grams
Chi-Square Feature Selector
All, Random, Top Popular, Top Random

Extract Text Features
Statistics.chiSqTest(vec)
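
The slide names Spark MLlib's `Statistics.chiSqTest`. The sketch below does not call Spark; it is a plain-Python illustration of the underlying idea, scoring each term by the Pearson chi-square statistic of a 2x2 term/label table. The helper names and the toy documents are assumptions made for illustration.

```python
def chi_square_2x2(n11, n10, n01, n00):
    """Pearson chi-square statistic for a 2x2 term/label contingency table.

    n11: positive docs containing the term   n10: negative docs containing it
    n01: positive docs without the term      n00: negative docs without it
    """
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den if den else 0.0

def term_scores(docs):
    """docs: list of (set_of_terms, label) pairs, label True for the positive class."""
    vocab = set().union(*(terms for terms, _ in docs))
    scores = {}
    for t in vocab:
        n11 = sum(1 for terms, y in docs if t in terms and y)
        n10 = sum(1 for terms, y in docs if t in terms and not y)
        n01 = sum(1 for terms, y in docs if t not in terms and y)
        n00 = sum(1 for terms, y in docs if t not in terms and not y)
        scores[t] = chi_square_2x2(n11, n10, n01, n00)
    return scores

# Toy labeled documents (hypothetical): True = brake-noise complaint.
docs = [
    ({"squeak", "wheel", "prius"}, True),
    ({"squeak", "brake"}, True),
    ({"love", "prius"}, False),
    ({"new", "prius"}, False),
]
scores = term_scores(docs)
# Rank terms by score, breaking ties alphabetically.
ranked = sorted(scores, key=lambda t: (-scores[t], t))
```

Terms that occur about equally often in both classes score low and are candidates for removal; terms concentrated in one class score high.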

Social ML Pipeline
Training Data: Hand Labeled Social, Labeled Survey Responses
Cleanse, N-Gram Extraction, Stemming, Stop Words, Synonyms, POS
Featurizer: N-Grams
Training Selection Filters: All, Random, Top Popular, Top Random
Vectorizer: TF*IDF
  TF Options: Natural, Boolean, Log, Augmented
  IDF Options: Unary, Inverse, Inv Smooth, ProbDF
Features
Training Set: 9 Fold
Validation Set: 1 Fold
Support Vector Machines
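
The TF and IDF option names on this slide match common textbook weighting variants. The sketch below uses those textbook definitions as an assumption; the deck does not give the exact formulas used. The function and scheme names are illustrative.

```python
import math

def tf_weight(tf, max_tf, scheme="natural"):
    """Term-frequency weight for a raw count tf; max_tf is the largest count in the document."""
    if tf <= 0:
        return 0.0
    if scheme == "natural":
        return float(tf)
    if scheme == "boolean":
        return 1.0
    if scheme == "log":
        return 1.0 + math.log(tf)
    if scheme == "augmented":
        return 0.5 + 0.5 * tf / max_tf
    raise ValueError(f"unknown TF scheme: {scheme}")

def idf_weight(n_docs, df, scheme="inverse"):
    """IDF weight for a term that appears in df of n_docs documents (df >= 1)."""
    if scheme == "unary":
        return 1.0
    if scheme == "inverse":
        return math.log(n_docs / df)
    if scheme == "inv_smooth":
        return math.log(1.0 + n_docs / df)
    if scheme == "prob":
        # Clamped at zero; treated as zero when the term is in every document.
        return max(0.0, math.log((n_docs - df) / df)) if df < n_docs else 0.0
    raise ValueError(f"unknown IDF scheme: {scheme}")

# One TF*IDF weight: a term seen 3 times, present in 10 of 100 documents.
weight = tf_weight(3, 5, "log") * idf_weight(100, 10, "inverse")
```

Exposing the schemes as parameters makes it straightforward to try each TF and IDF combination and compare validation accuracy.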

Social ML Pipeline
Training Data: Hand Labeled Social, Labeled Survey Responses
Cleanse, N-Gram Extraction, Stemming, Stop Words, Synonyms, POS
Featurizer: N-Grams
All, Random, Top Popular, Top Random
Vectorizer: TF*IDF
  TF Options: Natural, Boolean, Log, Augmented
  IDF Options: Unary, Inverse, Inv Smooth, ProbDF
Features
Training Set: 9 Fold
Validation Set: 1 Fold
Model
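
The "9 Fold" training set and "1 Fold" validation set read like a 10-fold cross-validation split; that reading is an assumption. A minimal sketch of producing such splits over item indices, without shuffling, might look like this (the function name is illustrative):

```python
def k_fold_splits(n_items, k=10):
    """Yield (train_indices, validation_indices); each fold is the validation set once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation

# 20 labeled examples: each split trains on 9 folds and validates on 1.
splits = list(k_fold_splits(20, k=10))
```

In practice the examples would be shuffled, or stratified by label, before being assigned to folds.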

False Positives
I just had my friend at the toyota dealer rotate my tires and he
said … that the brake pads are getting thin really fast.
So what should I do when they get too thin in the future and start
to squeak?

False Positives
i cut the iac hose as shown in figure 20 in the manual but when i
start the car, it started gasping for air... choking...
sounds like it's about to die out.
i bought the power brake check valve (80190 part for
kragen)... but either i'm not installing it right or it's the wrong
size... i have no idea.

Explicit Semantic Analysis
“word” ---> <concept1, weight1>, <concept2, weight2>, <concept3, weight3>
REPAIR MANUAL

Build Concepts
Noise: squeak, grind, groan, squeal
Brakes: caliper, pads, rotor, wheel, drum

Measure Distance Similarity Between Concepts
Noise: squeak, grind, groan, squeal
Brakes: pads, caliper, rotor, wheel, drum
Distance is calculated between concepts based on the Minkowski distance formula.
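
As a reference for the Minkowski distance mentioned above, here is a minimal sketch. The example vectors are hypothetical concept weights, not values from the talk.

```python
def minkowski(u, v, p):
    """Minkowski distance of order p between two equal-length vectors.

    p = 1 gives the Manhattan (L1) distance, p = 2 the Euclidean (L2) distance.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

# Hypothetical weights of two texts against the concepts (Brakes, Noise).
a = [0.0, 0.0]
b = [3.0, 4.0]
l1 = minkowski(a, b, 1)  # |3| + |4|
l2 = minkowski(a, b, 2)  # sqrt(3^2 + 4^2)
```

The order p can be tuned like any other pipeline parameter, with the L1 and L2 cases as the usual starting points.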

Kaizen = Continuous Improvement

Kaizen
Train, Test, Evaluate, Refine

Social ML Pipeline
Training Data: Hand Labeled Social, Labeled Survey Responses
Cleanse, N-Gram Extraction, Stemming, Stop Words, Synonyms, POS
Featurizer: N-Grams
All, Random, Top Popular, Top Random
Vectorizer: TF*IDF
  TF Options: Natural, Boolean, Log, Augmented
  IDF Options: Unary, Inverse, Inv Smooth, ProbDF
Training Set: 9 Fold
Validation Set: 1 Fold
Model
Distance Similarity
  Euclidean Distance (L2): √(|x|^2 + |y|^2)
  Manhattan Distance (L1): |x| + |y|
Concept Manager
Concept Interpreter
Negation Evaluator

TEAM TOYOTA TIPS
• Education and Inclusion

TO STEM OR NOT TO STEM
noun verb
When does stemming cause your entities to change?

TO STEM OR NOT TO STEM
noun noun
Will stemming produce false positives?

NEGATION
Negation: to contradict or deny something.
Great work done in medical document retrieval can be leveraged. How different is a sick person from a sick car?

NEGATION
• absence
• did not
• didn't
• doesn't
• isn’t
• neither
• never
• no
• none
• without
N-GRAMS AND STOP WORDS
Stop words, generic definition: the most common words in a language.
Example: I, a, am, an, and, any, my, no, the, took
• For some models, unigrams (single words) can be problematic because they lack an adjacent term that could assist in disambiguation or indicate a voice.
• It helps to keep possessive nouns in the feature set rather than removing them via a stop list, especially for Twitter data. They help identify a “voice” versus marketing noise.
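
A minimal sketch of n-gram extraction with a swappable stop list follows. The generic list is the example from the slide; the reduced "voice" list, which keeps "i", "my", and "no", is a hypothetical choice for illustration.

```python
import re

GENERIC_STOP_WORDS = {"i", "a", "am", "an", "and", "any", "my", "no", "the", "took"}
# Hypothetical model-specific list that keeps first-person and negation words.
VOICE_STOP_WORDS = GENERIC_STOP_WORDS - {"i", "my", "no"}

def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def featurize(text, stop_words, n_values=(1, 2)):
    """Lowercase, drop stop words, then emit unigrams and bigrams."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in stop_words]
    return [g for n in n_values for g in ngrams(tokens, n)]

text = "I took my Prius to the dealer"
generic = featurize(text, GENERIC_STOP_WORDS)
voice = featurize(text, VOICE_STOP_WORDS)
```

With the generic list the bigram "my prius" never appears, while the reduced list preserves it as a feature that signals a first-person owner rather than marketing copy.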

N-GRAMS AND STOP LISTS
• Each model should have its own carefully selected stop words list
• Add unimportant entities to the stop words lists

CONTEXT IS KING
• Not all social media interactions are equal
• Some offer better metadata than others
• Leverage relevant metadata as features, feature weights, or filters
{
  "interaction": {
    "source": "web",
    "author": {
      "username": "johndoe",
      "name": "John Doe",
      "id": 10750902,
      "parentid": 10750901,
      "avatar": "http://a0.twimg.com/profile_images/1111111111/example.jpeg",
      "link": "http://twitter.com/johndoe"
    },
    "type": "reply",
    "created_at": "Fri, 17 Aug 2012 14:13:08 +0000",
    "content": "I like ice cream!",
    "id": "1e1e875ab43fa233e074337458bc1dca",
    "link": "http://twitter.com/johndoe/statuses/111111111111111111",
    "geo": {
      "latitude": 42.376104,
      "longitude": -71.237189
    }
  },
  "twitter": {
    "created_at": "Fri, 17 Aug 2012 14:13:08 +0000"
  },
  "demographic": {
    "gender": "mostly_male"
  }
}
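
As a sketch of how metadata from a record like the one above could become features or filters, the snippet below parses a trimmed copy of that record. Which fields to use, and the names `metadata_features` and `keep`, are illustrative assumptions.

```python
import json

# Trimmed copy of the interaction record, with only the fields used below.
RAW = """
{
  "interaction": {
    "source": "web",
    "type": "reply",
    "content": "I like ice cream!",
    "geo": {"latitude": 42.376104, "longitude": -71.237189}
  },
  "demographic": {"gender": "mostly_male"}
}
"""

def metadata_features(record):
    """Turn selected metadata fields into simple model features."""
    interaction = record.get("interaction", {})
    return {
        "source": interaction.get("source"),
        "is_reply": interaction.get("type") == "reply",
        "has_geo": "geo" in interaction,
        "gender": record.get("demographic", {}).get("gender"),
    }

def keep(record, allowed_types=("tweet", "reply")):
    """Filter: only keep interaction types the model was trained on."""
    return record.get("interaction", {}).get("type") in allowed_types

record = json.loads(RAW)
features = metadata_features(record)
```

The same fields can serve as a pre-filter (`keep`), as extra columns next to the text features, or as weights on the text features.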

CONTEXT IS KING
• Leverage metadata to set context for your documents
• Review sites: ask a pointed question
Example: “I love it!”
Metadata can tell you what “it” is: Venue Name, Movie Name, Product Name, Brand Name.

CONTEXT IS KING
Parent Post (Interaction Type = Tweet):
Your new Prius is awesome. How do you like it?
Child Post (Interaction Type = Reply):
I love it. Although, every morning, I keep hearing growling and squealing noises coming from the back seat. :P
Context

CONTEXT IS KING
Same text, different context
Interaction Type = Tweet
Interaction Author = @MrsKursar
I love it. Although, every morning, I keep hearing growling and squealing noises coming from the back seat. :P
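
One way to let a model see this difference is to attach the parent post to a reply before featurizing. The sketch below assumes hypothetical `id`, `parent_id`, and `text` fields and an arbitrary separator token; none of these come from the talk.

```python
def document_with_context(post, posts_by_id, sep=" [PARENT_END] "):
    """Prefix a reply with its parent's text; standalone posts are returned unchanged."""
    parent = posts_by_id.get(post.get("parent_id"))
    if parent is None:
        return post["text"]
    return parent["text"] + sep + post["text"]

TEXT = ("I love it. Although, every morning, I keep hearing growling and "
        "squealing noises coming from the back seat. :P")

posts = {
    1: {"id": 1, "type": "tweet", "text": "Your new Prius is awesome. How do you like it?"},
    2: {"id": 2, "type": "reply", "parent_id": 1, "text": TEXT},
    3: {"id": 3, "type": "tweet", "text": TEXT},
}

reply_doc = document_with_context(posts[2], posts)
tweet_doc = document_with_context(posts[3], posts)
```

The reply and the standalone tweet share identical text, but the documents handed to the featurizer now differ, so the parent's mention of the Prius becomes available as a signal.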

Brian Kursar – Sr Data Scientist
Toyota Motor Sales IT Research and Development
@briankursar
toyota.com/careers
