2. Dato Confidential
About me
Chris DuBois
Staff Data Scientist
Ph.D. in Statistics
Previously: probabilistic models of social networks
and event data occurring over time.
Recently: tools for recommender systems, text
analysis, feature engineering and model parameter
search.
chris@dato.com
@chrisdubois
3. About Dato
We aim to help developers…
• develop valuable ML-based applications
• deploy models as intelligent microservices
5. Agenda
• Applications of text analysis: examples and a demo
• Fundamentals
• Text processing
• Machine learning
• Task-oriented tools
• Putting it all together
• Deploying the pipeline for the application
• Quick survey of next steps
• Advanced modeling techniques
16. Data

id   timestamp           user_name  text
102  2016-04-05 9:50:31  Bob        The food tasted bland and uninspired.
103  2016-04-06 3:20:25  Charles    I would absolutely bring my friends here. I could eat those fries every …
104  2016-04-08 11:52    Alicia     Too expensive for me. Even the water was too expensive.
…    …                   …          …
int  datetime            str        str
17. Transforming string columns

id   text
102  The food tasted bland and uninspired.
103  I would absolutely bring my friends here. I could eat those fries every …
104  Too expensive for me. Even the water was too expensive.
…    …
int  str

Transformations covered in this section:
• Tokenizer
• CountWords
• NGramCounter
• TFIDF
• RemoveRareWords
• SentenceSplitter
• ExtractPartsOfSpeech
• stack/unstack
• Pipelines
19. Bag of words representation
Compute the number of times each word occurs.

INPUT
“The ice cream was the best.”

OUTPUT
{
  “the”: 2,
  “ice”: 1,
  “cream”: 1,
  “was”: 1,
  “best”: 1
}
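The transformation is easy to sketch in plain Python. The tokenization rules here (lowercasing, stripping surrounding punctuation) are illustrative assumptions, not necessarily what the library's CountWords transformer does:

```python
from collections import Counter

def count_words(text):
    """Bag of words: lowercase, split on whitespace, strip punctuation, count."""
    tokens = [w.strip(".,!?\"'").lower() for w in text.split()]
    return Counter(t for t in tokens if t)

print(dict(count_words("The ice cream was the best.")))
# {'the': 2, 'ice': 1, 'cream': 1, 'was': 1, 'best': 1}
```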
20. N-Grams (words)
Compute the number of times each sequence of n consecutive words occurs (here, n = 2).

INPUT
“Give it away, give it away, give it away now”

OUTPUT
{
  “give it”: 3,
  “it away”: 3,
  “away give”: 2,
  “away now”: 1
}
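A plain-Python sketch of word n-gram counting (with simplified tokenization, an assumption for illustration):

```python
def ngram_counts(text, n=2):
    """Count each run of n consecutive tokens (word n-grams)."""
    tokens = [w.strip(".,!?").lower() for w in text.split()]
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    counts = {}
    for g in grams:
        counts[g] = counts.get(g, 0) + 1
    return counts

print(ngram_counts("Give it away, give it away, give it away now"))
# {'give it': 3, 'it away': 3, 'away give': 2, 'away now': 1}
```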
22. TF-IDF representation
Rescale word counts to discriminate between common and distinctive words.

INPUT
{“this”: 1, “is”: 1, “a”: 2, “example”: 3}

OUTPUT
{“this”: .3, “is”: .1, “a”: .2, “example”: .9}

• Low scores for words that are common among all documents
• High scores for rare words occurring often in a document
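One common TF-IDF variant scores word w in document d as tf(w, d) * log(N / df(w)), where tf is the in-document count, N the number of documents, and df the number of documents containing w. A sketch of that variant (not necessarily the exact weighting the TFIDF transformer uses):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: score} dict per document,
    using the common tf * log(N / df) weighting (one of several variants)."""
    n = len(docs)
    df = Counter()                     # in how many documents each word appears
    for doc in docs:
        df.update(set(doc))
    return [{w: c * math.log(n / df[w]) for w, c in Counter(doc).items()}
            for doc in docs]

docs = [["the", "fries", "were", "great"],
        ["the", "service", "was", "slow"],
        ["great", "fries", "great", "price"]]
scores = tf_idf(docs)
# "the" occurs in 2 of 3 documents, so it scores low;
# "price" occurs in only 1 of 3, so it scores higher.
```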
23. Remove rare words
Remove rare words (or pre-defined stopwords).

INPUT
“I like green eggs and ham.”

OUTPUT
“I like green and ham.”
24. Split by sentence
Convert each document into a list of sentences.

INPUT
“It was delicious. I will be back.”

OUTPUT
[“It was delicious.”, “I will be back.”]
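The behavior can be approximated with a naive regex; real sentence splitters also handle abbreviations, quotes, and so on, which this sketch does not:

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(split_sentences("It was delicious. I will be back."))
# ['It was delicious.', 'I will be back.']
```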
25. Extract parts of speech
Identify particular parts of speech (here, adjectives).

INPUT
“It was delicious. I will go back.”

OUTPUT
[“delicious”]
26. stack
Rearrange the data set to “stack” the list column: one output row per list element.

INPUT (text: list)
id   text
102  [“The food was great.”, “I will be back.”]
103  [“I would absolutely bring my friends here.”, “I could eat those fries every day.”]
104  [“Too expensive for me.”]
…    …

OUTPUT (text: str)
id   text
102  “The food was great.”
102  “I will be back.”
103  “I would absolutely bring my friends here.”
…    …
27. unstack
Rearrange the data set to “unstack” the string column: collect the rows sharing an id back into a list.

INPUT (text: str)
id   text
102  “The food was great.”
102  “I will be back.”
103  “I would absolutely bring my friends here.”
…    …

OUTPUT (text: list)
id   text
102  [“The food was great.”, “I will be back.”]
103  [“I would absolutely bring my friends here.”, …]
…    …
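The stack/unstack pair can be sketched in plain Python over (id, value) rows, a deliberate simplification of the SFrame operations:

```python
def stack(rows):
    """One output row per list element: (id, [a, b]) -> (id, a), (id, b)."""
    return [(row_id, s) for row_id, sentences in rows for s in sentences]

def unstack(rows):
    """Inverse: regroup strings into one list per id, preserving order."""
    grouped = {}
    for row_id, s in rows:
        grouped.setdefault(row_id, []).append(s)
    return list(grouped.items())

data = [(102, ["The food was great.", "I will be back."]),
        (104, ["Too expensive for me."])]
flat = stack(data)
# [(102, 'The food was great.'), (102, 'I will be back.'),
#  (104, 'Too expensive for me.')]
assert unstack(flat) == data
```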
28. Pipelines
Combine multiple transformations to create a single pipeline.

import graphlab as gl
from graphlab.feature_engineering import WordCounter, RareWordTrimmer

# Transformers are passed as instances; create() fits the chain on sf.
f = gl.feature_engineering.create(sf, [WordCounter(), RareWordTrimmer()])
out = f.transform(new_data)
31. Nearest neighbors

Reference:  id (int), TF-IDF of text (dict)
Query:      id (int), TF-IDF of text (dict)
→ NearestNeighborsModel →
Result:     query_id (int), reference_id (int), score (float), rank (int)
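Under the hood, a nearest-neighbors query over TF-IDF vectors amounts to ranking reference rows by a similarity measure. A sketch using cosine similarity on sparse dict vectors (the ids and weights below are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nearest(query, reference, k=2):
    """Return the k reference ids most similar to the query vector."""
    ranked = sorted(reference.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

reference = {102: {"food": 0.9, "bland": 0.7},
             103: {"fries": 0.8, "friends": 0.6},
             104: {"expensive": 1.1, "water": 0.4}}
query = {"fries": 1.0, "food": 0.2}
print(nearest(query, reference, k=1))  # [103]
```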
32. Autotagger = Nearest Neighbors + Feature Engineering

Reference:  id (int), character n-grams (dict), …
Query:      id (int), character n-grams (dict), …
→ AutotaggerModel →
Result:     query_id (int), reference_id (int), score (float), rank (int)
33. Task-oriented toolkits and building blocks

SentimentAnalysis = WordCounter + LogisticClassifier:
predicts positive/negative sentiment in unstructured text.
34. Logistic classifier

Training data:  label (int/str), feature(s) (int/float/str/dict/list)
→ LogisticClassifier →
New data:       label, feature(s)
Output:         scores (float), the probability that label = 1
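The logistic classifier's score is the sigmoid of a weighted sum of the features. A sketch of just the scoring step; the weights below are hypothetical, standing in for what training would produce:

```python
import math

def sigmoid(z):
    """Squash a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, features, bias=0.0):
    """P(label = 1) for a sparse {feature: value} row under a linear model."""
    z = bias + sum(v * weights.get(f, 0.0) for f, v in features.items())
    return sigmoid(z)

# Hypothetical weights a trained classifier might assign to word counts.
weights = {"great": 1.4, "delicious": 1.1, "bland": -1.6}
print(predict_proba(weights, {"great": 1, "delicious": 1}))  # ~0.92
print(predict_proba(weights, {"bland": 1}))                  # ~0.17
```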
35. Sentiment analysis = LogisticClassifier + Feature Engineering

Training data:  label (int/str), bag of words (dict)
→ SentimentAnalysis →
New data:       label, bag of words
Output:         scores (float), i.e. sentiment scores (probability that label = 1)
41. LSTM (Long Short-Term Memory)
Train a neural network to predict the next word given the previous words.
Image: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
42. Conclusion
• Applications of text analysis: examples and a demo
• Fundamentals
• Text processing
• Machine learning
• Task-oriented tools
• Putting it all together
• Deploying the pipeline for the application
• Quick survey of next steps
• Advanced modeling techniques
43. Next steps
Webinar material? Watch for an upcoming email!
Help getting started? Userguide, API docs, Forum
More webinars? Benchmarking ML, data mining, and more
Contact: Chris DuBois, chris@dato.com
Editor's Notes
We aim to help developers…
create and grow business value with ML: immediately build a business application using our fully customizable toolkits, then continuously and easily enhance it:
Business-task centered ML toolkits: recommenders, churn prediction, lead scoring,…, not just ML algorithms like boosted trees and deep learning.
Automated model selection, parameter optimization, and evaluation.
Painless feature engineering.
Delightful user experience for developers:
Use SFrames to manipulate multiple data types, easily create features with pre-defined transformations or easily create your own.
50+ robust ML algorithms, with out-of-core computation that eliminates scaling pains.
Deploy models as intelligent microservices: easily compose intelligent applications from hosted microservices, using the customer’s own infrastructure or AWS.
Host any model.
Performant for production.
Model experimentation, monitoring and management.