Dato Confidential
Text Analysis with Machine Learning
Presented by Chris DuBois

  1. Text Analysis with Machine Learning
     • Chris DuBois
     • Dato, Inc.
  2. About me
     Chris DuBois
     Staff Data Scientist
     Ph.D. in Statistics
     Previously: probabilistic models of social networks and event data occurring over time.
     Recently: tools for recommender systems, text analysis, feature engineering and model parameter search.
     chris@dato.com
     @chrisdubois
  3. About Dato
     We aim to help developers…
     • develop valuable ML-based applications
     • deploy models as intelligent microservices
  4. Quick poll!
  5. Agenda
     • Applications of text analysis: examples and a demo
     • Fundamentals
     • Text processing
     • Machine learning
     • Task-oriented tools
     • Putting it all together
     • Deploying the pipeline for the application
     • Quick survey of next steps
     • Advanced modeling techniques
  6. Applications of text analysis
  7. Applications of text analysis: Product reviews and comments
  8. Applications of text analysis: User forums and customer service chats
  9. Applications of text analysis: Social media: blogs, tweets, etc.
  10. Applications of text analysis: News feeds
  11. Applications of text analysis: Speech recognition and chat bots
  12. Applications of text analysis: Aspect mining
  13. Quick poll!
  14. Example of a product sentiment application
  15. Text processing fundamentals
  16. Data
      id   timestamp           user_name  text
      102  2016-04-05 9:50:31  Bob        The food tasted bland and uninspired.
      103  2016-04-06 3:20:25  Charles    I would absolutely bring my friends here. I could eat those fries every …
      104  2016-04-08 11:52    Alicia     Too expensive for me. Even the water was too expensive.
      …    …                   …          …
      int  datetime            str        str
  17. Transforming string columns
      id   text
      102  The food tasted bland and uninspired.
      103  I would absolutely bring my friends here. I could eat those fries every …
      104  Too expensive for me. Even the water was too expensive.
      …    …
      int  str
      • Tokenizer
      • CountWords
      • NGramCounter
      • TFIDF
      • RemoveRareWords
      • SentenceSplitter
      • ExtractPartsOfSpeech
      • stack/unstack
      • Pipelines
  18. Tokenization
      Convert each document into a list of tokens.
      INPUT:  “The caesar salad was amazing.”
      OUTPUT: [“The”, “caesar”, “salad”, “was”, “amazing.”]
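A minimal sketch of the tokenization step in plain Python, assuming tokens are separated by whitespace only. That assumption reproduces the slide's output, where the trailing period stays attached to “amazing.”. The `tokenize` helper is hypothetical and is not the toolkit's Tokenizer.

```python
def tokenize(document):
    """Split a document into tokens on runs of whitespace.

    Punctuation is left attached to the neighbouring word.
    """
    return document.split()


tokens = tokenize("The caesar salad was amazing.")
print(tokens)
```

A production tokenizer would usually also separate punctuation and handle contractions.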
  19. Bag of words representation
      Compute the number of times each word occurs.
      INPUT:  “The ice cream was the best.”
      OUTPUT: { “the”: 2, “ice”: 1, “cream”: 1, “was”: 1, “best”: 1 }
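The bag-of-words count can be sketched with the standard library. This assumes words are lowercased and punctuation is dropped, which is what the slide's output suggests; `count_words` is a hypothetical helper, not the toolkit's implementation.

```python
import re
from collections import Counter


def count_words(document):
    """Lowercase the text, keep runs of letters/digits/apostrophes, count each word."""
    words = re.findall(r"[a-z0-9']+", document.lower())
    return Counter(words)


counts = count_words("The ice cream was the best.")
print(dict(counts))
```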
  20. N-Grams (words)
      Compute the number of times each sequence of words occurs.
      INPUT:  “Give it away, give it away, give it away now”
      OUTPUT: { “give it”: 3, “it away”: 3, “away give”: 2, “away now”: 1 }
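A short sketch of word n-gram counting, assuming lowercasing and punctuation removal before sliding a window of `n` words over the text. The `word_ngrams` helper and its default of `n=2` are illustrative choices.

```python
import re
from collections import Counter


def word_ngrams(document, n=2):
    """Count every run of n consecutive words in the document."""
    words = re.findall(r"[a-z0-9']+", document.lower())
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)


bigrams = word_ngrams("Give it away, give it away, give it away now")
print(dict(bigrams))
```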
  21. N-Grams (characters)
      Compute the number of times each sequence of characters occurs.
      INPUT:  “mississippi”
      OUTPUT: { “mis”: 1, “iss”: 2, “sis”: 1, “ssi”: 2, “sip”: 1, ... }
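The character variant works the same way, with a window over characters instead of words. This sketch assumes trigrams (`n=3`), matching the slide's example; `char_ngrams` is a hypothetical name.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count every run of n consecutive characters in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


trigrams = char_ngrams("mississippi")
print(dict(trigrams))
```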
  22. TF-IDF representation
      Rescale word counts to discriminate between common and distinctive words.
      INPUT:  {“this”: 1, “is”: 1, “a”: 2, “example”: 3}
      OUTPUT: {“this”: .3, “is”: .1, “a”: .2, “example”: .9}
      • Low scores for words that are common among all documents
      • High scores for rare words occurring often in a document
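One common way to get this behaviour is to multiply each count by log(N / df), where N is the number of documents and df is the number of documents containing the word. The sketch below assumes that weighting. The slide does not say which variant the toolkit uses, so it will not reproduce the slide's numbers, and the three-document corpus is made up for illustration.

```python
import math


def tf_idf(documents):
    """Rescale word counts by inverse document frequency.

    documents: list of {word: count} dicts (bag-of-words per document).
    Returns a list of {word: score} dicts using count * log(N / df).
    """
    n_docs = len(documents)
    doc_freq = {}
    for counts in documents:
        for word in counts:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return [
        {word: count * math.log(n_docs / doc_freq[word])
         for word, count in counts.items()}
        for counts in documents
    ]


# Hypothetical corpus: "this" appears everywhere, "example" in one document only.
docs = [
    {"this": 1, "is": 1, "a": 2, "example": 3},
    {"this": 1, "is": 1, "another": 1},
    {"this": 2, "was": 1, "a": 1},
]
scores = tf_idf(docs)
print(scores[0])
```

Under this weighting a word that appears in every document scores zero, while a word that is frequent in one document and absent elsewhere scores highest.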
  23. Remove rare words
      Remove rare words (or pre-defined stopwords).
      INPUT:  “I like green eggs3 and ham.”
      OUTPUT: “I like green and ham.”
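A sketch of rare-word trimming, assuming "rare" means a word occurs fewer than `threshold` times across the whole corpus. The threshold, the `normalize` helper, and the two-document corpus are all assumptions made for illustration.

```python
import re
from collections import Counter


def normalize(token):
    """Lowercase a token and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", token.lower())


def remove_rare_words(documents, threshold=2, stopwords=()):
    """Drop tokens seen fewer than `threshold` times corpus-wide, plus stopwords."""
    tokenized = [doc.split() for doc in documents]
    totals = Counter(normalize(tok) for doc in tokenized for tok in doc)
    blocked = set(stopwords)
    cleaned = []
    for doc in tokenized:
        kept = [tok for tok in doc
                if totals[normalize(tok)] >= threshold
                and normalize(tok) not in blocked]
        cleaned.append(" ".join(kept))
    return cleaned


# Hypothetical corpus: "eggs3" and "eggs" each occur only once.
corpus = ["I like green eggs3 and ham.", "I like green eggs and ham."]
cleaned = remove_rare_words(corpus)
print(cleaned[0])
```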
  24. Split by sentence
      Convert each document into a list of sentences.
      INPUT:  “It was delicious. I will be back.”
      OUTPUT: [“It was delicious.”, “I will be back.”]
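A naive sentence splitter can be sketched with a regular expression, assuming a sentence ends at `.`, `!` or `?` followed by whitespace. This is an illustration only: abbreviations such as "Dr." would be split incorrectly, and the toolkit's splitter may work differently.

```python
import re


def split_sentences(document):
    """Split on whitespace that follows sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [part for part in parts if part]


sentences = split_sentences("It was delicious. I will be back.")
print(sentences)
```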
  25. Extract parts of speech
      Identify particular parts of speech.
      INPUT:  “It was delicious. I will go back.”
      OUTPUT: [“delicious”]
  26. stack
      Rearrange the data set to “stack” the list column.
      Before:
      id   text
      102  [“The food was great.”, “I will be back.”]
      103  [“I would absolutely bring my friends here.”, “I could eat those fries every day.”]
      104  [“Too expensive for me.”]
      …    …
      int  list
      After:
      id   text
      102  “The food was great.”
      102  “I will be back.”
      103  “I would absolutely bring my friends here.”
      …    …
      int  str
  27. unstack
      Rearrange the data set to “unstack” the string column.
      Before:
      id   text
      102  “The food was great.”
      102  “I will be back.”
      103  “I would absolutely bring my friends here.”
      …    …
      int  str
      After:
      id   text
      102  [“The food was great.”, “I will be back.”]
      103  [“I would absolutely bring my friends here.”, “I could eat those fries every day.”]
      104  [“Too expensive for me.”]
      …    …
      int  list
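The stack and unstack operations on slides 26 and 27 are inverses of each other. A minimal sketch on plain lists of dicts follows; the `stack` and `unstack` helpers are hypothetical stand-ins for illustration, not the SFrame methods themselves.

```python
def stack(rows, column):
    """Emit one output row per element of the list stored in `column`."""
    stacked = []
    for row in rows:
        for item in row[column]:
            new_row = dict(row)
            new_row[column] = item
            stacked.append(new_row)
    return stacked


def unstack(rows, key, column):
    """Group rows by `key` and collect the values of `column` into a list."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[key], []).append(row[column])
    return [{key: k, column: values} for k, values in grouped.items()]


wide_rows = [
    {"id": 102, "text": ["The food was great.", "I will be back."]},
    {"id": 104, "text": ["Too expensive for me."]},
]
long_rows = stack(wide_rows, "text")
print(long_rows)
print(unstack(long_rows, "id", "text"))
```

Stacking sentence lists this way gives one row per sentence, which is the shape a per-sentence model needs. Unstacking restores one row per document.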
  28. Pipelines
      Combine multiple transformations to create a single pipeline.
          import graphlab as gl
          from graphlab.feature_engineering import WordCounter, RareWordTrimmer
          f = gl.feature_engineering.create(sf, [WordCounter, RareWordTrimmer])
          f.fit(data)
          f.transform(new_data)
  29. Machine learning toolkits
  30. Task-oriented toolkits and building blocks
      Autotagger matches unstructured text queries to a set of strings.
      Autotagger = Nearest neighbors + NGramCounter
  31. Nearest neighbors
      Reference:  id, TF-IDF of text
      Query:      id, TF-IDF of text (column types: int, dict)
      Model:      NearestNeighborsModel
      Result:     query_id (int), reference_id (int), score (float), rank (int)
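The nearest-neighbors step can be sketched as a brute-force search over sparse dict vectors. This assumes cosine distance as the score, so lower means closer. The slide does not name the distance, and the ids and TF-IDF weights below are made up for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbors(reference, queries, k=1):
    """Return the k closest reference rows per query, ranked from 1."""
    results = []
    for query_id, query_vec in queries.items():
        scored = sorted(
            (cosine_distance(query_vec, ref_vec), ref_id)
            for ref_id, ref_vec in reference.items()
        )
        for rank, (score, ref_id) in enumerate(scored[:k], start=1):
            results.append({"query_id": query_id, "reference_id": ref_id,
                            "score": score, "rank": rank})
    return results


# Hypothetical TF-IDF vectors keyed by id.
reference = {
    1: {"food": 1.2, "bland": 2.0},
    2: {"water": 1.5, "expensive": 2.3},
}
queries = {10: {"expensive": 1.0, "food": 0.2}}
rows = nearest_neighbors(reference, queries, k=2)
for row in rows:
    print(row)
```

The result rows have the same four columns as the slide's Result table.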
  32. Autotagger = Nearest Neighbors + Feature Engineering
      Reference:  id, char ngrams, …
      Query:      id, char ngrams, … (column types: int, dict)
      Model:      AutotaggerModel
      Result:     query_id (int), reference_id (int), score (float), rank (int)
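Combining the two building blocks gives a rough picture of how such a tagger could work: turn both the tags and the queries into character n-grams, then pick the closest tag. The sketch assumes character trigrams and Jaccard similarity, both chosen here for illustration. The tag list and queries are made up, and this is not the AutotaggerModel implementation.

```python
def char_ngram_set(text, n=3):
    """Set of all character n-grams in the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def autotag(queries, tags, n=3):
    """Map each query to the tag whose n-gram set overlaps it most."""
    tag_grams = {tag: char_ngram_set(tag, n) for tag in tags}
    matches = {}
    for query in queries:
        grams = char_ngram_set(query, n)
        matches[query] = max(
            tags, key=lambda tag: jaccard_similarity(grams, tag_grams[tag]))
    return matches


# Hypothetical tags and queries; note the misspelling "ceasar".
tags = ["caesar salad", "ice cream", "french fries"]
queries = ["the ceasar salad was amazing", "I could eat those fries every day"]
matches = autotag(queries, tags)
print(matches)
```

Character n-grams make the match tolerant of small spelling errors, because a misspelled word still shares most of its n-grams with the intended tag.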
  33. Task-oriented toolkits and building blocks
      SentimentAnalysis predicts positive/negative sentiment in unstructured text.
      SentimentAnalysis = WordCounter + Logistic Classifier
  34. Logistic classifier
      Training data:  label, feature(s)
      Model:          LogisticClassifier
      New data:       label, feature(s) (column types: int/str, int/float/str/dict/list)
      Output:         scores (float): probability that label = 1
  35. Sentiment analysis = LogisticClassifier + Feature Engineering
      Training data:  label, bag of words
      Model:          SentimentAnalysis
      New data:       label, bag of words (column types: int/str, dict)
      Output:         scores (float): sentiment scores (probability that label = 1)
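Slides 34 and 35 describe a logistic classifier fed with bag-of-words features. A toy version can be written in plain Python, assuming stochastic gradient descent on the log loss with no regularization. The four training sentences, the learning rate, and the epoch count are arbitrary illustrative choices, not the toolkit's settings.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts, used as the feature dict."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Probability that label = 1 given a {word: count} feature dict."""
    z = bias + sum(weights.get(w, 0.0) * c for w, c in features.items())
    return sigmoid(z)


def train(examples, epochs=200, learning_rate=0.5):
    """examples: list of (text, label) pairs with label 1 = positive, 0 = negative."""
    weights, bias = {}, 0.0
    data = [(bag_of_words(text), label) for text, label in examples]
    for _ in range(epochs):
        for features, label in data:
            error = predict(weights, bias, features) - label
            bias -= learning_rate * error
            for word, count in features.items():
                weights[word] = weights.get(word, 0.0) - learning_rate * error * count
    return weights, bias


# Hypothetical labelled sentences in the spirit of the earlier slides.
training = [
    ("The food was great", 1),
    ("I would absolutely bring my friends here", 1),
    ("The food tasted bland and uninspired", 0),
    ("Too expensive for me", 0),
]
weights, bias = train(training)
positive_score = predict(weights, bias, bag_of_words("great food"))
negative_score = predict(weights, bias, bag_of_words("bland and expensive"))
print(positive_score, negative_score)
```

The first score should come out higher than the second, because "great" only appears in positive examples and "bland" and "expensive" only in negative ones.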
  36. Product sentiment toolkit
      >>> m = gl.product_sentiment.create(sf, features=['Description'], splitby='sentence')
      >>> m.sentiment_summary(['billing', 'cable', 'cost', 'late', 'charges', 'slow'])
      +---------+----------------+-----------------+--------------+
      | keyword | sd_sentiment   | mean_sentiment  | review_count |
      +---------+----------------+-----------------+--------------+
      | cable   | 0.302471264675 | 0.285512408978  | 1618         |
      | slow    | 0.282117103769 | 0.243490314737  | 369          |
      | cost    | 0.283310577512 | 0.197087019219  | 291          |
      | charges | 0.164350792173 | 0.0853637431588 | 1412         |
      | late    | 0.119163914305 | 0.0712757752753 | 2202         |
      | billing | 0.159655783707 | 0.0697454360245 | 583          |
      +---------+----------------+-----------------+--------------+
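A summary table of this shape can be sketched by grouping per-sentence sentiment scores by keyword. This assumes the score of every sentence that mentions the keyword is aggregated, and that `review_count` is the number of matching sentences. How the toolkit actually counts is not stated on the slide. The helper below and its three scored sentences are hypothetical.

```python
import re
from statistics import mean, pstdev


def sentiment_summary(scored_sentences, keywords):
    """Aggregate sentence scores per keyword, sorted by mean sentiment (descending).

    scored_sentences: list of (sentence, score) pairs.
    """
    rows = []
    for keyword in keywords:
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        scores = [score for sentence, score in scored_sentences
                  if pattern.search(sentence)]
        if not scores:
            continue  # keyword never mentioned
        rows.append({
            "keyword": keyword,
            "sd_sentiment": pstdev(scores),
            "mean_sentiment": mean(scores),
            "review_count": len(scores),
        })
    return sorted(rows, key=lambda row: row["mean_sentiment"], reverse=True)


# Hypothetical sentence-level scores.
scored = [
    ("The cable box works well.", 0.9),
    ("Cable was out again.", 0.3),
    ("Billing is a mess.", 0.1),
]
summary = sentiment_summary(scored, ["billing", "cable", "slow"])
for row in summary:
    print(row)
```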
  37. Code for deploying the demo
  38. Advanced topics in modeling text data
  39. Topic models
      Cluster words according to their co-occurrence in documents.
  40. Word2vec
      Embeddings where nearby words are used in similar contexts.
      Image: https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html
  41. LSTM (Long Short-Term Memory)
      Train a neural network to predict the next word given previous words.
      Image: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  42. Conclusion
      • Applications of text analysis: examples and a demo
      • Fundamentals
      • Text processing
      • Machine learning
      • Task-oriented tools
      • Putting it all together
      • Deploying the pipeline for the application
      • Quick survey of next steps
      • Advanced modeling techniques
  43. Next steps
      Webinar material? Watch for an upcoming email!
      Help getting started? Userguide, API docs, Forum
      More webinars? Benchmarking ML, data mining, and more
      Contact: Chris DuBois, chris@dato.com