SlideShare a Scribd company logo
1 of 43
Dato Confidential
Text Analysis with Machine Learning
1
• Chris DuBois
• Dato, Inc.
Dato Confidential
About me
Chris DuBois
Staff Data Scientist
Ph.D. in Statistics
Previously: probabilistic models of social networks
and event data occurring over time.
Recently: tools for recommender systems, text
analysis, feature engineering and model parameter
search.
2
chris@dato.com
@chrisdubois
Dato Confidential
About Dato
3
We aim to help developers…
• develop valuable ML-based applications
• deploy models as intelligent microservices
Dato Confidential
Quick poll!
4
Dato Confidential
Agenda
• Applications of text analysis: examples and a demo
• Fundamentals
• Text processing
• Machine learning
• Task-oriented tools
• Putting it all together
• Deploying the pipeline for the application
• Quick survey of next steps
• Advanced modeling techniques
5
Dato Confidential
Applications of text analysis
6
Dato Confidential
Product reviews and comments
7
Applications of text analysis
Dato Confidential
User forums and customer service chats
8
Applications of text analysis
Dato Confidential
Social media: blogs, tweets, etc.
9
Applications of text analysis
Dato Confidential
News feeds
10
Applications of text analysis
Dato Confidential
Speech recognition and chat bots
11
Applications of text analysis
Dato Confidential
Aspect mining
12
Applications of text analysis
Dato Confidential
Quick poll!
13
Dato Confidential
Example of a product sentiment application
14
Dato Confidential
Text processing fundamentals
15
Dato Confidential16
Data
id timestamp user_name text
102 2016-04-05 9:50:31 Bob The food tasted bland
and uninspired.
103 2016-04-06 3:20:25 Charles I would absolutely bring
my friends here. I could
eat those fries every …
104 2016-04-08 11:52 Alicia Too expensive for me.
Even the water was too
expensive.
… … … …
int datetime str str
Dato Confidential17
Transforming string columns
id text
102 The food tasted bland and uninspired.
103 I would absolutely bring my friends here. I
could eat those fries every …
104 Too expensive for me. Even the water was
too expensive.
… …
int str
Tokenizer
CountWords
NGramCounter
TFIDF
RemoveRareWords
SentenceSplitter
ExtractPartsOfSpeech
stack/unstack
Pipelines
Dato Confidential18
Tokenization
Convert each document into a list of tokens.
“The caesar salad was amazing.”INPUT
[“The”, “caesar”, “salad”, “was”, “amazing.”]OUTPUT
Dato Confidential19
Bag of words representation
Compute the number of times each word occurs.
“The ice cream was the best.”
{
“the”: 2,
“ice”: 1,
“cream”: 1,
“was”: 1,
“best”: 1
}
INPUT
OUTPUT
Dato Confidential20
N-Grams (words)
Compute the number of times each set of words occur.
“Give it away, give it away, give it away now”
{
“give it”: 3,
“it away”: 3,
“away give”: 2,
“away now”: 1
}
INPUT
OUTPUT
Dato Confidential21
N-Grams (characters)
Compute the number of times each set of characters occur.
“mississippi”
{
“mis”: 1,
“iss”: 2,
“sis”: 1,
“ssi”: 2,
“sip”: 1,
...
}
INPUT
OUTPUT
Dato Confidential22
TF-IDF representation
Rescale word counts to discriminate between common and
distinctive words.
{“this”: 1, “is”: 1, “a”: 2, “example”: 3}
{“this”: .3, “is”: .1, “a”: .2, “example”: .9}
INPUT
OUTPUT
Low scores for words that are common among all documents
High scores for rare words occurring often in a document
Dato Confidential23
Remove rare words
Remove rare words (or pre-defined stopwords).
INPUT
OUTPUT
“I like green eggs3 and ham.”
“I like green and ham.”
Dato Confidential24
Split by sentence
Convert each document into a list of sentences.
INPUT
OUTPUT
“It was delicious. I will be back.”
[“It was delicious.”, “I will be back.”]
Dato Confidential25
Extract parts of speech
Identify particular parts of speech.
INPUT
OUTPUT
“It was delicious. I will go back.”
[“delicious”]
Dato Confidential26
stack
Rearrange the data set to “stack” the list column.
id text
102 [“The food was great.”, “I will be
back.”]
103 [“I would absolutely bring my
friends here.”, “I could eat those
fries every day.”]
104 [“Too expensive for me.”]
… …
int list
id text
102 “The food was great.”
102 “I will be back.”
103 “I would absolutely bring my friends
here.”
… …
int str
Dato Confidential27
unstack
Rearrange the data set to “unstack” the string column.
id text
102 [“The food was great.”, “I will be
back.”]
103 [“I would absolutely bring my
friends here.”, “I could eat those
fries every day.”]
104 [“Too expensive for me.”]
… …
int list
id text
102 “The food was great.”
102 “I will be back.”
103 “I would absolutely bring my friends
here.”
… …
int str
Dato Confidential28
Pipelines
Combine multiple transformations to create a single pipeline.
from graphlab.feature_engineering import
WordCounter, RareWordTrimmer
f = gl.feature_engineering.create(sf,
[WordCounter, RareWordTrimmer])
f.fit(data)
f.transform(new_data)
Dato Confidential
Machine learning toolkits
29
Dato Confidential
matches unstructured
text queries to a set of
strings
30
Task-oriented toolkits and building blocks
Autotagger =
= Nearest neighbors +
NGramCounter
Dato Confidential31
Nearest neighbors
id TF-IDF of text
Reference
NearestNeighborsModel
Query
id TF-IDF of text
int dict
query_id reference_id score rank
int int float int
Result
Dato Confidential32
Autotagger = Nearest Neighbors + Feature Engineering
id char
ngrams
…
Reference
Result
AutotaggerModel
Query
id char ngrams …
int dict
query_id reference_id score rank
int int float int
Dato Confidential33
Task-oriented toolkits and building blocks
SentimentAnalysis predicts
positive/negative
sentiment in
unstructured text
=
WordCounter +
Logistic Classifier
=
Dato Confidential34
Logistic classifier
label feature(s)
Training data
LogisticClassifier
label feature(s)
int/str int/float/str/dict/list
New data
scores
float
Probability label=1
Dato Confidential35
Sentiment analysis = LogisticClassifier + Feature Engineering
label bag of words
Training data
SentimentAnalysis
label bag of words
int/str dict
New data
scores
float
Sentiment scores
(probability label = 1)
Dato Confidential36
Product sentiment toolkit
>>> m = gl.product_sentiment.create(sf, features=['Description'], splitby='sentence')
>>> m.sentiment_summary(['billing', 'cable', 'cost', 'late', 'charges', 'slow'])
+---------+----------------+-----------------+--------------+
| keyword | sd_sentiment | mean_sentiment | review_count |
+---------+----------------+-----------------+--------------+
| cable | 0.302471264675 | 0.285512408978 | 1618 |
| slow | 0.282117103769 | 0.243490314737 | 369 |
| cost | 0.283310577512 | 0.197087019219 | 291 |
| charges | 0.164350792173 | 0.0853637431588 | 1412 |
| late | 0.119163914305 | 0.0712757752753 | 2202 |
| billing | 0.159655783707 | 0.0697454360245 | 583 |
+---------+----------------+-----------------+--------------+
Dato Confidential
Code for deploying the demo
37
Dato Confidential
Advanced topics in modeling text data
38
Dato Confidential39
Topic models
Cluster words according to their co-occurrence in documents.
Dato Confidential40
Word2vec
Embeddings where nearby words are used in similar contexts.
Image: https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html
Dato Confidential41
LSTM (Long-Short Term Memory)
Image: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Train a neural network to predict the next word given
previous words.
Dato Confidential
Conclusion
• Applications of text analysis: examples and a demo
• Fundamentals
• Text processing
• Machine learning
• Task-oriented tools
• Putting it all together
• Deploying the pipeline for the application
• Quick survey of next steps
• Advanced modeling techniques
42
Dato Confidential
Next steps
Webinar material? Watch for an upcoming email!
Help getting started? Userguide, API docs, Forum
More webinars? Benchmarking ML, data mining, and more
Contact: Chris DuBois, chris@dato.com
43

More Related Content

What's hot

K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...Simplilearn
 
Nlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesNlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesankit_ppt
 
Lecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdfLecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdfDeptii Chaudhari
 
Word embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTMWord embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTMDivya Gera
 
DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)Cory Cook
 
Design and Analysis of Algorithms
Design and Analysis of AlgorithmsDesign and Analysis of Algorithms
Design and Analysis of AlgorithmsSwapnil Agrawal
 
Text clustering
Text clusteringText clustering
Text clusteringKU Leuven
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsDerek Kane
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...Sebastian Raschka
 
Performance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsPerformance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsKush Kulshrestha
 
Notes on attention mechanism
Notes on attention mechanismNotes on attention mechanism
Notes on attention mechanismKhang Pham
 
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Edureka!
 
Decision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data scienceDecision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data scienceMaryamRehman6
 
Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da...
Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da...Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da...
Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da...Simplilearn
 
Exploratory Data Analysis using Python
Exploratory Data Analysis using PythonExploratory Data Analysis using Python
Exploratory Data Analysis using PythonShirin Mojarad, Ph.D.
 
OPTIMAL BINARY SEARCH
OPTIMAL BINARY SEARCHOPTIMAL BINARY SEARCH
OPTIMAL BINARY SEARCHCool Guy
 
closure properties of regular language.pptx
closure properties of regular language.pptxclosure properties of regular language.pptx
closure properties of regular language.pptxThirumoorthy64
 

What's hot (20)

K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
 
Nlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesNlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniques
 
Lecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdfLecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdf
 
Word embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTMWord embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTM
 
DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)
 
Design and Analysis of Algorithms
Design and Analysis of AlgorithmsDesign and Analysis of Algorithms
Design and Analysis of Algorithms
 
Topic Models
Topic ModelsTopic Models
Topic Models
 
Text clustering
Text clusteringText clustering
Text clustering
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
 
Performance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsPerformance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning Algorithms
 
Notes on attention mechanism
Notes on attention mechanismNotes on attention mechanism
Notes on attention mechanism
 
Gradient Boosting
Gradient BoostingGradient Boosting
Gradient Boosting
 
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
 
Decision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data scienceDecision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data science
 
Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da...
Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da...Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da...
Decision Tree Algorithm With Example | Decision Tree In Machine Learning | Da...
 
Exploratory Data Analysis using Python
Exploratory Data Analysis using PythonExploratory Data Analysis using Python
Exploratory Data Analysis using Python
 
OPTIMAL BINARY SEARCH
OPTIMAL BINARY SEARCHOPTIMAL BINARY SEARCH
OPTIMAL BINARY SEARCH
 
Lstm
LstmLstm
Lstm
 
closure properties of regular language.pptx
closure properties of regular language.pptxclosure properties of regular language.pptx
closure properties of regular language.pptx
 

Viewers also liked

Intelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning ToolkitsIntelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning ToolkitsTuri, Inc.
 
Machine Learning on Big Data
Machine Learning on Big DataMachine Learning on Big Data
Machine Learning on Big DataMax Lin
 
Presentacion de educación en México
Presentacion de educación en MéxicoPresentacion de educación en México
Presentacion de educación en Méxicombrionessauceda
 
Quantifying the Performance of Garbage Collection vs. Explicit Memory Management
Quantifying the Performance of Garbage Collection vs. Explicit Memory ManagementQuantifying the Performance of Garbage Collection vs. Explicit Memory Management
Quantifying the Performance of Garbage Collection vs. Explicit Memory ManagementEmery Berger
 
Empowerment Awareness
Empowerment AwarenessEmpowerment Awareness
Empowerment Awarenessaltonbaird
 
Teoriaelctromagneticanocambio
TeoriaelctromagneticanocambioTeoriaelctromagneticanocambio
TeoriaelctromagneticanocambioRnr Navarro R
 
C. V Hanaa Ahmed
C. V Hanaa AhmedC. V Hanaa Ahmed
C. V Hanaa AhmedHanaa Ahmed
 
Η Σπάρτη
Η ΣπάρτηΗ Σπάρτη
Η Σπάρτηvasso76
 
LISTADO OPOSICION + MERITOS
LISTADO OPOSICION + MERITOSLISTADO OPOSICION + MERITOS
LISTADO OPOSICION + MERITOSguest0dbad523
 

Viewers also liked (17)

Intelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning ToolkitsIntelligent Applications with Machine Learning Toolkits
Intelligent Applications with Machine Learning Toolkits
 
Machine Learning on Big Data
Machine Learning on Big DataMachine Learning on Big Data
Machine Learning on Big Data
 
Presentacion de educación en México
Presentacion de educación en MéxicoPresentacion de educación en México
Presentacion de educación en México
 
LinkedIn for Beginners
LinkedIn for BeginnersLinkedIn for Beginners
LinkedIn for Beginners
 
Goal Centre e-bulletin Feb 2015
Goal Centre e-bulletin Feb 2015Goal Centre e-bulletin Feb 2015
Goal Centre e-bulletin Feb 2015
 
Quantifying the Performance of Garbage Collection vs. Explicit Memory Management
Quantifying the Performance of Garbage Collection vs. Explicit Memory ManagementQuantifying the Performance of Garbage Collection vs. Explicit Memory Management
Quantifying the Performance of Garbage Collection vs. Explicit Memory Management
 
Empowerment Awareness
Empowerment AwarenessEmpowerment Awareness
Empowerment Awareness
 
Ehealthhoy
EhealthhoyEhealthhoy
Ehealthhoy
 
Teoriaelctromagneticanocambio
TeoriaelctromagneticanocambioTeoriaelctromagneticanocambio
Teoriaelctromagneticanocambio
 
C. V Hanaa Ahmed
C. V Hanaa AhmedC. V Hanaa Ahmed
C. V Hanaa Ahmed
 
Η Σπάρτη
Η ΣπάρτηΗ Σπάρτη
Η Σπάρτη
 
Trabajo
TrabajoTrabajo
Trabajo
 
Cover Diari de Girona
Cover Diari de GironaCover Diari de Girona
Cover Diari de Girona
 
Taller de primer semestre
Taller de primer semestreTaller de primer semestre
Taller de primer semestre
 
certificate
certificatecertificate
certificate
 
LISTADO OPOSICION + MERITOS
LISTADO OPOSICION + MERITOSLISTADO OPOSICION + MERITOS
LISTADO OPOSICION + MERITOS
 
Publicidad
PublicidadPublicidad
Publicidad
 

Similar to Text Analysis with Machine Learning

Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
Mining the social web ch1
Mining the social web ch1Mining the social web ch1
Mining the social web ch1HyeonSeok Choi
 
It's Not You. It's Your Data Model.
It's Not You. It's Your Data Model.It's Not You. It's Your Data Model.
It's Not You. It's Your Data Model.Alex Powers
 
Beyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeBeyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeWim Godden
 
Xomia_20220602.pptx
Xomia_20220602.pptxXomia_20220602.pptx
Xomia_20220602.pptxLonghow Lam
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
Introduction to Search Systems - ScaleConf Colombia 2017
Introduction to Search Systems - ScaleConf Colombia 2017Introduction to Search Systems - ScaleConf Colombia 2017
Introduction to Search Systems - ScaleConf Colombia 2017Toria Gibbs
 
Redis for your boss 2.0
Redis for your boss 2.0Redis for your boss 2.0
Redis for your boss 2.0Elena Kolevska
 
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of WranglingPLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of WranglingPlotly
 
Amir Salihefendic: Redis - the hacker's database
Amir Salihefendic: Redis - the hacker's databaseAmir Salihefendic: Redis - the hacker's database
Amir Salihefendic: Redis - the hacker's databaseit-people
 
Powershell for Log Analysis and Data Crunching
 Powershell for Log Analysis and Data Crunching Powershell for Log Analysis and Data Crunching
Powershell for Log Analysis and Data CrunchingMichelle D'israeli
 
"R & Text Analytics" (15 January 2013)
"R & Text Analytics" (15 January 2013)"R & Text Analytics" (15 January 2013)
"R & Text Analytics" (15 January 2013)Portland R User Group
 
從行動廣告大數據觀點談 Big data 20150916
從行動廣告大數據觀點談 Big data   20150916從行動廣告大數據觀點談 Big data   20150916
從行動廣告大數據觀點談 Big data 20150916Craig Chao
 
Data manipulation and visualization in r 20190711 myanmarucsy
Data manipulation and visualization in r 20190711 myanmarucsyData manipulation and visualization in r 20190711 myanmarucsy
Data manipulation and visualization in r 20190711 myanmarucsySmartHinJ
 
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter AnalysisIBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter AnalysisTorsten Steinbach
 
Unleashing twitter data for fun and insight
Unleashing twitter data for fun and insightUnleashing twitter data for fun and insight
Unleashing twitter data for fun and insightDigital Reasoning
 
Unleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and InsightUnleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and InsightMatthew Russell
 
Happy Go Programming
Happy Go ProgrammingHappy Go Programming
Happy Go ProgrammingLin Yo-An
 

Similar to Text Analysis with Machine Learning (20)

Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
Mining the social web ch1
Mining the social web ch1Mining the social web ch1
Mining the social web ch1
 
It's Not You. It's Your Data Model.
It's Not You. It's Your Data Model.It's Not You. It's Your Data Model.
It's Not You. It's Your Data Model.
 
Beyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeBeyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the code
 
Xomia_20220602.pptx
Xomia_20220602.pptxXomia_20220602.pptx
Xomia_20220602.pptx
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
Introduction to Search Systems - ScaleConf Colombia 2017
Introduction to Search Systems - ScaleConf Colombia 2017Introduction to Search Systems - ScaleConf Colombia 2017
Introduction to Search Systems - ScaleConf Colombia 2017
 
Redis for your boss 2.0
Redis for your boss 2.0Redis for your boss 2.0
Redis for your boss 2.0
 
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of WranglingPLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
 
Amir Salihefendic: Redis - the hacker's database
Amir Salihefendic: Redis - the hacker's databaseAmir Salihefendic: Redis - the hacker's database
Amir Salihefendic: Redis - the hacker's database
 
Powershell for Log Analysis and Data Crunching
 Powershell for Log Analysis and Data Crunching Powershell for Log Analysis and Data Crunching
Powershell for Log Analysis and Data Crunching
 
"R & Text Analytics" (15 January 2013)
"R & Text Analytics" (15 January 2013)"R & Text Analytics" (15 January 2013)
"R & Text Analytics" (15 January 2013)
 
從行動廣告大數據觀點談 Big data 20150916
從行動廣告大數據觀點談 Big data   20150916從行動廣告大數據觀點談 Big data   20150916
從行動廣告大數據觀點談 Big data 20150916
 
Data manipulation and visualization in r 20190711 myanmarucsy
Data manipulation and visualization in r 20190711 myanmarucsyData manipulation and visualization in r 20190711 myanmarucsy
Data manipulation and visualization in r 20190711 myanmarucsy
 
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter AnalysisIBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
 
Unleashing twitter data for fun and insight
Unleashing twitter data for fun and insightUnleashing twitter data for fun and insight
Unleashing twitter data for fun and insight
 
Unleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and InsightUnleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and Insight
 
Happy Go Programming
Happy Go ProgrammingHappy Go Programming
Happy Go Programming
 
Ggplot2 v3
Ggplot2 v3Ggplot2 v3
Ggplot2 v3
 

More from Turi, Inc.

Webinar - Analyzing Video
Webinar - Analyzing VideoWebinar - Analyzing Video
Webinar - Analyzing VideoTuri, Inc.
 
Webinar - Patient Readmission Risk
Webinar - Patient Readmission RiskWebinar - Patient Readmission Risk
Webinar - Patient Readmission RiskTuri, Inc.
 
Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Turi, Inc.
 
Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)Turi, Inc.
 
Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)Turi, Inc.
 
Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)Turi, Inc.
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsTuri, Inc.
 
Pattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log DataPattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log DataTuri, Inc.
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab CreateTuri, Inc.
 
Machine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive ServicesMachine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive ServicesTuri, Inc.
 
Machine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos GuestrinMachine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos GuestrinTuri, Inc.
 
Scalable data structures for data science
Scalable data structures for data scienceScalable data structures for data science
Scalable data structures for data scienceTuri, Inc.
 
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Turi, Inc.
 
Introduction to Recommender Systems
Introduction to Recommender SystemsIntroduction to Recommender Systems
Introduction to Recommender SystemsTuri, Inc.
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in productionTuri, Inc.
 
Overview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringOverview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringTuri, Inc.
 
Building Personalized Data Products with Dato
Building Personalized Data Products with DatoBuilding Personalized Data Products with Dato
Building Personalized Data Products with DatoTuri, Inc.
 
Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Turi, Inc.
 
Towards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTowards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTuri, Inc.
 

More from Turi, Inc. (20)

Webinar - Analyzing Video
Webinar - Analyzing VideoWebinar - Analyzing Video
Webinar - Analyzing Video
 
Webinar - Patient Readmission Risk
Webinar - Patient Readmission RiskWebinar - Patient Readmission Risk
Webinar - Patient Readmission Risk
 
Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)Webinar - Know Your Customer - Arya (20160526)
Webinar - Know Your Customer - Arya (20160526)
 
Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)Webinar - Product Matching - Palombo (20160428)
Webinar - Product Matching - Palombo (20160428)
 
Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)Webinar - Pattern Mining Log Data - Vega (20160426)
Webinar - Pattern Mining Log Data - Vega (20160426)
 
Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)Webinar - Fraud Detection - Palombo (20160428)
Webinar - Fraud Detection - Palombo (20160428)
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
 
Pattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log DataPattern Mining: Extracting Value from Log Data
Pattern Mining: Extracting Value from Log Data
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab Create
 
Machine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive ServicesMachine Learning in Production with Dato Predictive Services
Machine Learning in Production with Dato Predictive Services
 
Machine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos GuestrinMachine Learning in 2016: Live Q&A with Carlos Guestrin
Machine Learning in 2016: Live Q&A with Carlos Guestrin
 
Scalable data structures for data science
Scalable data structures for data scienceScalable data structures for data science
Scalable data structures for data science
 
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
 
Introduction to Recommender Systems
Introduction to Recommender SystemsIntroduction to Recommender Systems
Introduction to Recommender Systems
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in production
 
Overview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringOverview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature Engineering
 
SFrame
SFrameSFrame
SFrame
 
Building Personalized Data Products with Dato
Building Personalized Data Products with DatoBuilding Personalized Data Products with Dato
Building Personalized Data Products with Dato
 
Getting Started With Dato - August 2015
Getting Started With Dato - August 2015Getting Started With Dato - August 2015
Getting Started With Dato - August 2015
 
Towards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTowards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning Benchmark
 

Recently uploaded

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 

Recently uploaded (20)

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Text Analysis with Machine Learning

  • 1. Dato Confidential Text Analysis with Machine Learning 1 • Chris DuBois • Dato, Inc.
  • 2. Dato Confidential About me Chris DuBois Staff Data Scientist Ph.D. in Statistics Previously: probabilistic models of social networks and event data occurring over time. Recently: tools for recommender systems, text analysis, feature engineering and model parameter search. 2 chris@dato.com @chrisdubois
  • 3. Dato Confidential About Dato 3 We aim to help developers… • develop valuable ML-based applications • deploy models as intelligent microservices
  • 5. Dato Confidential Agenda • Applications of text analysis: examples and a demo • Fundamentals • Text processing • Machine learning • Task-oriented tools • Putting it all together • Deploying the pipeline for the application • Quick survey of next steps • Advanced modeling techniques 5
  • 7. Dato Confidential Product reviews and comments 7 Applications of text analysis
  • 8. Dato Confidential User forums and customer service chats 8 Applications of text analysis
  • 9. Dato Confidential Social media: blogs, tweets, etc. 9 Applications of text analysis
  • 11. Dato Confidential Speech recognition and chat bots 11 Applications of text analysis
  • 14. Dato Confidential Example of a product sentiment application 14
  • 16. Dato Confidential16 Data id timestamp user_name text 102 2016-04-05 9:50:31 Bob The food tasted bland and uninspired. 103 2016-04-06 3:20:25 Charles I would absolutely bring my friends here. I could eat those fries every … 104 2016-04-08 11:52 Alicia Too expensive for me. Even the water was too expensive. … … … … int datetime str str
  • 17. Dato Confidential17 Transforming string columns id text 102 The food tasted bland and uninspired. 103 I would absolutely bring my friends here. I could eat those fries every … 104 Too expensive for me. Even the water was too expensive. … … int str Tokenizer CountWords NGramCounter TFIDF RemoveRareWords SentenceSplitter ExtractPartsOfSpeech stack/unstack Pipelines
  • 18. Dato Confidential18 Tokenization Convert each document into a list of tokens. “The caesar salad was amazing.”INPUT [“The”, “caesar”, “salad”, “was”, “amazing.”]OUTPUT
  • 19. Dato Confidential19 Bag of words representation Compute the number of times each word occurs. “The ice cream was the best.” { “the”: 2, “ice”: 1, “cream”: 1, “was”: 1, “best”: 1 } INPUT OUTPUT
  • 20. Dato Confidential20 N-Grams (words) Compute the number of times each set of words occur. “Give it away, give it away, give it away now” { “give it”: 3, “it away”: 3, “away give”: 2, “away now”: 1 } INPUT OUTPUT
  • 21. Dato Confidential21 N-Grams (characters) Compute the number of times each set of characters occur. “mississippi” { “mis”: 1, “iss”: 2, “sis”: 1, “ssi”: 2, “sip”: 1, ... } INPUT OUTPUT
  • 22. Dato Confidential22 TF-IDF representation Rescale word counts to discriminate between common and distinctive words. {“this”: 1, “is”: 1, “a”: 2, “example”: 3} {“this”: .3, “is”: .1, “a”: .2, “example”: .9} INPUT OUTPUT Low scores for words that are common among all documents High scores for rare words occurring often in a document
  • 23. Dato Confidential23 Remove rare words Remove rare words (or pre-defined stopwords). INPUT OUTPUT “I like green eggs3 and ham.” “I like green and ham.”
  • 24. Dato Confidential24 Split by sentence Convert each document into a list of sentences. INPUT OUTPUT “It was delicious. I will be back.” [“It was delicious.”, “I will be back.”]
  • 25. Dato Confidential25 Extract parts of speech Identify particular parts of speech. INPUT OUTPUT “It was delicious. I will go back.” [“delicious”]
  • 26. Dato Confidential26 stack Rearrange the data set to “stack” the list column. id text 102 [“The food was great.”, “I will be back.”] 103 [“I would absolutely bring my friends here.”, “I could eat those fries every day.”] 104 [“Too expensive for me.”] … … int list id text 102 “The food was great.” 102 “I will be back.” 103 “I would absolutely bring my friends here.” … … int str
  • 27. Dato Confidential27 unstack Rearrange the data set to “unstack” the string column. id text 102 [“The food was great.”, “I will be back.”] 103 [“I would absolutely bring my friends here.”, “I could eat those fries every day.”] 104 [“Too expensive for me.”] … … int list id text 102 “The food was great.” 102 “I will be back.” 103 “I would absolutely bring my friends here.” … … int str
  • 28. Dato Confidential28 Pipelines Combine multiple transformations to create a single pipeline. from graphlab.feature_engineering import WordCounter, RareWordTrimmer f = gl.feature_engineering.create(sf, [WordCounter, RareWordTrimmer]) f.fit(data) f.transform(new_data)
  • 30. Dato Confidential matches unstructured text queries to a set of strings 30 Task-oriented toolkits and building blocks Autotagger = = Nearest neighbors + NGramCounter
  • 31. Dato Confidential31 Nearest neighbors id TF-IDF of text Reference NearestNeighborsModel Query id TF-IDF of text int dict query_id reference_id score rank int int float int Result
  • 32. Dato Confidential32 Autotagger = Nearest Neighbors + Feature Engineering id char ngrams … Reference Result AutotaggerModel Query id char ngrams … int dict query_id reference_id score rank int int float int
  • 33. Dato Confidential33 Task-oriented toolkits and building blocks SentimentAnalysis predicts positive/negative sentiment in unstructured text = WordCounter + Logistic Classifier =
  • 34. Dato Confidential34 Logistic classifier label feature(s) Training data LogisticClassifier label feature(s) int/str int/float/str/dict/list New data scores float Probability label=1
  • 35. Dato Confidential35 Sentiment analysis = LogisticClassifier + Feature Engineering label bag of words Training data SentimentAnalysis label bag of words int/str dict New data scores float Sentiment scores (probability label = 1)
  • 36. Dato Confidential36 Product sentiment toolkit >>> m = gl.product_sentiment.create(sf, features=['Description'], splitby='sentence') >>> m.sentiment_summary(['billing', 'cable', 'cost', 'late', 'charges', 'slow']) +---------+----------------+-----------------+--------------+ | keyword | sd_sentiment | mean_sentiment | review_count | +---------+----------------+-----------------+--------------+ | cable | 0.302471264675 | 0.285512408978 | 1618 | | slow | 0.282117103769 | 0.243490314737 | 369 | | cost | 0.283310577512 | 0.197087019219 | 291 | | charges | 0.164350792173 | 0.0853637431588 | 1412 | | late | 0.119163914305 | 0.0712757752753 | 2202 | | billing | 0.159655783707 | 0.0697454360245 | 583 | +---------+----------------+-----------------+--------------+
  • 37. Dato Confidential Code for deploying the demo 37
  • 38. Dato Confidential Advanced topics in modeling text data 38
  • 39. Dato Confidential39 Topic models Cluster words according to their co-occurrence in documents.
  • 40. Dato Confidential40 Word2vec Embeddings where nearby words are used in similar contexts. Image: https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html
  • 41. Dato Confidential41 LSTM (Long-Short Term Memory) Image: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ Train a neural network to predict the next word given previous words.
  • 42. Dato Confidential Conclusion • Applications of text analysis: examples and a demo • Fundamentals • Text processing • Machine learning • Task-oriented tools • Putting it all together • Deploying the pipeline for the application • Quick survey of next steps • Advanced modeling techniques 42
  • 43. Dato Confidential Next steps Webinar material? Watch for an upcoming email! Help getting started? Userguide, API docs, Forum More webinars? Benchmarking ML, data mining, and more Contact: Chris DuBois, chris@dato.com 43

Editor's Notes

  1. We aim to help developers… create and grow business value with ML: Immediately create a business application using our fully customizable toolkits, and continuously and easily enhance it: Business-task centered ML toolkits: recommenders, churn prediction, lead scoring,…, not just ML algorithms like boosted trees and deep learning. Automated model selection, parameter optimization, and evaluation. Painless feature engineering. Delightful user experience for developers: Use SFrames to manipulate multiple data types, easily create features with pre-defined transformations or easily create your own. 50+ robust ML algorithms where out-of-core computation eliminate scale pains. Deploy models as intelligent microservices: Easily compose intelligent applications from hosted microservices, using the customer’s own infrastructure or AWS Host any model. Performant for production. Model experimentation, monitoring and management.