Paul Lo, 2018/12 @
Data Analytics @ Uber, Asia-Pacific Community Operation Central team |
Improving User Experience with Text Mining
and Deep Learning in Uber
Project #1
Text mining tool to unlock user insights
Python lib: natural language processing,
topic modeling
Self-introduction
Who am I?
What does our analytics team do for Asia-Pacific?
Project #2
Deep learning-based answering bot
for call center
Python libs: machine learning libraries
such as tensorflow, keras, sklearn,
and numpy
Table of contents
Skills: Full stack software engineer (Java/ Python) → Data Analyst (Python/ R, databases, machine learning)
Scope of Community Operation in Uber APAC
10+ languages in ~20 countries
Central Team
based in
Singapore (South East and North Asia)
Data @ Uber
Uber’s Data Lake
Stores 30+ Petabytes of data
~M clusters across N data centers
(thousands of servers)
So how much data is that really?
~100,000 years of music
Which is 50x the amount of music streamed on Spotify
each year
50+ billion books or 50 million Kindles
Equivalent to the entire written works of mankind from
the beginning of recorded history, in all languages
150+ years of 24/7 Full HD video recording
The amount of storage required to render 50 Avatar
movies, simultaneously
How big is Big data?
Data-driven business decision culture
Data helps us tell our story to the public and operate better
Typical policy and communications questions:
● How many jobs does Uber provide in Taipei?
● How is Uber pool reducing congestion in Manila?
● What proportion of our trips start or end at public transportation?
** Uber's open-sourced city traffic data:
Typical city operation questions:
● Do we have enough drivers for the New Year?
● How can we reduce the ETA for our riders?
● When is the best time to introduce an EATS delivery fee in my city?
Data tools to support Big data
What are our roles at Uber?
Uber’s Data Lake
App + Support Data:
Rides, Eats, etc.
Payments Data:
Collection, Payments
External Data:
Traffic, Weather,
Holidays, Maps
Machine learning
Query interface
Internal BI Tools
Marketing Data:
Clicks, Impressions,
Improving user experience is one of our core missions
Improve user experience
Drive down defect rate
Optimize operational efficiency
Manage the cost of business operation
Project #1:
Text mining and NLP for user experience
Acknowledgement: Troy James Palanca, Lorenzo Ampil
Value proposition
Speed up the workflow on user experience enhancement
Defect rate and issue type
and etc.
Root cause analysis
and recommended
feature or policy
feedback in
User experience
Making this process more efficient
How can we quickly get the insights from users’ feedback?
Reviewing tickets
manually to diagnose
the root cause is not
scalable and
Ticket dataset
Driver > Trips > Fare … > … > Technical issue
ticket ticket
Use topic modeling
techniques to
efficiently group tickets
and assign them to
reasonably named
Ticket dataset
Driver > Trips > Fare … > … > Technical issue
App stuck/ crash
Fare calculation
GPS issue
Key features of our solution
Using a topic-modeling-based tool to learn our users' pain points
Ticket snippet with user profile: respective ticket
samples are displayed when clicking on a keyword
Word cloud view: user can switch to
this view to see most relevant (tf-idf
score) keywords in each topic
Sample results
“Fare Disputes” in one of the cities we operate in are
mainly about payments, airport issues, and wrong
● Credit cards and other modes of payment
● Overcharging (28.8%)
● Wrong profiles being billed (12.8%)
● Airport terminal issues (12.9%)
● Someone else taking the trip (12.5%)
Sample results
Lots of “rude”, “loud music”, “drunk”, and “slam door” keywords
were detected as the pain points of our NY driver partners
Sample results
More than 10% of driver cancellation
tickets in Singapore are related to car
seat rules for child safety: many
sample tickets show that drivers want
their cancellation fee reimbursed because
their riders brought children without prior
Tool architecture
Computing node
(any Uber servers)
Data collection
Data preparation
LDA model training
Web server
(AWS node)
HTML and JSON
files from
training results
User Interface
Train the model for each country with top issues
Workflow overview
Data input: ticket text as raw
Output: topic model clusters
Unlocking support insights from textual content
Sample ~50,000 tickets for
each training in each issue
Data Preparation
(text processing)
Extract useful information and
transform corpus to a sparse
Data Modeling
(Latent Dirichlet
Main computation to perform
topic modeling
Text processing library: nltk, BeautifulSoup, re, TextBlob
LDA library: gensim.models.ldamodel.LdaModel and pyLDAvis
Remove invalid words:
● Numbers
re.sub(r'\d+', '', text)
● Html tags
● Custom dictionary
Stemming and lemmatization: Reduce inflectional
forms and sometimes derivationally related forms of a
word to a common base form. For instance:
○ cancel, cancels, cancelled -> cancel
○ riders, rider -> rider
Tokenization: Part-of-speech based word
TFIDF (Term Frequency Inverse Document
Frequency) Common practice to score each term
with weighted frequency and relevance
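The "remove invalid words" step of the data preparation above can be sketched with the standard library alone. This is a minimal illustration, not the team's actual text_processing() code: the function name clean_ticket, the stopword set, and the regex-based tag stripping (standing in for BeautifulSoup) are all assumptions made for the example.

```python
import re

# Hypothetical custom dictionary of words that carry no topic signal.
CUSTOM_STOPWORDS = {"uber", "hi", "thanks"}

def clean_ticket(text, stopwords=CUSTOM_STOPWORDS):
    """Strip HTML tags and numbers, lowercase, and drop custom stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags
    text = re.sub(r"\d+", " ", text)       # remove numbers
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tok for tok in tokens if tok not in stopwords]

cleaned = clean_ticket("<p>Hi Uber, I was charged 25 dollars twice</p>")
```

Stemming or lemmatization (for example with nltk) would then run on the returned token list.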
Data Preparation (Natural Language Processing)
Using TFIDF to filter the most important keywords
Machine Learning
Term frequency
Inverse Document
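As a rough sketch of how term frequency and inverse document frequency combine, the snippet below uses the common textbook form tf × log(N / df). Libraries such as gensim or scikit-learn apply their own smoothing and normalization, so their numbers will differ. The function name tf_idf and the toy tickets are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

tickets = [
    ["driver", "cancel", "fee"],
    ["driver", "fare", "overcharge"],
    ["driver", "gps", "issue"],
]
weights = tf_idf(tickets)
```

A word that appears in every ticket ("driver" here) gets a score of zero, while words specific to one ticket score higher, which is why TF-IDF is useful for picking keywords.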
Data preparation can be very time-consuming
Speed up data processing
Pandas runs on a single thread by default
A pandas DataFrame with 50k+ rows
Data Preparation
text_processing() is a heavy function
that does many things:
● Tokenization
● Removal of numbers, html tags, and
other invalid words
● Stemming and lemmatization
→ single thread by default
Worker 1
Worker 2
Worker N
Data processing speedup trick in Pandas
Pandas runs on a single thread by default
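One way to spread a heavy per-row function across N workers is a process pool from the standard library. The sketch below is an assumption about how such a speedup could look, not the exact trick used in the talk. The text_processing body is a trivial stand-in, and parallel_map and its executor_cls parameter are hypothetical names.

```python
import re
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def text_processing(text):
    """Stand-in for the heavy per-ticket function (hypothetical body)."""
    text = re.sub(r"\d+", "", text.lower())
    return " ".join(text.split())

def parallel_map(func, items, n_workers=4, executor_cls=ProcessPoolExecutor):
    """Apply func to every item using a pool of workers, preserving order."""
    items = list(items)
    if not items:
        return []
    # Larger chunks reduce inter-process overhead for many small items.
    chunksize = max(1, len(items) // (n_workers * 4))
    with executor_cls(max_workers=n_workers) as pool:
        return list(pool.map(func, items, chunksize=chunksize))
```

With a DataFrame, usage could look like `df["clean"] = parallel_map(text_processing, df["text"].tolist())`, where `df` and the column names are placeholders. On platforms that spawn rather than fork worker processes, the call should sit under an `if __name__ == "__main__":` guard.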
Many handy text processing libraries
TextBlob as an example
Tokenization Sentence correction
Part of speech
Sentiment analysis
NLP Library
Unlocking support insights from textual content - but how?
- Unsupervised learning
- Bag of words
- “topic distribution”
lda = LdaModel(corpus=corpus,
Latent Dirichlet Allocation model
General concept of this model
Unsupervised learning method - does not
require any class labels; similar to clustering
‘Bag of words’ model - uses word counts in
messages without regard for their order
(Peter owes Alice money = Alice owes Peter money)
Estimated iteratively - Starts with random
initialization then adjusts probabilities to
reduce perplexity / increase fit
(EM; Expectation Maximization)
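The bag-of-words point above can be checked in a few lines: once a message is reduced to word counts, two sentences with the same words in a different order become indistinguishable. The helper name bag_of_words is made up for this illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Reduce a message to its word counts, discarding word order."""
    return Counter(text.lower().split())

first = bag_of_words("Peter owes Alice money")
second = bag_of_words("Alice owes Peter money")
same_bag = first == second  # True: word order is lost
```

This is the representation LDA works on, which is why it groups tickets by shared vocabulary rather than by sentence meaning.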
[Figure: documents Doc 1 ... Doc n, each described as a mixture of topics, e.g. 30% health, 60% fruits (topic 2), 10% disease (topic 3)]
Latent Dirichlet Allocation model
Model implementation and visualization
from gensim.models import LdaModel
from pyLDAvis.gensim import prepare, save_html
lda = LdaModel(corpus=corpus,
Future work and learnings
Customization is needed
● Not suited for
specific issue
● Build own
dictionary for the
removal of
irrelevant words
How to make the results more useful and actionable?
● Number of topics for convergence
● Time and performance
● Other “Deep NLP” models?
Product owner: Huaixiu Zheng and Yichia Wang, Hugh Williams in Uber’s Applied Machine Learning team
Project #2:
Artificial Intelligence revolution in call centers
CSR’s sample workflow for responding to users in a call center
How do our users submit an issue?
CSR’s sample workflow for responding to users in a call center
Online support via in-app-help
The issue for call center operation: scalability and cost
The growth comes at a price again….
Solution? Let’s start from a basic sample
“I want to change my rating for a rider” - very rule-based deterministic flow
The business impact of a simple bot-solving solution
3k+ weekly solves
A team of
18 CSR
28k USD
What’s the problem with this solution?
The difference between Programming and Machine Learning
Our machine learning solution design
Why go with “Semi-automated” assistance rather than real robot?
Product designed by Hugh Williams, Huaixiu Zheng, Yi-Chia Wang in Applied Machine Learning team
Our machine learning solution design
‘Assistant to CSR’ - Provide suggestions for reply and actions
Issue category/ type suggestion
Action suggestion
10M+ tickets
Correct response from
agents to these 10M+
Technical model training Product design
Typical Machine Learning process
Note: picture from Mark Peng’s “General Tips for participating Kaggle Competitions” on SlideShare
Typical Machine Learning process
Model selection
ML 101:
Start with simple model first
Data source:
Deep Learning Architecture
Reference: Uber AML Lab:
1000+ multiclass problem
10M English tickets (10-day)
Deep Learning Architecture
Reference: Uber AML Lab:
Sample code with Keras for a simple CNN
Deep Learning Architecture
Reference: Uber AML Lab:
arXiv Paper: COTA: Improving the Speed and Accuracy of Customer Support through Ranking and Deep Networks (link)
CNN: Max pooling
Optimizers: Adam (SGD, RMSProp), Batch Normalization
Regularization: L2 Reg, Dropout, Batch Normalization, early stopping
Development environment for Deep learning model training
What does model training look like?
Main codebase + data set
Feature engineering and feature importance
Trade off between capacity and interpretability
Feature engineering and feature importance
What are the important features? Very easy to learn that in simpler model
Feature engineering and feature importance
What are the important features? Very easy to get explanation in simpler models
Feature engineering and feature importance
What are the important features? NN model is like our brain’s intuition … blackbox
Feature engineering process
What are the important features?
Trick: the data set is too large, so rerunning the model takes a long time →
Feature engineering and feature importance
What are the important features?
Sklearn: Recursive feature elimination
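A minimal sketch of recursive feature elimination with scikit-learn follows. The synthetic data from make_classification and the choice of LogisticRegression as the estimator are assumptions made for this example. RFE refits the model repeatedly and drops the weakest feature each round until the requested number remains.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 6 features, of which only 2 are informative.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=2,
    n_redundant=0, random_state=0,
)

selector = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=2,  # keep the two strongest features
    step=1,                  # drop one feature per round
)
selector.fit(X, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
ranking = selector.ranking_.tolist()  # rank 1 marks a selected feature
```

Because every round retrains the model, this approach gets expensive when training is slow.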
Feature engineering and feature importance
What are the important features?
Time on model training >>> prediction
Shuffle each feature to create noise…. on the testing set
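The shuffling idea can be sketched with NumPy only: train once, then permute one feature column of the test set at a time and measure how much the error grows. No retraining is needed, which fits the case where training time far exceeds prediction time. The function name, the toy data, and the least-squares model below are assumptions made for this illustration.

```python
import numpy as np

def permutation_importance(predict, X_test, y_test, n_repeats=5, seed=0):
    """Mean increase in squared error when each feature column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((predict(X_test) - y_test) ** 2)
    importances = np.zeros(X_test.shape[1])
    for j in range(X_test.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_noisy = X_test.copy()
            X_noisy[:, j] = rng.permutation(X_noisy[:, j])  # break feature j
            error = np.mean((predict(X_noisy) - y_test) ** 2)
            increases.append(error - baseline)
        importances[j] = np.mean(increases)
    return importances

# Toy data: feature 0 matters a lot, feature 1 a little, feature 2 not at all.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Train once (plain least squares), then only predict while shuffling.
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
scores = permutation_importance(lambda data: data @ weights, X_test, y_test)
```

The same function works with any model that exposes a predict callable, including a neural network.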
Why is NumPy faster?
Python Vectorization: Single Instruction, Multiple Data (SIMD)
Why is NumPy faster?
Python Vectorization: Locality of reference (Spatial Locality)
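To make the vectorization point concrete, the sketch below computes the same sum of squares twice: once with an interpreted Python loop, and once with a single NumPy call that runs a compiled loop over a contiguous array. That contiguous, compiled layout is what makes SIMD instructions and good spatial locality possible. Whether SIMD is actually used depends on the NumPy build and the hardware. The function names are made up for this example.

```python
import numpy as np

def sum_of_squares_loop(values):
    """Pure-Python loop: one interpreted multiply-add per element."""
    total = 0.0
    for v in values:
        total += v * v
    return total

def sum_of_squares_vectorized(values):
    """One NumPy call: the loop runs in compiled code over contiguous memory."""
    arr = np.asarray(values, dtype=np.float64)
    return float(np.dot(arr, arr))

data = np.arange(10_000, dtype=np.float64)
loop_result = sum_of_squares_loop(data)
vector_result = sum_of_squares_vectorized(data)
```

Timing both functions with `timeit` on a large array is a quick way to see the gap on a given machine.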
Java/ C++ versus Python…...
Issue category suggestion
Action suggestion
Product design
Last stop: making business Impact
Ensure KPI measurement is well-planned in the beginning
Last stop: making business Impact
Identify key business metrics, and cautiously conduct and monitor experiments
Experiment notes:
* Network effect → Switchback instead of A/B test
* Guardrail variable and
decision variable (risk control)
* Monitoring versus peeking
* Novelty effect
Other learnings
How to become a better programmer, or data scientist?
● Long-term growth: Don't just know how to call APIs →
○ Understand what’s happening underneath (math and low-level
manipulation are key)
○ Understand the pros and cons of your tool/ model/ framework
● Coding at scale: Resources and infra are rich, but the data is also
huge (as is the risk) → optimize for time and space,
but do not overdesign
● Communication: Everybody is busy → organize and
communicate your work well, and build good social relationships
Recommended reading
How to become a better programmer, or data scientist? Read more, code more, share more
Data Science from Scratch: learning data science with Python
Numpy, Scipy, Pandas
For Java, the bible I recommend is Effective Java
Recommended reading
How to become a better programmer, or data scientist? Read & Code & Share, and repeat
Machine Learning and Deep Learning with Python
Focus on scikit-learn and TensorFlow
Data Science from Scratch
Highly recommended: Python-based and hands-on
On classical concepts and algorithms
Paul Lo
Data Analytics @ Uber
paul.lo *a*t | paullo0106 a*t*

More Related Content

What's hot

PPT5: Neuron Introduction
PPT5: Neuron IntroductionPPT5: Neuron Introduction
PPT5: Neuron Introduction
Ml product page
Ml product pageMl product page
Ml product page
Janu Jahnavi
Ml product page
Ml product pageMl product page
Ml product page
Janu Jahnavi
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Ed Fernandez
The Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureThe Machine Learning Workflow with Azure
The Machine Learning Workflow with Azure
Ivo Andreev
ETL & Machine Learning
ETL & Machine LearningETL & Machine Learning
ETL & Machine Learning
Luthfi Hariz
Mentoring Session with Innovesia: Advance Robotics
Mentoring Session with Innovesia: Advance RoboticsMentoring Session with Innovesia: Advance Robotics
Mentoring Session with Innovesia: Advance Robotics
Dony Riyanto
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-System
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
Grant Ingersoll
Automatic Machine Learning, AutoML
Automatic Machine Learning, AutoMLAutomatic Machine Learning, AutoML
Automatic Machine Learning, AutoML
Himadri Mishra
Aditya Bhattacharya - Enterprise DL - Accelerating Deep Learning Solutions to...
Aditya Bhattacharya - Enterprise DL - Accelerating Deep Learning Solutions to...Aditya Bhattacharya - Enterprise DL - Accelerating Deep Learning Solutions to...
Aditya Bhattacharya - Enterprise DL - Accelerating Deep Learning Solutions to...
Aditya Bhattacharya
AutoML - The Future of AI
AutoML - The Future of AIAutoML - The Future of AI
AutoML - The Future of AI
Ning Jiang
Strata parallel m-ml-ops_sept_2017
Strata parallel m-ml-ops_sept_2017Strata parallel m-ml-ops_sept_2017
Strata parallel m-ml-ops_sept_2017
Nisha Talagala
Machine Learning with Apache Spark
Machine Learning with Apache SparkMachine Learning with Apache Spark
Machine Learning with Apache Spark
IBM Cloud Data Services
Ryan Curtin, Principal Research Scientist, Symantec at MLconf ATL 2016
Ryan Curtin, Principal Research Scientist, Symantec at MLconf ATL 2016Ryan Curtin, Principal Research Scientist, Symantec at MLconf ATL 2016
Ryan Curtin, Principal Research Scientist, Symantec at MLconf ATL 2016
Architecting for Data Science
Architecting for Data ScienceArchitecting for Data Science
Architecting for Data Science
Johann Schleier-Smith
Wolfram alpha A Computational Knowledge Engine Interesting Technology
Wolfram alpha A Computational Knowledge Engine  Interesting Technology Wolfram alpha A Computational Knowledge Engine  Interesting Technology
Wolfram alpha A Computational Knowledge Engine Interesting Technology
Manish Kumar
Machine learning using spark Online Training
Machine learning using spark Online TrainingMachine learning using spark Online Training
Machine learning using spark Online Training

What's hot (19)

PPT5: Neuron Introduction
PPT5: Neuron IntroductionPPT5: Neuron Introduction
PPT5: Neuron Introduction
Ml product page
Ml product pageMl product page
Ml product page
Ml product page
Ml product pageMl product page
Ml product page
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
The Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureThe Machine Learning Workflow with Azure
The Machine Learning Workflow with Azure
ETL & Machine Learning
ETL & Machine LearningETL & Machine Learning
ETL & Machine Learning
Mentoring Session with Innovesia: Advance Robotics
Mentoring Session with Innovesia: Advance RoboticsMentoring Session with Innovesia: Advance Robotics
Mentoring Session with Innovesia: Advance Robotics
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-System
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
Automatic Machine Learning, AutoML
Automatic Machine Learning, AutoMLAutomatic Machine Learning, AutoML
Automatic Machine Learning, AutoML
Aditya Bhattacharya - Enterprise DL - Accelerating Deep Learning Solutions to...
Aditya Bhattacharya - Enterprise DL - Accelerating Deep Learning Solutions to...Aditya Bhattacharya - Enterprise DL - Accelerating Deep Learning Solutions to...
Aditya Bhattacharya - Enterprise DL - Accelerating Deep Learning Solutions to...
AutoML - The Future of AI
AutoML - The Future of AIAutoML - The Future of AI
AutoML - The Future of AI
Strata parallel m-ml-ops_sept_2017
Strata parallel m-ml-ops_sept_2017Strata parallel m-ml-ops_sept_2017
Strata parallel m-ml-ops_sept_2017
Machine Learning with Apache Spark
Machine Learning with Apache SparkMachine Learning with Apache Spark
Machine Learning with Apache Spark
Ryan Curtin, Principal Research Scientist, Symantec at MLconf ATL 2016
Ryan Curtin, Principal Research Scientist, Symantec at MLconf ATL 2016Ryan Curtin, Principal Research Scientist, Symantec at MLconf ATL 2016
Ryan Curtin, Principal Research Scientist, Symantec at MLconf ATL 2016
Architecting for Data Science
Architecting for Data ScienceArchitecting for Data Science
Architecting for Data Science
Wolfram alpha A Computational Knowledge Engine Interesting Technology
Wolfram alpha A Computational Knowledge Engine  Interesting Technology Wolfram alpha A Computational Knowledge Engine  Interesting Technology
Wolfram alpha A Computational Knowledge Engine Interesting Technology
Machine learning using spark Online Training
Machine learning using spark Online TrainingMachine learning using spark Online Training
Machine learning using spark Online Training

Similar to [] improving user experience with text mining and deep learning in Uber

[PythonPH] Transforming the call center with Text mining and Deep learning (C...
[PythonPH] Transforming the call center with Text mining and Deep learning (C...[PythonPH] Transforming the call center with Text mining and Deep learning (C...
[PythonPH] Transforming the call center with Text mining and Deep learning (C...
Paul Lo
Good Applications of Bad Machine Translation
Good Applications of Bad Machine TranslationGood Applications of Bad Machine Translation
Good Applications of Bad Machine Translation
Data Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area ML
Paco Nathan
DCXS best selfcare-solutions DynamicFAQ
DCXS best selfcare-solutions DynamicFAQDCXS best selfcare-solutions DynamicFAQ
DCXS best selfcare-solutions DynamicFAQ
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAML
Paco Nathan
AI in Multi Billion Search Engines. Career building in AI / Search. What make...
AI in Multi Billion Search Engines. Career building in AI / Search. What make...AI in Multi Billion Search Engines. Career building in AI / Search. What make...
AI in Multi Billion Search Engines. Career building in AI / Search. What make...
Andrei Lopatenko
Leverage the power of machine learning on windows
Leverage the power of machine learning on windowsLeverage the power of machine learning on windows
Leverage the power of machine learning on windows
José António Silva
mlflow: Accelerating the End-to-End ML lifecycle
mlflow: Accelerating the End-to-End ML lifecyclemlflow: Accelerating the End-to-End ML lifecycle
mlflow: Accelerating the End-to-End ML lifecycle
ML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queriesML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queries
Varun Nathan
ML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queriesML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queries
Varun Nathan
UKSG 2024 - Demystifying AI - Evaluating future uses and limits in library co...
UKSG 2024 - Demystifying AI - Evaluating future uses and limits in library co...UKSG 2024 - Demystifying AI - Evaluating future uses and limits in library co...
UKSG 2024 - Demystifying AI - Evaluating future uses and limits in library co...
UKSG: connecting the knowledge community
Strata - Final_IB_02_17
Strata - Final_IB_02_17Strata - Final_IB_02_17
Strata - Final_IB_02_17Irina Borisova
Object Automation
Object Automation
Object Automation
Map Reduce amrp presentation
Map Reduce amrp presentationMap Reduce amrp presentation
Map Reduce amrp presentationrenjan131
NESMA - More than just points
NESMA - More than just pointsNESMA - More than just points
NESMA - More than just points
Best Data Science Online Training in Hyderabad
  Best Data Science Online Training in Hyderabad  Best Data Science Online Training in Hyderabad
Best Data Science Online Training in Hyderabad
3 Software Estmation.ppt
3 Software Estmation.ppt3 Software Estmation.ppt
3 Software Estmation.ppt
Soham De

Similar to [] improving user experience with text mining and deep learning in Uber (20)

[PythonPH] Transforming the call center with Text mining and Deep learning (C...
[PythonPH] Transforming the call center with Text mining and Deep learning (C...[PythonPH] Transforming the call center with Text mining and Deep learning (C...
[PythonPH] Transforming the call center with Text mining and Deep learning (C...
Good Applications of Bad Machine Translation
Good Applications of Bad Machine TranslationGood Applications of Bad Machine Translation
Good Applications of Bad Machine Translation
Data Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area ML
DCXS best selfcare-solutions DynamicFAQ
DCXS best selfcare-solutions DynamicFAQDCXS best selfcare-solutions DynamicFAQ
DCXS best selfcare-solutions DynamicFAQ
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAML
AI in Multi Billion Search Engines. Career building in AI / Search. What make...
AI in Multi Billion Search Engines. Career building in AI / Search. What make...AI in Multi Billion Search Engines. Career building in AI / Search. What make...
AI in Multi Billion Search Engines. Career building in AI / Search. What make...
Leverage the power of machine learning on windows
Leverage the power of machine learning on windowsLeverage the power of machine learning on windows
Leverage the power of machine learning on windows
mlflow: Accelerating the End-to-End ML lifecycle
mlflow: Accelerating the End-to-End ML lifecyclemlflow: Accelerating the End-to-End ML lifecycle
mlflow: Accelerating the End-to-End ML lifecycle
Itpe brief
Itpe briefItpe brief
Itpe brief
ML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queriesML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queriesML Framework for auto-responding to customer support queries
ML Framework for auto-responding to customer support queries
UKSG 2024 - Demystifying AI - Evaluating future uses and limits in library co...
UKSG 2024 - Demystifying AI - Evaluating future uses and limits in library co...UKSG 2024 - Demystifying AI - Evaluating future uses and limits in library co...
UKSG 2024 - Demystifying AI - Evaluating future uses and limits in library co...
Strata - Final_IB_02_17
Strata - Final_IB_02_17Strata - Final_IB_02_17
Strata - Final_IB_02_17
Map Reduce amrp presentation
Map Reduce amrp presentationMap Reduce amrp presentation
Map Reduce amrp presentation
NESMA - More than just points
NESMA - More than just pointsNESMA - More than just points
NESMA - More than just points
Best Data Science Online Training in Hyderabad
  Best Data Science Online Training in Hyderabad  Best Data Science Online Training in Hyderabad
Best Data Science Online Training in Hyderabad
3 Software Estmation.ppt
3 Software Estmation.ppt3 Software Estmation.ppt
3 Software Estmation.ppt

Recently uploaded

Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation

Improving User Experience with Text Mining and Deep Learning in Uber

  • 1. Paul Lo, 2018/12 @ Data Analytics @ Uber, Asia-Pacific Community Operation Central team | Improving User Experience with Text Mining and Deep Learning in Uber
  • 2. Table of contents: Improving User Experience with Text Mining and Deep Learning in Uber Self-introduction Who am I? What does our analytics team do for Asia-Pacific? Project #1 Text mining tool to unlock user insights Python lib: natural language processing, topic modeling Project #2 Deep learning-based answering bot for call center Python lib: machine learning related, such as tensorflow, keras, sklearn, numpy, etc.
  • 3. Self-introduction Skills: Full stack software engineer (Java/ Python) → Data Analyst (Python/ R, databases, machine learning) Self-introduction
  • 4. Scope of Community Operation in Uber APAC Scope 10+ languages in ~20 countries Central Team based in Manila, Singapore, India India Singapore (South East and North Asia) Australia
  • 5. Data @ Uber Uber’s Data Lake Stores 30+ Petabytes of data ~M clusters across N data centers (thousands of servers) So how much data is that really? ~100,000 years of music Which is 50x the amount of music streamed on spotify each year 50+ billion books or 50 million kindles Equivalent to the entire written works of mankind from the beginning of recorded history, in all languages 150+ years of 24/7 Full HD video recording The amount of storage required to render 50 Avatar movies, simultaneously How big is Big data?
  • 6. Data-driven business decision culture Data helps us tell the story to the public and operate better Typical policy and communications questions: ● How many jobs does Uber provide in Taipei? ● How is Uber pool reducing congestion in Manila? ● What proportion of our trips start or end at public transportation? ** Uber's open-source city traffic data: Typical city operation questions: ● Do we have enough drivers for the New Year? ● How can we reduce the ETA for our riders? ● When is the best time to introduce an EATS delivery fee in my city?
  • 7. Data tools to support Big data Source:
  • 8. What are our roles at Uber Uber’s Data Lake App + Support Data: Rides, Eats, etc. Payments Data: Collection, Payments External Data: Traffic, Weather, Holidays, Maps Machine learning platform Programming interface Query interface Internal BI Tools Company-wide dashboards Marketing Data: Clicks, Impressions, Sentiment
  • 9. Improving user experience is one of our core missions Improve user experience Drive down defect rate Optimize operational efficiency Manage the cost of business operation
  • 10. Project #1: Text mining and NLP for user experience enhancement Acknowledgement: Troy James Palanca, Lorenzo Ampil
  • 11. Value proposition Speed up the workflow on user experience enhancement Defect rate and issue type Leaderboard Community Operation Product, Engineering, etc. User feedback database Root cause analysis and recommended feature or policy changes Review customer feedback in tickets User experience enhancement
  • 12. Value proposition Speed up the workflow on user experience enhancement Defect rate and issue type Leaderboard Community Operation Product, Engineering, etc. User feedback database Root cause analysis and recommended feature or policy changes Review customer feedback in tickets User experience enhancement Making this process more efficient
  • 13. Problem How can we quickly get the insights from users’ feedback? Problem Reviewing tickets manually to diagnose the root cause is neither scalable nor systematic Ticket dataset Driver > Trips > Fare … > … > Technical issue [diagram: many individual tickets under each issue category]
  • 14. Problem How can we quickly get the insights from users’ feedback? Solution Use topic modeling techniques to efficiently group tickets and assign them to reasonably named topics. Ticket dataset Driver > Trips > Fare … > … > Technical issue App stuck/ crash (35%) Fare calculation Dispute (15%) GPS issue (55%)
  • 15. Key features of our solution Using a topic-modeling-based tool to learn our users' pain points Ticket snippet with user profile: respective ticket samples are displayed when clicking on a keyword Word cloud view: users can switch to this view to see the most relevant keywords (by tf-idf score) in each topic >>DEMO
  • 16. Sample results “Fare Disputes” in one of the cities where we operate are mainly about payments, airport issues, and wrong riders: ● Credit cards and other modes of payment (18%) ● Overcharging (28.8%) ● Wrong profiles being billed (12.8%) ● Airport terminal issues (12.9%) ● Someone else taking the trip (12.5%)
  • 17. Sample results Lots of “rude”, “loud music”, “drunk”, and “slam door” keywords were detected as the pain points of our NY driver partners
  • 18. Sample results More than 10% of driver cancellation tickets in Singapore are related to car seat rules for child safety: many sample tickets show that drivers want their cancellation fee reimbursed because their riders brought children without prior notice.
  • 19. Tool architecture Computing node (any Uber servers) Data collection Data preparation LDA model training Web server (AWS node) Html and json files from training results User Interface (d3js) Train the model for each country with top issues monthly
  • 20. Workflow overview Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category
  • 21. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category Text processing library: nltk, BeautifulSoup, re, TextBlob LDA library: gensim.ldamodel.LdaModel and pyLDAvis
  • 22. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category Remove invalid words: ● Numbers ● Html tags ● Custom dictionary Stemming and lemmatization Tokenization TFIDF (Term Frequency Inverse Document Frequency)
  • 23. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category Remove invalid words: ● Numbers re.sub(r'\d+', '', text) ● HTML tags BeautifulSoup(document).get_text() BeautifulSoup(document).find_all('b') ● Custom dictionary Stemming and lemmatization Tokenization TFIDF (Term Frequency Inverse Document Frequency)
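The "remove invalid words" step on this slide can be sketched roughly as follows. This is a minimal illustration, not the team's actual code: it swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without extra dependencies, and `CUSTOM_STOPWORDS`, `strip_html`, and `clean_ticket` are hypothetical names introduced here.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(document):
    """Return the visible text of an HTML fragment (stand-in for get_text())."""
    parser = _TextExtractor()
    parser.feed(document)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical custom dictionary of words that carry no signal for this corpus.
CUSTOM_STOPWORDS = {"uber", "hi", "thanks"}


def clean_ticket(document):
    text = strip_html(document)                     # remove HTML tags
    text = re.sub(r"\d+", "", text)                 # remove numbers
    tokens = re.findall(r"[a-z']+", text.lower())   # crude word tokenizer
    return [t for t in tokens if t not in CUSTOM_STOPWORDS]


print(clean_ticket("<p>Hi, I was charged <b>twice</b> for trip 12345</p>"))
```

In a real pipeline the custom dictionary would be built per issue category, as the later "Future work and learnings" slide suggests.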
  • 24. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category Remove invalid words Stemming and lemmatization: Reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance: ○ cancel, cancels, cancelled -> cancel ○ riders, rider -> rider Tokenization TFIDF (Term Frequency Inverse Document Frequency)
  • 25. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category Remove invalid words Stemming and lemmatization Tokenization: Part-of-speech based word detection TFIDF (Term Frequency Inverse Document Frequency)
  • 26. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content Sample ~50,000 tickets for each training in each issue category Remove invalid words Stemming and lemmatization Tokenization: Part-of-speech based word detection TFIDF (Term Frequency Inverse Document Frequency) Common practice to score each term with weighted frequency and relevance
  • 27. Data Preparation (Natural Language Processing) Using TFIDF to filter the most important keywords Machine Learning Model
  • 28. Data Preparation (Natural Language Processing) Using TFIDF to filter the most important keywords Machine Learning Model Term frequency Inverse Document Frequency
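As a rough illustration of the term frequency and inverse document frequency idea, the sketch below scores each term with one common unsmoothed variant, tf(t, d) × log(N / df(t)). The toy tickets and the `tfidf` helper are assumptions made for this example; libraries often apply additional smoothing or normalization.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Made-up, already tokenized tickets.
docs = [
    ["driver", "cancel", "fee"],
    ["driver", "rude", "music"],
    ["driver", "fare", "fee"],
]
weights = tfidf(docs)
for doc_weights in weights:
    print(sorted(doc_weights.items(), key=lambda kv: -kv[1]))
```

A word such as "driver" that appears in every ticket gets a score of zero, while a word that is specific to one ticket ranks highest, which is why TF-IDF is useful for filtering keywords.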
  • 29. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Data preparation can be very time-consuming Sample ~50,000 tickets for each training in each issue category Remove invalid words: Stemming and lemmatization Tokenization TFIDF (Term Frequency Inverse Document Frequency)
  • 30. Speed up data processing Pandas runs on a single thread by default A pandas DataFrame with 50k+ rows Data Preparation text_processing() is a heavy function that does many things: ● Tokenization ● Removal of numbers, HTML tags, and other invalid words ● Stemming and lemmatization ● TFIDF df['content'].apply(text_processing) → single thread by default
  • 31. Speed up data processing Pandas runs on a single thread by default Worker 1 Worker 2 Worker N keywords
  • 32. Data processing speedup trick in Pandas Pandas runs on a single thread by default
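The code shown on this slide did not survive in the transcript, so the following is only a sketch of one way the worker diagram could be realized: split the ticket texts into chunks and process each chunk in its own process. It uses the standard library's `concurrent.futures` on a plain list, and `text_processing` here is a simplified stand-in for the heavy function described on slide 30. All function names are assumptions.

```python
from concurrent.futures import ProcessPoolExecutor


def text_processing(text):
    """Simplified stand-in for the heavy per-ticket function (hypothetical)."""
    return [w for w in text.lower().split() if w.isalpha()]


def split_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal chunks."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(chunk):
    return [text_processing(text) for text in chunk]


def parallel_apply(texts, n_workers=4):
    """Apply text_processing to every text, one chunk per worker process."""
    texts = list(texts)
    if n_workers <= 1 or len(texts) < 2:
        return process_chunk(texts)            # serial fallback
    chunks = split_chunks(texts, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(process_chunk, chunks))  # keeps chunk order
    return [row for chunk in results for row in chunk]
```

With a DataFrame, the call might look like `df['keywords'] = parallel_apply(df['content'].tolist())`. On platforms that start workers by spawning a fresh interpreter, that call needs to sit under an `if __name__ == "__main__":` guard. Processes are used rather than threads because pure-Python, CPU-bound work does not run in parallel across threads in the standard interpreter.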
  • 33. Many handy text processing libraries TextBlob as an example Tokenization Sentence correction .correct() Part of speech .tags Sentiment analysis .sentiment.polarity NLP Library (TextBlob) (spaCy)
  • 34. Workflow overview Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Unlocking support insights from textual content - but how? Sample ~50,000 tickets for each training in each issue category LDA: - Unsupervised learning - Bag of words - “topic distribution” Usage: lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=4, random_state=some_number) lda.show_topics()
  • 35. Latent Dirichlet Allocation model General concept of this model Unsupervised learning method - does not require any class labels; similar to clustering ‘Bag of words’ model - uses word counts in messages without regard for their order (Peter owes Alice money = Alice owes Peter money) Estimated iteratively - Starts with random initialization then adjusts probabilities to reduce perplexity / increase fit (EM; Expectation Maximization) Doc 1 Doc 2 Doc 3 Doc n... (topic) Fruits document-topic probabilities 30% health (topic 1) 60% fruits (topic 2) 10% disease (topic 3)
  • 36. Latent Dirichlet Allocation model Model implementation and visualization Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Data input: ticket text as raw data Output: topic model clusters Sample ~50,000 tickets for each training in each issue category Usage: lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=4, random_state=some_number) lda.show_topics() from pyLDAvis.gensim import prepare, save_html from gensim.models import LdaModel
  • 37. Future work and learnings Data Preparation (text processing) Extract useful information and transform corpus to a sparse matrix Data Modeling (Latent Dirichlet Allocation) Main computation to perform topic modeling Customization is needed ● Not suited to every specific issue category ● Build our own dictionary for the removal of irrelevant words Data input: ticket text as raw data Output: topic model clusters How to make the results more useful and actionable? ● Number of topics for convergence ● Time and performance tradeoff ● Other "Deep NLP" models? Word2vec GloVe Fasttext
  • 38. Table of contents: Improving User Experience with Text Mining and Deep Learning in Uber Self-introduction Who am I? What does our analytics team do for Asia-Pacific? Project #1 Text mining tool to unlock user insights Python lib: natural language processing, topic modeling Project #2 Deep learning-based answering bot for call center Python lib: machine learning related, such as tensorflow, keras, sklearn, numpy, etc.
  • 39. Product owner: Huaixiu Zheng and Yichia Wang, Hugh Williams in Uber’s Applied Machine Learning team Project #2: Artificial Intelligence revolution in call centers
  • 40. CSR’s sample workflow to respond to a user in a call center How do our users submit an issue?
  • 41. CSR’s sample workflow to respond to a user in a call center Online support via in-app help
  • 42. The issue for call center operation: scalability and cost The growth comes at a price again….
  • 43. Solution? Let’s start from a basic sample “I want to change my rating for a rider” - very rule-based deterministic flow
  • 44. The business impact of a simple bot-solving solution 3k+ weekly solves A team of 18 CSRs 28k USD monthly
  • 45. What’s the problem with this solution? “Scalability”
  • 46. The difference between Programming and Machine Learning
  • 47. Our machine learning solution design Why go with “Semi-automated” assistance rather than real robot? Product designed by Hugh Williams, Huaixiu Zheng, Yi-Chia Wang in Applied Machine Learning team
  • 48. Our machine learning solution design ‘Assistant to CSR’ - Provide suggestions for reply and actions Issue category/ type suggestion Action suggestion 10M+ tickets Correct response from agents to these 10M+ tickets Technical model training Product design
  • 49. Typical Machine Learning process Note: picture from Mark Peng’s “General Tips for participating Kaggle Competitions” on SlideShare
  • 50. Typical Machine Learning process Model selection ML 101: Start with a simple model first Data source:
  • 51. Deep Learning Architecture Reference: Uber AML Lab: 1000+ multiclass problem 10M English tickets (10-day)
  • 52. Deep Learning Architecture Reference: Uber AML Lab: Sample code with Keras for a simple CNN
  • 53. Deep Learning Architecture Reference: Uber AML Lab: arXiv Paper: COTA: Improving the Speed and Accuracy of Customer Support through Ranking and Deep Networks (link) CNN: Max pooling Optimizers: Adam (SGD, RMSProp), Batch Normalization Regularization: L2 Reg, Dropout, Batch Normalization, early stopping
  • 54. Development environment for Deep learning model training What does model training look like? >> DEMO Main codebase + data set GRID K520
  • 55. Feature engineering and feature importance Trade off between capacity and interpretability “Capacity” “Interpretability”
  • 56. Feature engineering and feature importance What are the important features? Very easy to learn that in simpler models
  • 57. Feature engineering and feature importance What are the important features? Very easy to get explanation in simpler models
  • 58. Feature engineering and feature importance What are the important features? NN model is like our brain’s intuition … blackbox
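  • 59. Feature engineering process What are the important features? Trick: the dataset is too large and retraining the model takes a long time → shuffle the features in the "test data" one at a time to get results quickly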
  • 60. Feature engineering and feature importance What are the important features? Sklearn: Recursive feature elimination (sklearn.feature_selection.RFE) Mockup dataset
  • 61. Feature engineering and feature importance What are the important features? Time on model training >>> prediction Shuffle each feature to create noise…. on the testing set Mockup dataset
  • 62. Feature engineering and feature importance What are the important features? Shuffle each feature to create noise…. on the testing set Mockup example
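The shuffling trick on these slides (often called permutation importance) can be sketched with NumPy alone. The mock dataset and the least-squares model below are assumptions chosen to keep the example small; the point is that the model is trained once and only the test set is perturbed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mock dataset: y depends strongly on feature 0, weakly on 1, not on 2.
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

# Any trained model works here; ordinary least squares keeps the sketch small.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def mse(X_eval):
    """Mean squared error of the fixed model on a (possibly shuffled) test set."""
    return float(np.mean((X_eval @ coef - y_test) ** 2))


baseline = mse(X_test)
importance = []
for j in range(X_test.shape[1]):
    X_shuffled = X_test.copy()
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])  # break feature j only
    importance.append(mse(X_shuffled) - baseline)          # increase in error

print(importance)
```

Because prediction is much cheaper than training, each feature costs one extra pass over the test set instead of a full retraining run, which matches the "time on model training >>> prediction" remark on slide 61.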
  • 63. Why is NumPy faster? Python Vectorization: Single Instruction, Multiple Data (SIMD)
  • 64. Why is NumPy faster? Python Vectorization: Single Instruction, Multiple Data (SIMD)
  • 65. Why is NumPy faster? Python Vectorization: Locality of reference (Spatial Locality) Java/ C++ versus Python…...
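A small timing sketch can make the vectorization point concrete. It compares an element-by-element Python loop with a single `np.dot` call on the same contiguous arrays; the array size and the `dot_loop` helper are arbitrary choices for illustration, and actual timings depend on the machine.

```python
import time

import numpy as np

n = 200_000
a = np.arange(n, dtype=np.float64)   # contiguous block of doubles
b = np.ones(n, dtype=np.float64)


def dot_loop(x, y):
    """Dot product computed one element at a time in the Python interpreter."""
    total = 0.0
    for i in range(len(x)):
        total += x[i] * y[i]
    return total


start = time.perf_counter()
slow = dot_loop(a, b)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = float(np.dot(a, b))           # one call into compiled, vectorized code
numpy_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s  numpy: {numpy_seconds:.6f}s")
```

Both versions return the same value; the vectorized call avoids per-element interpreter overhead and reads the data sequentially from memory, which is where the SIMD and spatial-locality arguments on these slides apply.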
  • 66. Issue category suggestion Action suggestion Product design Last stop: making business Impact Ensure KPI measurement is well-planned in the beginning
  • 67. Last stop: making business Impact Identify key business metrics, and cautiously conduct and monitor experiments Source: Experiment notes: * Network effect → Switch back instead of A/B test * Guardrail variable and decision variable (risk control) * Monitoring versus peeking * Novelty effect
  • 68. Other learnings How to become a better programmer, or data scientist?
  • 69. Other learnings How to become a better programmer, or data scientist? ● Long-term growth: Don't just know how to call APIs → ○ Understand what’s happening beneath (math and low-level manipulation are key) ○ Understand the pros and cons of your tool/ model/ framework choice ● Coding at scale: Resources and infra are rich, but data is also huge (as well as the risk) → optimize for time and space, but do not overdesign ● Communication: Everybody is busy → organize and communicate your work well, and build good social relationships
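  • 70. Recommended reading How to become a better programmer, or data scientist? Read more, code more, share more Data Science from Scratch (Chinese edition: 用python學資料科學): highly recommended; you can also try reading the original English edition Python資料運算與分析實戰 (Python data computation and analysis in practice: Numpy, Scipy, Pandas): programming books by Japanese authors are also excellent... 流暢的Python (Fluent Python): for Java, my recommended bible is Effective Java; this one may not quite be at that level, but I recommend it too!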
  • 71. Recommended reading How to become a better programmer, or data scientist? Read & Code & Share, and repeat Machine Learning and Deep Learning with Python Focus on scikit-learn and TensorFlow Data Science from Scratch Highly recommend: Python-based hand-in-hand On classical concepts and algorithms
  • 72. Paul Lo Data Analytics @ Uber paul.lo *a*t | paullo0106 a*t* Q&A