Daniel Shank, Data Scientist, Talla at MLconf SF 2017

Getting Value Out of Chat Data:
Chat-based interfaces are increasingly common, whether for customers interacting with companies or for employees communicating with each other within an organization. Given the large number of chat logs being captured, along with recent advances in natural language processing, there is a desire to leverage this data for both insight generation and machine learning applications. Unfortunately, chat data is user-generated, so it is often noisy and difficult to normalize. Messages are also mostly short and heavily context-dependent, which makes methods such as topic modeling and information extraction difficult to apply.

Despite these challenges, it is still possible to extract useful information from these data sources. In this talk, I will give an overview of techniques and practices for working with chat-based user interaction data, with a focus on machine-augmented data annotation and unsupervised learning methods.

Bio: Daniel Shank is a Senior Data Scientist at Talla, a company developing a platform for intelligent information discovery and delivery. His focus is on developing machine learning techniques to handle various business automation tasks, such as scheduling, polls, and expert identification, as well as on NLP. Before joining Talla as the company’s first employee in 2015, Daniel worked with TechStars Boston and did consulting work for ThriveHive, a marketing company in Boston focused on small businesses. He studied economics at the University of Chicago.

  1. Getting Value Out of Chat Data: What to Do When Your Data Is Noisy, Sparse, and Short
  2. Introduction
     • Contact:
  3. Talla
     • NLP for internal business use cases
     • Smart knowledge management
     • Hiring!
  4. What is “chat data”?
       USER2: USER3 do you have new new cal on your Talla account already? Looks like it’s not available for me yet. Would be nice if we could also get inbox support enabled since it’s so much better than gmail. cc USER1
       USER3: USER2 I realized that after I typed this that I was using my personal gmail when I updated to the new changes. I looked on Talla and I didn’t see the same option to update to new calendar yet.
       USER4: USER2 I just enabled Inbox for our domain
       USER4: new calendar is set to letting google decide when to roll it out, but it looks like we can also enable it as an option now
       USER4: I've now set that to be available as well. These may take some time to show up
       USER1: USER2 its been enabled for awhile.
       USER1: (inbox)
       USER1: and the new calendar is enabled, soon as google decides you are allowed to have it.
       USER2: Thanks USER1 USER4
  5. Things similar to chat data
     • Sequential interactions
     • Forum posts
     • Some email
     • IT ticketing system interactions
     • Short text
     • Associated with a user
     • Possibly directed at another user
     • Highly context dependent
  6. Problems with chat
     • Increasing number of data sources
     • In theory contains lots of valuable information
     • In practice data is unlabeled
     • “Water, water, everywhere, but not a drop to drink.”
  7. Goal: Issue detection and matching
     • People get help through chat platforms
     • Extract that data and automate the process
     • USER1’s interaction should help USER3!
       USER1: Hi, does anyone know if we have patriot’s day off?
       USER2: Yeah USER1, we do.
       USER1: Thanks!
       …
       USER3: Hey, do we get patriot’s day off?
  8. Automating knowledge delivery
     • Find issues or questions that people have
     • Match new issues to pre-existing ones
     • Serve the appropriate response or answer
     • Extracting answers is very hard
     • Focus on matching and search
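As a rough illustration of the matching step on slides 7 and 8, the sketch below compares a new question against previously answered ones using Jaccard overlap of word tokens. The tokenizer, the `match_issue` helper, the 0.25 threshold, and the placeholder answer are assumptions made for this example; the slides do not say which matching method was used.

```python
import re

def tokens(text):
    """Lowercase word tokens; apostrophes are kept so "patriot's" stays one token."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def match_issue(new_question, known_issues, threshold=0.25):
    """Return (question, answer, score) for the best match at or above threshold, else None."""
    new_tokens = tokens(new_question)
    best = None
    for question, answer in known_issues:
        score = jaccard(new_tokens, tokens(question))
        if score >= threshold and (best is None or score > best[2]):
            best = (question, answer, score)
    return best

# Past question/answer pairs; the second answer is a placeholder, not from the slides.
known_issues = [
    ("Hi, does anyone know if we have patriot's day off?", "Yeah USER1, we do."),
    ("how do i record a vidyo meeting?", "<answer not shown>"),
]

match = match_issue("Hey, do we get patriot's day off?", known_issues)
```

Token overlap is only a baseline: it matches USER3's question to USER1's because the two share "patriot's day off", but it would miss paraphrases with no shared words, which is one motivation for the learned representations later in the talk.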
  9. Overview
     • Jumpstart ML: Active Learning
     • Topic modeling
     • Dimensionality Reduction and Representations
  10. Find questions and analyze
     • Use patterns to find questions
       - Has ‘?’ token
       - Has a question word
     • Not too hard
     • Good start for finding past issues
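A minimal sketch of the two patterns on slide 10. The word list and the choice to check only the first word of a message are assumptions for illustration; the slide does not specify either.

```python
import re

# Hypothetical list of question-opening words; extend as needed for your data.
QUESTION_WORDS = {
    "who", "what", "when", "where", "why", "how", "which",
    "can", "could", "do", "does", "did", "is", "are", "will", "would", "should",
}

def is_question(message):
    """Heuristic: the message contains '?' or begins with a question word."""
    if "?" in message:
        return True
    words = re.findall(r"[a-z']+", message.lower())
    return bool(words) and words[0] in QUESTION_WORDS

chat = [
    "Hi, does anyone know if we have patriot's day off?",
    "Yeah USER1, we do.",
    "Thanks!",
    "how do i record a vidyo meeting",
]
questions = [m for m in chat if is_question(m)]
```

Checking only the first word avoids flagging replies such as "Yeah USER1, we do.", at the cost of missing questions that open with a greeting and omit the question mark.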
  11. Problems with extracted questions
     • Most questions need context to understand, e.g.: “What is it?” “Can I use her personal email?”
     • Intent varies:
       - Want information
       - Do this thing for me
       - Huh?
  12. Only some questions make sense out of context
     • “Who is she?” “What is that?” “Will that fix my computer?”
       - Anaphora: it, that
       - Pronouns: he, she, etc.
     • “What day is it?”, “Where am I?”
       - Answer depends on time, person asking
       - Requires more involved data model
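One hedged way to act on slides 11 and 12 is to flag extracted questions that contain anaphoric or pronoun references, so they can be stored together with their surrounding messages. The word list below is an assumption for illustration and is far from complete.

```python
import re

# Hypothetical list of words that usually point outside the question itself.
CONTEXT_WORDS = {
    "it", "that", "this", "these", "those",
    "he", "she", "him", "her", "his", "they", "them",
}

def needs_context(question):
    """True if the question contains a pronoun or anaphor from CONTEXT_WORDS."""
    words = set(re.findall(r"[a-z']+", question.lower()))
    return bool(words & CONTEXT_WORDS)

examples = [
    "Who is she?",
    "Will that fix my computer?",
    "Can I use her personal email?",
    "how do i record a vidyo meeting?",
]
flagged = [q for q in examples if needs_context(q)]
```

This does not catch questions like “Where am I?”, whose answer depends on who is asking; handling those is the "more involved data model" the slide refers to.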
  13. Questions have different intents
     • “Performative” – Please help me?
       - ex: hi can you please help me reset my 2 factor authentication on salesforce?
     • “Informational” – What is it?
       - ex: what's the pl code?
     • “Navigational” – How do I do this?
       - ex: how do i record a vidyo meeting?
  14. Can we write special case rules?
     • Borderline cases
       - is there a way to find out the size of an hbase table? – User asks “Is there (a way…)” to get directions
       - can anyone tell me where i find the out of stock request report? – User asks someone to give them information
     • Many variants
     • Alternative is to label data and use supervised learning
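To make the borderline cases concrete, here is a deliberately naive rule-based tagger for the three intents on slide 13. The rules are hypothetical, written only to show how surface patterns break down on the examples from slide 14.

```python
def naive_intent(question):
    """Tag a question by surface pattern only (illustrative, not a recommended approach)."""
    q = question.lower().strip()
    if q.startswith(("how do i", "how can i")):
        return "navigational"
    if "can you" in q or "please" in q or q.startswith("can anyone"):
        return "performative"
    return "informational"

slide_examples = [
    "hi can you please help me reset my 2 factor authentication on salesforce?",
    "what's the pl code?",
    "how do i record a vidyo meeting?",
    # Borderline cases from slide 14:
    "is there a way to find out the size of an hbase table?",
    "can anyone tell me where i find the out of stock request report?",
]
tags = [naive_intent(q) for q in slide_examples]
```

The first three examples get the intended tags. The hbase question is tagged informational although the user wants directions, and the report question is tagged performative although the user wants information. Each fix adds another special case, which is the argument for labeling data and training a classifier instead.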
  15. We want to label data, but…
     • Managing crowdworkers:
       - Expensive
       - Time consuming
       - Can’t be used unless data is safely anonymous
     • Will the model work afterwards?
  16. Active Learning makes labeling more efficient
     • More value for your time
     • Can use with crowd workers or without
     • Good for chat:
       - Models train fast
       - Quick to annotate
     • Supervised learning with little labeled data
     • Diagram: a loop of Annotate, Train/Predict, Get data
  17. How it works (roughly)
     • Annotate $D_0 \subset D$
     • Train your model on $D_0$
     • Predict labels on remaining data ($D \setminus D_0$)
     • Choose more data, $D_1 \subset D \setminus D_0$
     • Choice of $D_1$ is based on label predictions
     • Repeat
     • ???
     • Profit!
     • Diagram: a loop of Annotate, Train/Predict, Get data
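The loop on slide 17 can be sketched with uncertainty sampling, one common way to base the choice of the next batch on label predictions. Everything here is an assumption for illustration: the tiny naive Bayes model, the batch sizes, and the `oracle` function that stands in for a human annotator. The slides do not say which model or selection rule was used.

```python
import math
import re
from collections import Counter

def featurize(text):
    """Lowercase word tokens plus '?' as its own token."""
    return re.findall(r"[a-z']+|\?", text.lower())

class TinyNaiveBayes:
    """Binary multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.docs = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(featurize(text))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def prob_positive(self, text):
        size = len(self.vocab) + 1  # +1 bucket for unseen tokens
        totals = {c: sum(self.counts[c].values()) + size for c in (0, 1)}
        log_odds = math.log((self.docs[1] + 1) / (self.docs[0] + 1))
        for tok in featurize(text):
            log_odds += math.log((self.counts[1][tok] + 1) / totals[1])
            log_odds -= math.log((self.counts[0][tok] + 1) / totals[0])
        return 1.0 / (1.0 + math.exp(-log_odds))

def active_learning(pool, oracle, seed_size=2, batch_size=2, rounds=3):
    """Annotate a seed set, then repeatedly label the most uncertain items."""
    labeled = {i: oracle(pool[i]) for i in range(seed_size)}  # annotate D_0
    model = TinyNaiveBayes()
    for _ in range(rounds):
        idx = sorted(labeled)
        model.fit([pool[i] for i in idx], [labeled[i] for i in idx])  # train
        remaining = [i for i in range(len(pool)) if i not in labeled]
        if not remaining:
            break
        # Predict on the rest and pick the items closest to p = 0.5.
        remaining.sort(key=lambda i: abs(model.prob_positive(pool[i]) - 0.5))
        for i in remaining[:batch_size]:
            labeled[i] = oracle(pool[i])  # annotate the next batch
    idx = sorted(labeled)
    model.fit([pool[i] for i in idx], [labeled[i] for i in idx])  # final refit
    return model, labeled

pool = [
    "Hi, does anyone know if we have patriot's day off?",
    "Yeah USER1, we do.",
    "Thanks!",
    "Hey, do we get patriot's day off?",
    "what's the pl code?",
    "You need to drop the form off in person",
    "OK, sure.",
    "Where can I get access to the printers?",
    "Great.",
    "how do i record a vidyo meeting?",
]

def oracle(text):
    """Stand-in for a human annotator: 1 if the message is a question."""
    return int("?" in text)

model, labeled = active_learning(pool, oracle)
```

In practice the oracle is a person, and the benefit is that annotation effort goes to the messages the current model is least sure about rather than to a random sample.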
  18. Where we are
     • Jumpstart ML: Active Learning
     • Topic modeling
     • Dimensionality Reduction and Representations
  19. More to data than questions or intent
     • What do people talk about?
     • What kind of issues are common?
     • Are there clear lines defining topics?
     • Finding problem areas
     • Strategic thinking about what to tackle
  20. Know Your Data
     • Read some of it (if you can)
     • Learn the context
     • Cluster and overview
  21. Clustering or modeling chat topics
     • LDA, LSA, NMF, others
     • Human supervision necessary for interpretation (boo!)
     • Messages short, so chat is hard
     • Larger documents have broader topic distributions
     • We expect messages to be about fewer topics
  22. Using LDA with Chat
     • Top three words per topic, compared across α = 0.5, α = 0.1, α = 0.05, and α = 0.03 (topics listed in transcript order):
       know; does; link | database; jermaine; running | file; area; bank | free; jermaine; database | did; try; work | online; palace; sorry | mean; try; screen | user; hi; email | send; test; agent | try; user; free | did; ok; want | client; server; user | look; able; mean | user; client; error | error; server; user | ok; did; update | online; help; screen | mean; app; does | whats; agent; end | mean; user; file | hi; palace; property | shall; working; process | client; property; user | online; user; change | email; error; just | emails; kelly; time | online; user; update | mandy; wrong; chance | user; issue; want | did; ok; property | palace; live; test | owner; end; invoice | client; need; check | ticket; whats; right | run; right; check | want; error; agent | owner; report; password | check; chloe; duncan | emails; know; link | live; palace; try
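In standard LDA, α is the concentration parameter of the Dirichlet prior on each document's topic mixture, and smaller values favor mixtures dominated by a few topics. The sketch below, which assumes a symmetric prior over 10 topics and uses only the standard library, samples topic mixtures for the four α values on slide 22 and reports the average share taken by the largest topic. The topic count, sample size, and seed are arbitrary choices for illustration.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:  # extremely unlikely underflow for tiny alpha; redraw
        return sample_dirichlet(alpha, k, rng)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=10, n=2000, seed=0):
    """Average weight of the dominant topic across n sampled mixtures."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n

for alpha in (0.5, 0.1, 0.05, 0.03):
    print(f"alpha={alpha}: mean largest topic share = {mean_largest_share(alpha):.2f}")
```

As α shrinks, the dominant topic takes a larger share of each sampled mixture, which matches the expectation on slide 21 that short messages are about fewer topics.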
  23. Where we are
     • Jumpstart ML: Active Learning
     • Topic modeling
     • Dimensionality Reduction and Representations
  24. Why do dimensionality reduction?
     • We want to improve our supervised learning techniques
     • Chat data is even more sparse than many NL datasets
     • Good representations can help search and similarity models
     • Off the shelf representations are good
     • Off the shelf + custom representations are better
  25. Setting up methods for learning
     • Word2vec, NMF, even LDA
     • Most methods equivalent*
     • Chat has no clear document barriers
     • Methods assume either continuous context or separate documents
     • Using messages as contexts → too sparse
  26. Choosing a context
     • Representations are influenced by context choice
     • Figure out your goal
     • Choose context where words are associated in a way helpful for your goal
     • For our purposes: Words should be similar if they occur together in issues people have
  27. Using a time-based context window
     • Window before each question
     • Problem statement and questions should be related
       USER2: Can I email this form, or do I have to print it out?
       USER1: You need to drop the form off in person
       USER2: OK, sure.
       USER1: Great.
       USER2: Where can I get access to the printers?
       …
  28. Keywords are extracted from recent history
       USER2: Can I email this form, or do I have to print it out?
       USER1: You need to drop the form off in person
       USER2: OK, sure.
       USER1: Great.
       USER2: Where can I get access to the printers?
       …
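A minimal sketch of the time-based window from slides 27 and 28: for each question, collect keywords from the question itself and from every message sent within a fixed window before it. The timestamps, the five-minute window, and the stopword list are hypothetical, since the transcript gives none of them.

```python
import re

# Hypothetical stopword list, just large enough for this example.
STOPWORDS = {
    "a", "an", "the", "i", "you", "to", "do", "or", "it", "in", "of", "can",
    "have", "this", "where", "get", "need", "off", "out", "ok", "sure", "great",
}

def keywords(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def question_contexts(messages, window_seconds=300):
    """messages: list of (timestamp_seconds, user, text), sorted by time.

    Returns one (question_text, keyword_list) pair per question, where the
    keywords come from the question and all messages in the window before it.
    """
    contexts = []
    for i, (ts, _user, text) in enumerate(messages):
        if "?" not in text:
            continue
        words = []
        for prev_ts, _prev_user, prev_text in messages[: i + 1]:
            if ts - prev_ts <= window_seconds:
                words.extend(keywords(prev_text))
        contexts.append((text, words))
    return contexts

# Chat from the slide, with made-up timestamps in seconds.
messages = [
    (0, "USER2", "Can I email this form, or do I have to print it out?"),
    (20, "USER1", "You need to drop the form off in person"),
    (35, "USER2", "OK, sure."),
    (40, "USER1", "Great."),
    (90, "USER2", "Where can I get access to the printers?"),
]
contexts = question_contexts(messages)
```

With this window, the printer question inherits "form" and "print" from the earlier exchange, so words from the problem statement and the follow-up question end up in the same context.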
  29. Similarity from resulting representations
     • 'printer': ['printer', 'choice', 'fuji', 'xerox', 'settings', 'sequence', 'default', 'rollover', 'driver', 'takes', 'smaller', 'main']
     • 'issue': ['issue', 'resolved', 'helping', 'experiencing', 'companies', 'related', 'assuming', 'reported', 'double', 'site', 'saw', 'causing', 'understand', 'sorted', 'logging', 'heard']
     • 'ssh': ['ssh', 'config', 'dhcp', 'ping', 'reconnect', 'jpg', 'webconsole', 'coats', 'lab', 'browsers', 'instances', 'bypass']
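Neighbor lists like those on slide 29 come from comparing word vectors. As a stand-in for Word2vec or NMF, the sketch below builds raw co-occurrence count vectors from a few toy contexts and ranks neighbors by cosine similarity. The contexts are invented for illustration and are not the talk's data.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(contexts):
    """Map each word to a Counter of the other words sharing a context with it."""
    vectors = defaultdict(Counter)
    for words in contexts:
        unique = set(words)
        for word in unique:
            for other in unique:
                if other != word:
                    vectors[word][other] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def most_similar(word, vectors, top_n=3):
    """Return the top_n words whose vectors are closest to the given word's."""
    target = vectors.get(word, Counter())
    scored = [(cosine(target, vec), other) for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _score, other in scored[:top_n]]

# Toy keyword contexts (hypothetical).
contexts = [
    ["printer", "driver", "settings"],
    ["printer", "xerox", "driver"],
    ["ssh", "config", "ping"],
    ["ssh", "reconnect", "config"],
]
vectors = cooccurrence_vectors(contexts)
printer_neighbors = most_similar("printer", vectors)
ssh_neighbors = most_similar("ssh", vectors)
```

Count vectors like these grow with the vocabulary and stay sparse, which is why the talk applies dimensionality reduction to obtain dense representations before computing similarity.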
  30. Final Thoughts...
     • Tip of the iceberg
     • Understand how people interact
     • What information can we extract?
     • Can we escape our corpus?
  31. Thank you everyone!
     • thanks: ['heaps', 'great', 'perfect', 'fantastic']