Strata 2013: Text Analytics at Scale


Millions of users visit Intuit product portals every day. With web analytics, we know what user behavior looks like, but not why. By tapping into in-product search and social data, we began to understand the types of questions, pain points, and suggestions users have. This was made possible with text analytics, via unguided machine learning at scale.

Topic discovery was just the beginning though. Trending, segmentation, integration with clickstream data and association with business goals made voice of customer insights actionable. In this presentation, learn about:

Text analytics at Intuit (case study)
Building decision support around text analytics
Technical approach & scaling
Protecting data privacy
Open source & commercial solutions

Heather Wasserlein is a Senior Product Manager at Intuit, where she partners with Data Science to create data-driven New Business Initiatives. Prior to Intuit, Heather worked on advertising marketplaces and web content classification at Yahoo! Heather holds a Master’s degree in Mechanical Engineering from MIT.

  • In a digital world, businesses give customers many channels to communicate throughout the end-to-end customer experience: shop, browse, buy, use, and so on. Ideally, we'd "listen" equally well across all of these touch points. Yet much of the analytics focus is either upstream (e.g., search engines) or downstream (e.g., social media). That provides insight into user intent and feedback, but misses very important insights into the customer experience with your products and services. For example, site search, customer support channels (call centers, chat), and communities are valuable sources of insights. Rather than wait for feedback on Yelp or Twitter, there's an opportunity to be proactive and address customer questions during product use. Also, with many channels there are many formats for data: a tweet doesn't look like a blog post, and voice data often gets converted to text (by a machine, or by an agent summarizing a call, for example).
  • It is not uncommon to see people trying to read through thousands of customer surveys, suggestions, and the like. One of my first text analytics requirements sessions was with a sharp User Experience Designer who would spend her Friday afternoons reading as many feedback reports as possible. CEOs often personally read a subset of emails from customers or listen in on support calls; our CEO does. This is commendable, but it doesn't scale when you receive millions of communications every day. Nor is it possible to keep up with ever-changing topics: today's customer questions could be completely different from yesterday's.
  • While language has some structure, there is ambiguity. Words have multiple meanings, different forms, and can be used in metaphor (e.g., tin can vs. we can; colorful fish vs. let's go fish vs. a fish out of water). In addition, we are human. We have our own unique way of saying things. Some of us are polite and punctuate. Others misspell and abbreviate. Sometimes we share TMI, including our PII. With text analytics, all of our data gets thrown in the mix. The goal is to make sense of it all.
  • In order to accurately "summarize" text data, the trick is to count all related topics across the corpus.
  • At the most basic level, we're trying to understand the meaning of words, with uncertainty due to context, morphology, and accuracy (e.g., misspellings). More generally, we're trying to understand user intent, sentiment, etc. Note: the more verbose a document is (a blog post is verbose, a tweet is sparse), the more linguistics can help. Levels of linguistics: sounds; words (literal meaning); bi-grams, etc. (words that go together, like "new york"); phrases ("who let the dogs out?"); sentences and part of speech / POS (subject, object, noun, adjective, verb, etc.); context within a large block of text. Terminology: corpus; documents (text data: a tweet, search query, blog entry, etc.), called "verbatims" if in the user's own words; words vs. tokens; topics / themes.
  • Everyone has a particular writing (and speaking) style. Some people use certain vocabulary more than others; I bet you could distinguish a paragraph from the NYT vs. Cosmopolitan. Statistics can be used here: find a distribution for every word (e.g., how many times "the" is used in general publications) and compare it to your writing (e.g., do you use "the" more than the average person?). Note: women use adjectives more than men.
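The statistical comparison the note describes can be sketched in a few lines: compute a word's relative frequency in a sample and compare it to a background rate. This is only an illustration; the sample text and the baseline rate below are made up, not from the talk.

```python
from collections import Counter

def relative_freq(text, word):
    """Fraction of tokens in `text` equal to `word` (crude whitespace tokenization)."""
    tokens = text.lower().split()
    return Counter(tokens)[word] / len(tokens)

# Hypothetical stylometry check: does this author use "the" more than a baseline?
sample = "the cat sat on the mat near the door"
baseline_rate = 0.06  # assumed background rate, for illustration only
uses_the_heavily = relative_freq(sample, "the") > baseline_rate
```

Real forensic-linguistics work compares distributions over many function words, not one, but the mechanic is the same.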
  • Taxes are complex; people have tons of questions from start to finish.
  • Intuit also uses Clarabridge (a rules-based solution) for categorization of support call logs, and Radian6 for monitoring and sentiment analysis of social media. The primary driver for unsupervised clustering of in-product search queries was to capture "emerging issues": things we couldn't foresee ahead of time when building rules (e.g., a bug introduced in a product launch, late-legislation issues with the IRS, etc.). Another benefit of unsupervised approaches is that they don't require human input or maintenance (low effort).
  • Numbers tell us what is happening, but not why. This is where text completes the story. For example, you may see conversion going up or down, but what's driving the change? By looking at emerging issues (what people are talking about today), you can see if a bug was introduced in your recent launch. Trending is also valuable, to determine whether a particular topic is gaining strength or has gone away (e.g., after making a product enhancement). Segmentation enables you to see the types of questions new vs. returning users have, or better yet, questions from non-converters. But unlike numeric data, where you can slice and dice results after aggregating, with text you get more accurate results if you segment before clustering. Are tax filers procrastinators? ;-) File extension is a perennial top theme the night before tax day. Integrating text into "funnel analysis" was extremely valuable: clickstream data tells us where users drop off, but not why, and verbatims helped pinpoint user pain points and roadblocks. Resolving just one of these pain points was worth $5M. Analysis of adjectives provides a directional gauge for sentiment. Perhaps a more accurate way to gauge sentiment is to segment promoters from detractors and see what each group has to say.
  • When I began working at Intuit three years ago, there were text analytics efforts centered around call logs and social. We used a rules-based categorization tool called Clarabridge to classify logs from call center agents. We obtained a data feed from Facebook, Twitter, blogs, etc., and evaluated results with Radian6, a Salesforce tool. Both of these tools work well for their respective use cases, but we noticed a gap: we didn't have a good way to detect emerging issues. Thus began a three-year journey in unguided machine learning for automated topic discovery (i.e., no human input required).
  • Pre-processing is 90% of the solution: you can greatly reduce complexity by removing stop words, stemming, mapping synonyms, etc., which shrinks the term-document matrix. With a 30% sampling rate, we saw an equivalent set of "top themes" as with the complete, 100% data set. Rules-based categorization scales linearly, but clustering is memory constrained, because everything is compared with everything else. Segmentation helps, because segments can be processed in parallel. With 64GB of memory, clustering of 5 million searches took under 2 hours, enabling next-day reporting on yesterday's clickstream by 9AM. Optimizing upstream processes helps too. Note: as text becomes more verbose, computation slows, a lot. Using part-of-speech parsing to focus on nouns can help identify what a document is about, although you miss sentiment (adjectives). Rules-based approaches (categorization based on keywords) are also easier. It depends what type of problem you are solving.
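The pre-processing steps named above (stop-word removal, synonym mapping, stemming) can be sketched as a small pipeline. Everything here is an illustrative stand-in: the stop-word list and synonym map are toy examples, and the suffix stripper is a crude substitute for a real Porter-style stemmer.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "i", "my", "to", "do", "where", "can"}
SYNONYMS = {"pwd": "password"}  # illustrative mapping only

def stem(token):
    # Naive suffix stripping, a stand-in for a real stemmer
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(verbatim):
    tokens = re.findall(r"[a-z0-9]+", verbatim.lower())
    tokens = [SYNONYMS.get(t, t) for t in tokens]        # map synonyms
    tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
    return [stem(t) for t in tokens]                     # collapse word forms
```

Each step shrinks the vocabulary, and with it the term-document matrix the clustering step must hold in memory.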

    1. Text Analytics at Scale: Listening to 45 Million Customers. Heather Wasserlein, Intuit. STRATA Hadoop World, Oct 30, 2013
    2. We've all been here..
    3. On the phone with customer support
    4. Can anyone hear me?
    5. It's extremely frustrating
    6. Employees are eager to help. So, why the gap?
    7. Many touch points: from user intent to user feedback
    8. Overwhelming data volumes. You can read a few thousand customer comments, but not millions. And new themes come up every day..
    9. You can pull a "top 1000" list, but is it telling you anything new? Is it actionable? Top: hello, help, call, login. Mid: password, cant find pwd, account, multiple accounts, print, import error 5514, phone, printing blank page, phone number, call customer sevice, change password, charged twice, cancel. Long tail: print function not working, new version of IE error msg 87956, please call back at 555-555-5555
    10. Insights often in the tail. A needle-in-the-haystack problem: valuable details hidden in descriptive, tail verbatims
    11. Related topics dispersed. The "top 1000" can be misleading: the most common verbatims may not represent the most common themes
    12. What is text analytics? With numeric data, you can run summary stats; summarizing textual data is more complex. Statistics + linguistics. You can mix and match various statistical and linguistic tools, depending on the problem
    13. Example: forensic linguistics. Same author?
    14. Case studies. Applying text analytics to simple and complex problems at Travelocity, Yahoo! and Intuit
    15. Travelocity search. Where is Albekerke? Search variations: San Jose; San Jose, CA; San Jose, Costa Rica; San Jose Intl Airport; NY; NYC; JFK; New York, NY, USA; NY, New York; Grand Canyon; Disneyland
    16. Travelocity search solution. A finite set of airports, but many variations in search: San Jose; San Jose, CA; San Jose International; Mineta San Jose Airport; San Josee Airport; Silicon Valley; SJC. Simple but manually intensive solution: a mapping of all known search variations to relevant airport codes, plus sound-ex phonetic matching to catch unforeseen misspellings. A "rules-based" approach: no statistics, minimal linguistics (sounds)
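The sound-ex phonetic matching mentioned on this slide can be sketched with the classic Soundex code, which keeps the first letter and encodes the remaining consonants as digits. This is a simplified version (the full algorithm has extra rules around h/w separators), but it is enough to send "Albekerke" and "Albuquerque" to the same key.

```python
def soundex(word):
    """Simplified Soundex: first letter plus up to three consonant-class digits."""
    codes = {}
    for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in group:
            codes[ch] = digit
    word = word.lower()
    digits = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:  # skip vowels and collapse repeated codes
            digits.append(code)
        if ch not in "hw":         # h/w do not break a run of equal codes
            prev = code
    return (word[0].upper() + "".join(digits) + "000")[:4]
```

Both misspelling and canonical name encode to the same key, so a lookup table keyed by Soundex code catches variants no one anticipated when writing the rules.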
    17. Yahoo! web site classification. Is this site clean? Does it contain any illegal or sensitive content (alcohol, tobacco, drugs, online gambling, violence or weapons, adult content)? Does the web site meet advertiser standards?
    18. Yahoo! web site classification solution. Verbose, rapidly-changing data, but a finite set of topics. 100,000's of web sites in Y! and partner ad networks. Training data (human-labeled): 5K positive examples, 30K negative examples. Multiple approaches: classifiers, keyword matching, image matching, and a human-review process. Supervised machine learning: pattern detection, phrases and contexts associated with a finite set of "risk categories." Emphasis on recall, catching true positives
    19. Intuit tax support. Adjusted cost basis?
    20. Intuit tax support solution. Millions of questions daily, of all types. Google-like search, but often in natural language: PIN number; Where can I find my PIN?; Newly married, file jointly; File married or separately?; Home mortgage deduction; Can I deduct my dog?; Why is 1099-int import slow?; Where's my refund?? Solution: clustering of site searches, topic "discovery" (themes such as PIN, file, married, deduct, 1099int, import, refund). Unsupervised machine learning: statistics and linguistics, part-of-speech tagging, detection of words that "go together more often than not"
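The clustering step can be sketched with bag-of-words vectors and cosine similarity in pure Python. This is a greedy single-pass sketch under illustrative assumptions (the example queries and the 0.3 threshold are made up); the production approach was the custom n-gram clustering described in the appendix slide.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster_queries(queries, threshold=0.3):
    """Greedy clustering: each query joins the first cluster whose seed is similar enough."""
    vectors = [Counter(q.lower().split()) for q in queries]
    clusters = []  # each cluster is a list of indices; members[0] is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return [[queries[i] for i in members] for members in clusters]
```

For example, `cluster_queries(["file extension", "file an extension", "where is my refund", "refund status"])` groups the two extension queries together and the two refund queries together, despite no query repeating verbatim.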
    21. Results for 3 algorithms. LDA (bag of words): file, free, taxes / file, extension, get / file, security, social / income, state, business / payment, state, filed / state, refund, check. Lingo (hierarchical clustering): file / file 2012 / file an extension / file state / deduction / deduction car / deduction sales tax / deduction standard. Custom n-gram clustering (in-house solution): file extension / social security / business income / sales tax deduction / refund check / payment
    22. Words + numbers = insights. Emerging topics, funnel analysis, trending & (pre-)segmentation, and sentiment, with examples such as: refund deduct, late legislation, file extension, error 576, enter w2, import error.., taxes done!
    23. Use cases. Product Managers: 1. user needs (identify product enhancements, rapidly diagnose product defects, tune site search, personalize content); 2. emerging issues (early insight into new issues). Customer Care: 1. common questions (train agents & staff appropriately, address common questions to retain users); 2. call routing (segment by VOC). Marketing: 1. segment by sentiment and empower promoters; 2. customer dialogue (listen to feedback & respond 1:1 or 1:many)
    24. Our journey. Y1, proof of concept: a science project clustering 2M searches with a 2-day lag; emerging issues detection; vocal early adopters; site search & FAQ tuning enabled; "VOC team" meets weekly. Y2, productize: transfer from science to engineering; campaign to grow adoption to 15M searches with a 1-day lag; report email; data volume grew, system crawled. Y3, scale..!: scaled to 30M searches, next-day 9am SLA; viral adoption, 50+ users; 100's of items actioned; 2 new products; $10M's of cross-functional value
    25. Scaling. Reduce problem size: 1. pre-process (de-dup; remove PII, system-generated info, etc.; remove stop words; map synonyms; stemming); 2. reduce data size (sample; segment; narrow the time period; remove tail terms, cautiously). Add hardware: 1. add memory (text clustering is memory constrained; verbose text is harder); 2. distribute processes (rule-based categorization scales linearly; clustering of segments can be run in parallel; data sourcing; pre-processing). Optimize algorithm: 1. tradeoffs & tuning (choose an approach to balance accuracy vs. performance; tune algorithm parameters)
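The "clustering of segments can be run in parallel" point can be sketched as a group-then-fan-out step. The segment key and the per-segment `summarize_segment` function below are hypothetical stand-ins, and at the data volumes in the talk this fan-out would be a distributed job rather than a local pool.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def summarize_segment(queries):
    # Stand-in for the real per-segment clustering step
    return {"count": len(queries)}

def cluster_by_segment(records):
    """records: (segment, query) pairs, e.g. segment = "new" vs "returning" users."""
    segments = defaultdict(list)
    for segment, query in records:
        segments[segment].append(query)
    # Segments are independent, so per-segment work can fan out in parallel
    with ThreadPoolExecutor() as pool:
        summaries = pool.map(summarize_segment, segments.values())
    return dict(zip(segments.keys(), summaries))
```

Segmenting first also matches the note earlier in the deck that text gives more accurate results when you segment before clustering, so the parallelism comes for free.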
    26. Results. 1. Faster time to insights: customer issues detected up to 1 week earlier; search is a leading indicator for call drivers, a canary in the coal mine. 2. Better customer experience: using text insights to tune search results improved relevancy; identifying users with common questions made it possible to personalize the experience; VOC data + user behavior led to a whole new understanding of product use. 3. $10's of millions in revenue: detecting and resolving customer pain points generated $10's of millions
    27. Getting started? 1. Read a sample of verbatims + scope the problem: topic discovery or known topics? Sources of text and verbosity (few words, sentences, pages)? Estimate data volumes and define SLA's. 2. Build vs. buy: compare tools, build proofs of concept; compare results relative to a "golden set". 3. Start small: one data source, non-verbose text, small volumes; 1000's of documents for statistically valid results; beta test reporting, QA topic-verbatim fit. 4. Establish business processes: a cross-functional process to action insights; let reports go viral. Scale and incorporate domain knowledge later ("phase 2")
    28. Long story short. Listen. To everyone! Words + numbers = insights. Apply the right tools for the job
    29. Thank you! @heatherwater @IntuitInc
    30. Appendix
    31. "Home grown" algorithm: unsupervised machine learning / clustering. 1. Identify candidate phrases. Sparse text: identify all combinations of bi-grams, tri-grams, and four-grams. Verbose text: use linguistic approaches to identify phrases (split text into sentences, identify the part of speech for each word, and apply linguistic filters to parse candidate phrases such as adj + noun or verb + adv). 2. Determine which phrases are "significant": count word frequencies and calculate likelihood ratios, where L1 = words are independent and L2 = words are dependent; if L2 > L1, the words appear together more often than not. 3. Cluster related topics: represent n-grams and searches as vectors, calculate similarity (cosine distance), and cluster related topics when similarity > a pre-defined threshold. 4. Identify a topic "title": construct a "title" representative of the cluster (e.g., the most common search)
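Step 2 above (comparing the likelihood that two words are dependent vs. independent) is commonly implemented with Dunning's log-likelihood ratio; the slide does not name the exact statistic, so this is a sketch under that assumption, working from raw counts of a bigram (w1, w2):

```python
import math

def binomial_log_likelihood(k, n, p):
    p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def llr(c1, c2, c12, n):
    """Dunning log-likelihood ratio for a bigram (w1, w2).

    c1  = count of w1, c2 = count of w2,
    c12 = count of the bigram, n = total tokens.
    Large positive values mean the words co-occur more often than chance.
    """
    p = c2 / n                      # L1: P(w2) is independent of w1
    p1 = c12 / c1                   # L2: P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)      # L2: P(w2 | not w1)
    return 2 * (binomial_log_likelihood(c12, c1, p1)
                + binomial_log_likelihood(c2 - c12, n - c1, p2)
                - binomial_log_likelihood(c12, c1, p)
                - binomial_log_likelihood(c2 - c12, n - c1, p))
```

With made-up counts where "york" follows "new" in 3 of its 4 occurrences out of 14 tokens, `llr(4, 3, 3, 14)` is large and positive, while a pair whose joint count exactly matches the independence expectation scores 0.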
    32. What's next for text at Intuit? 1. Finalize evaluation of new algorithms (e.g., Lingo3G, LDA). 2. Scale through distributed processing (i.e., move to Hadoop). 3. Support more types of text (e.g., verbose). 4. Continue to integrate topics & usage data for a complete picture of the end-to-end user experience. 5. Provide text analytics as a service. 6. Semantic search. 7. Internationalization (future)