H2O World - Clustering & Feature Extraction on Text - Seth Redmore

H2O World 2015 - Seth Redmore with Lexalytics
Clustering & Feature Extraction on Text w/H2O & Lexalytics

Published in: Software

  1. Discovery++: Clustering + Text Analytics
     Seth Redmore, CMO, Lexalytics, Inc. (@sredmore)
     Paul Barba, Senior Architect, Lexalytics, Inc.
     © 2015 Lexalytics Inc. All rights reserved
  2. Agenda
     • Who is Lexalytics
     • What our stack looks like
     • How to fit Machine Learning and Text Analytics together
     • Text and its annoying challenges
     • Clustering and extraction process
     • Interesting results
     • What else could we have done?
     • Human/Computer Partnership
  3. Who is Lexalytics?
     • Founded in 2003
     • Text Analytics Engine
       – Entities, Sentiment, Themes, Summaries, Intentions, Categories
     • On-Premise, SaaS, Desktop
     • Popular in Social Listening, Customer Experience Mgmt.
     • Billions of documents/day processed across our customers
     • Hybrid approach to text analytics using machine learning, natural language processing algorithms, pattern files, and dictionaries
     • Fun fact: We maintain almost 40 different machine learning models
  4. Layers of Interpretation: Transparent Deep Learning
     [Diagram]
     • Multi-layered Text Deconstruction (Text Preparation): Sentence Breaking, Tokenization, Lexical Chaining, PoS, Chunk, Syntax
     • Supporting layers: Base Knowledge, Syntax Matrix, Vertical Optimization, Concept Matrix
     • Feature Extraction: Intentions, Themes, Entities, Sentiment +/-, Summaries, Categories
  5. The Discovery Problem vs. The Prediction Problem
     • Two obvious ways to integrate NLP and Machine Learning
       – Learn, then NLP → Discovery
       – NLP first, then Learn → Predictions
     • We decided to give the first one a try, because discovery is often the first thing an analyst needs to do with text.
       – “Ok, I just got 500k tweets dumped on me and I need to understand what’s up.”
     • Once some degree of “importance” is measured, it is easy to integrate into predictive models
  6. Text and why it’s annoying
     • Medium dimensionality
       – As compared to:
         • Video: gazillions of dimensions
         • Netflix rating data: lots of users, bunch of movies dimensions, also sparse (Users * Movies)
     • But very sparse across those dimensions
       – Of the, say, ~100,000-200,000 lemmas that come up with reasonable frequency, how many are you getting in your corpus?
     Source: http://www.oxforddictionaries.com/us/words/the-oec-facts-about-the-language
  7. Discovery Process: Cluster then Extract
     • Clustering allows us to discover naturally occurring groupings of text
     • Post-clustering, we will then extract the features from the clusters to see what’s in them
       – Terms
       – Bigrams
       – Trigrams
       – Themes
       – Entities
       – Sentiment
     [Diagram: each cluster yields its own Themes, Entities, and Sentiment +/-]
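The "cluster, then extract" step can be illustrated with a minimal sketch. It assumes cluster labels already exist and uses a naive whitespace tokenizer; the function name `top_ngrams_per_cluster`, the sample texts, and the labels are all hypothetical and are not the Lexalytics extraction engine.

```python
from collections import Counter, defaultdict

def top_ngrams_per_cluster(docs, labels, n=2, top=3):
    """Count word n-grams inside each cluster and return the most frequent ones."""
    counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        for i in range(len(tokens) - n + 1):
            counts[label][" ".join(tokens[i:i + n])] += 1
    return {label: [gram for gram, _ in c.most_common(top)]
            for label, c in counts.items()}

# Invented sample texts and hypothetical cluster assignments.
docs = [
    "supply issues delay global rollout",
    "global rollout hit by supply issues",
    "town hall meeting tonight",
    "great town hall meeting",
]
labels = [0, 0, 1, 1]

result = top_ngrams_per_cluster(docs, labels)
for label in sorted(result):
    print(label, result[label])
```

Setting `n=1` or `n=3` gives per-cluster terms or trigrams in the same way; themes, entities, and sentiment need a real text analytics engine and are not sketched here.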
  8. Themes example
     House and Senate leaders hatched their plans Thursday to avoid a politically risky shutdown next week, moving to separate an acrimonious battle over abortion from a must-pass bill to keep government agencies open.
     After Pope Francis addressed a joint meeting of Congress, Speaker John Boehner told his leadership team he would immediately put a plan to defund Planned Parenthood into a legislative vehicle known on Capitol Hill as "reconciliation," which cannot be filibustered in the Senate.
     The speaker's team argues that by putting the provision in a reconciliation bill, there's a good chance it will be approved in both chambers of Congress and it will force Obama to use his veto pen. It would also allow them to pass a stop-gap measure free of Planned Parenthood restrictions before the Oct. 1 deadline to keep the government open.
     The move is bound to anger conservatives, and Boehner will pitch the plan Friday morning to a closed-door conference meeting.

     Extracted themes       Sentiment
     anger conservatives    -2.07
     risky shut down        -3.82
     acrimonious battle     -4.50
     must-pass bill         -2.32
     good chance            +3.00
  9. Themes
     [Diagram labels: Text, Text Prep, PoS Patterns, Theme Candidates, Scoring Algorithm, Tuning, Scored Themes]
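Assuming the diagram describes matching part-of-speech patterns against prepared text to produce candidate themes, the matching step might look roughly like the sketch below. The hand-tagged tokens stand in for a real PoS tagger, and the `PATTERNS` list is a hypothetical stand-in for a pattern file; neither comes from the slides, and the scoring and tuning stages are not shown.

```python
# Hand-tagged tokens stand in for the output of a real PoS tagger (assumption).
TAGGED = [
    ("an", "DET"), ("acrimonious", "ADJ"), ("battle", "NOUN"),
    ("over", "ADP"), ("a", "DET"), ("must-pass", "ADJ"), ("bill", "NOUN"),
]

# Hypothetical pattern file: tag sequences that may form a theme.
PATTERNS = [("ADJ", "NOUN"), ("NOUN", "NOUN")]

def candidate_themes(tagged, patterns):
    """Return every token window whose tag sequence matches one of the patterns."""
    found = []
    for pattern in patterns:
        size = len(pattern)
        for i in range(len(tagged) - size + 1):
            window = tagged[i:i + size]
            if tuple(tag for _, tag in window) == pattern:
                found.append(" ".join(word for word, _ in window))
    return found

themes = candidate_themes(TAGGED, PATTERNS)
print(themes)
```

On this input the two adjective-noun windows are the only matches, which lines up with two of the themes in the example slide.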
  10. Clustering
      • H2O supports k-means clustering
      • k-means clustering:
        – Find k center points that minimize the sum of squared distances between each cluster member and its cluster’s center (“Within-Cluster Sum of Squares”, WCSS)
      • k-means can be solved in reasonable time with fixed dimensionality and number of clusters
      • 3 steps:
        – Decide what you’re going to cluster on
        – Initialize the set of centers (k-means++)
        – Run some sort of optimized algorithm
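To make the three steps concrete, here is a small pure-Python sketch of k-means with k-means++ seeding that reports WCSS. It uses plain Lloyd iterations as the "optimized algorithm" and is not H2O's implementation; the helper names and the toy points are assumptions for illustration only.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: each new center is drawn with probability
    proportional to its squared distance from the nearest existing center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def kmeans(points, k, iters=50, seed=0):
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # mean of the members, one coordinate at a time
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:        # keep the old center if a cluster went empty
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    wcss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wcss

# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, wcss = kmeans(points, k=2)
print(labels, round(wcss, 4))
```

Which group receives label 0 depends on the random seed, so only the grouping and the WCSS value are meaningful here.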
  11. Datasets
      • 2 test datasets:
        – ~10k tweets from New Hampshire that talk about the current election cycle
        – 20,000 tweets from a Samsung® announcement
      • We want to know if there are any interesting, natural groupings in the content that we should be aware of.
  12. Challenges in Clustering
      • Dimensionality vs. Sparseness
      • We tried clustering on:
        – Terms (single words) (stemmed + unstemmed)
        – Bigrams (stemmed + unstemmed)
        – Themes (stemmed + unstemmed)
      • Each one got a single mega-cluster
      • Data is too high-dimensional and sparse: need to reduce dimensions
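A quick way to see the sparseness problem is to measure what fraction of a term-document matrix is zero. The sketch below does that for a few invented short texts, with a whitespace tokenizer as a simplifying assumption; the function name is hypothetical.

```python
def term_document_sparsity(docs):
    """Return (vocabulary size, fraction of zero cells) for a binary
    term-document matrix built with a naive whitespace tokenizer."""
    token_sets = [set(d.lower().split()) for d in docs]
    vocab = set().union(*token_sets)
    nonzero = sum(len(tokens) for tokens in token_sets)
    total = len(docs) * len(vocab)
    return len(vocab), 1.0 - nonzero / total

# Invented tweet-like texts with no shared vocabulary.
docs = [
    "trump wins debate",
    "sanders town hall",
    "samsung supply issues",
]
vocab_size, sparsity = term_document_sparsity(docs)
print(vocab_size, round(sparsity, 3))
```

Every new short document here adds columns without filling many cells, so on real tweets with a vocabulary in the tens of thousands the zero fraction would be far closer to 1, and most pairs of documents share no terms at all.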
  13. Reducing Dimensions (and improving sparseness)
      • Principal Component Analysis (PCA) is native to H2O, so we tried that first.
        – PCA reduces dimensionality by first finding the “principal component” that accounts for the most variability.
        – Then, it finds the component that has the next largest variability, with the constraint that this component must be orthogonal to the first component
        – Lather, rinse, repeat
      • PCA ran for over a week on the fairly hefty cluster we were given to use, then went down.
      • PCA is thus too slow for this problem
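The "find the biggest direction, then the next orthogonal one" idea can be sketched on a tiny dense dataset. The version below uses power iteration with deflation, which is a technique chosen here for brevity and is not how H2O computes PCA; all function names and the toy rows are assumptions.

```python
def mean_center(rows):
    """Subtract each column's mean so the data is centered at the origin."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, iters=200):
    """Power iteration: returns the dominant direction and its variance."""
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-symmetric start
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, variance

def deflate(cov, v, variance):
    """Remove the found component so the next one is orthogonal to it."""
    d = len(cov)
    return [[cov[i][j] - variance * v[i] * v[j] for j in range(d)] for i in range(d)]

# Toy data lying close to the line y = x.
rows = [[2.0, 2.1], [1.0, 0.9], [0.0, 0.0], [-1.0, -1.1], [-2.0, -1.9]]
cov = covariance(mean_center(rows))
pc1, var1 = top_component(cov)
pc2, var2 = top_component(deflate(cov, pc1, var1))
print(pc1, var1)
print(pc2, var2)
```

The covariance matrix has one entry per pair of dimensions, so a dense approach like this grows quadratically with vocabulary size, which gives some intuition for why PCA on raw term features can be slow.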
  14. Word2Vec
      • Word2Vec is an open-source toolset for
        – calculating the cosine distance between words
        – categorizing words based on a training corpus
      • You can train it yourself on your own corpora, or can use some of the pre-trained Word2Vec models out there already (see below)
      • The word vectors (compared by cosine distance) can be used to reduce the dimensionality by mapping words into a fixed, arbitrary number of dimensions
      • We used 300, because This Is SPARTAAAAA!
        – Actually because we used the pre-existing Google model, which already has 300-dimensional vectors
        – https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
      https://code.google.com/p/word2vec/
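As a rough illustration of how word vectors turn sparse text into dense, fixed-length input for clustering, the sketch below computes cosine similarity and averages word vectors into a document vector. The slides do not say how tweets were converted to vectors, so the averaging step is an assumption, and the 3-dimensional `EMB` table is invented (the real Google model uses 300 dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def document_vector(tokens, embeddings):
    """Average the vectors of known tokens; returns None if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Toy embeddings, invented for illustration only.
EMB = {
    "senate":   [0.9, 0.1, 0.0],
    "congress": [0.8, 0.2, 0.0],
    "phone":    [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(EMB["senate"], EMB["congress"])
sim_unrelated = cosine_similarity(EMB["senate"], EMB["phone"])
doc_vec = document_vector(["senate", "congress", "unknownword"], EMB)
print(round(sim_related, 3), round(sim_unrelated, 3), doc_vec)
```

Every document ends up with the same small number of dense coordinates regardless of vocabulary size, which is the property that lets k-means stop collapsing everything into one cluster.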
  15. Clustering on Word2Vec processed content
      • Yay! We’re not getting one big cluster any more.
      • Now we need to figure out how many clusters are optimal
        – Remember, we’re just doing discovery here, so we don’t have to spend a lot of time optimizing
        – We tried 8, 30, and 100 clusters
  16. Politics-30-split: Cluster 14, Size = 305, Sentiment = -0.31
      Bigrams
      • #alpolitics #iacaucus
      • #alpolitics #tennessee
      • #alpolitics #ukip
      • #alpolitics @anncoulter
      • #alpolitics @realdonaldtrump
      • #gopdebate #nhpolitics
      • #iacaucus #alpolitics
      • #iacaucus #ukip
      • #iacaucus @anncoulter
      • #iacaucus @realdonaldtrump
      Trigrams
      • #alpolitics @anncoulter #tennessee
      • #alpolitics @anncoulter @vdare
      • #alpolitics @realdonaldtrump #ukip
      • #iacaucus #alpolitics #tennessee
      • #iacaucus #alpolitics #ukip
      • #iacaucus #alpolitics @anncoulter
      • #iacaucus #alpolitics @realdonaldtrump
      • #iacaucus @anncoulter #alpolitics
      • #iacaucus @anncoulter @vdare
      • #iacaucus @realdonaldtrump #ukip
      Terms
      • #GOPDebate
      • #Immigration
      • #NHGOP
      • #TPP
      • #UKIP
      • #alpolitics
      • #fitn
      • #iacaucus
      • #immigration
      • #nhpolitics
  17. Politics-30: Cluster 14, Size = 305, Sentiment = -0.31
      Entities
      • Bush(-9.89)
      • Marco Rubio(-16.83)
      • Hillary Clinton(-10.42)
      • Mexico(-0.63)
      • @AnnCoulter(-4.44)
      • AMNESTY(-8.94)
      • @realDonaldTrump(-4.89)
      • @BruceBourgoine(-0.23)
      • Mass(-0.88)
      • Libya(-0.23)
      Themes
      • open borders mass immigration(1.87500047684)
      • wage-reducing mass immigration(-6.61054801941)
      • nation-wrecking mass immigration(-4.53758764267)
      • alien invaders(-18.000005722)
      • legal immigration(-7.35771656036)
      • rancid whores(-8.327501297)
      • job-killing trade deal scams(-3.09999990463)
      • Trans-Pacific Partnership trade deal scam(-0.490000009537)
      • multicultural mayhem(-3.75)
      • treasonous rat(-3.75)
  18. Politics-100: Cluster 37, Size = 213, Sentiment = -0.38
      Entities
      • AMNESTY(-9.45)
      • @realDonaldTrump(-5.75)
      • @elraymer(0.00)
      • @MartinOMalley(0.00)
      • Martin O'Malley(-1.14)
      • Maryland(-1.00)
      • @RefugeeWatcher(0.00)
      • @vaughnFNC(-1.20)
      • Hillary Clinton(-0.72)
      • @CampaignReg(-0.13)
      Themes
      • illegal aliens(-4.00000047684)
      • wage-reducing mass immigration(-2.23903226852)
      • nation-wrecking mass immigration(-1.57310771942)
      • multicultural mayhem(-2.25)
      • alien invaders(-1.20000004768)
      • open border(0.450000017881)
      • big money donors(0.882178008556)
      • sanctuary state(-0.483333349228)
      • immigrant crime(-1.20319712162)
      • visa foreigners(-0.600000023842)
  19. Politics-30: Cluster 25, Size = 407, Sentiment = +0.27
      Terms
      • #11
      • #2016election
      • #3
      • #603forHRC
      • #911Anniversary
      • #ACEs
      • #Bernie2016
      • #BernieAtUNH
      • #Brooklyn
      • #CNN
      Bigrams
      • #11 candidate
      • #603forhrc #hillary2016
      • #603forhrc #newhampshire
      • #603forhrc @hillaryfornh
      • #603forhrc together
      • #bernie2016 #feelthebern
      • #brooklyn today
      • #carly2016 #fitn
      • #carly2016 #nhgop
      • #carly2016 listen
      Trigrams
      • #11 candidate i
      • #603forhrc #hillary2016 http
      • #carly2016 #fitn #nhgop
      • #carly2016 #fitn #nhpolitics
      • #climateactionnow thank @berniesanders
      • #cnndebate stage tonight
      • #delay #nh #nhpoli
      • #feelthebern #climateactionnow #stopthenhpipeline
      • #feelthebern #fitn #nhpolitics
      • #fitn #bernie2016 #feelthebern
  20. Politics-30-split: Cluster 25, Size = 407, Sentiment = +0.27
      Entities
      • @ThisWeekABC(0.00)
      • @donnabrazile(0.27)
      • Senator Bernie Sanders(1.81)
      • @Women4Bernie(0.00)
      • NH(5.36)
      • @BernieSanders(4.17)
      • @CornelWest(0.00)
      • RI Gov Lincoln Chafee(0.00)
      • 4(1.10)
      • Wheeler Hall(0.00)
      Themes
      • race car start(0.0)
      • town hall meeting(0.0)
      • 2nd day(0.487500011921)
      • inviting folks(0.24375000596)
      • great day(0.40000000596)
      • Convention crowd cheers(0.980000019073)
      • clear winner(0.490000009537)
      • 17 town hall(0.0)
      • Living room(0.0)
      • state convention(0.0)
  21. Samsung-30 Interesting Clusters (Themes Only)
      Cluster 5, Size = 50, Sentiment = +0.28
      • Android smartphone profits(8.37637424469)
      • filling pre-orders(-2.89171385765)
      • supply issues(-2.91585707664)
      • global supply shortages(-0.980000019073)
      • global rollout(-1.94809389114)
      • initial supplies(-1.94912362099)
      • mobile device market(5.76794099808)
      • global rollout(-0.895333886147)
      • Android profits(4.18818712234)
      • lion share(3.84502887726)
      Cluster 6, Size = 657, Sentiment = +0.32
      • Limited edition(2.96772003174)
      • 2 cover case leather sleeve brown(0.0)
      • Rechargeable Power(0.147000014782)
      • waxed leather(0.0)
      • Cheap price(0.32262301445)
      • Soft Skin(0.475291997194)
      • S line(0.237645998597)
      • Assorted Colors(0.118822999299)
  22. What else could we have done?
      • Different cluster sizes
      • Semantic meaning of the themes for associations
      • Pre-sorting based on queries of candidates, or topical queries
      • Gathering other examples for comparison
      • Building queries to pull out common items.
        – Look to see which clusters an item appears in: is it across all the clusters?
      • Demographic data, Klout scores
  23. Human/Computer Partnership
      [Diagram] Text Content → Reduce Dimensions → Cluster → Extract (Entities, Sentiment, Themes, Categories) → Examine → Pick one lens; loop if broken
      • Loop through to dive into one area, or segment text by features, then classify
      • E.g. “What are the clusters for each of the candidates?” or “I built classifiers for Solar Power, Fossil Fuels, Wind”
  24. Summary
      • Text Analytics relies heavily on machine learning to do its job
      • Text Analytics can come before other Machine Learning for predictive analysis
      • Machine Learning can come before Text Analytics for discovery processes
      • Reducing dimensionality of the text is an important step because of the sparse nature of the matrix
      • PCA was unsuitable (we used Word2Vec, but Sparse PCA might work as well)
      • For discovery, an interesting process is to loop, taking a lens built from the first run (entities, categories, etc.), and then going back to step one and looking at the related clusters for that lens
  25. Thanks!
      • H2O for providing us with all the processing power we needed and excellent technical support and tools. We were very impressed with their responsiveness and professionalism
      • Paul Barba for doing the heavy lifting
      • Y’all for listening :)
      • Happy Diwali Everyone!