II-SDV 2014 Organising Data: The step before visualisation (Nils C. Newman - Search Technology, USA & Alan L. Porter - Georgia Tech, USA)


Published in: Software, Technology, Education

  1. 1. Organizing Data: The Step Before Visualization • Nils C. Newman, Director, New Business Development at Search Technology & UNU-MERIT • Dr. Alan L. Porter, Director, R&D at Search Technology & Emeritus Professor, Georgia Tech
  2. 2. The way it was… • You would read information and filter the data through your mental framework, enabling discovery and synthesis
  3. 3. The way it is now… • Too much information for you to process readily by reading…
  4. 4. Enter Text Analysis… • If a computer can organize and present the data to you, then you can absorb more information faster than traditional reading
  5. 5. The challenge… • How can a computer look at a collection of information and turn those data into something organized - into a framework that you understand?
  6. 6. Two main issues to consider… • Do you want to impose order on the data? • Do you want to let the data self-organize?
  7. 7. The choice is important because it drives the math • [Diagram: methods arranged between Impose Order and Self Organize: LSA, PCA, TM, SVM, NLP, AS/PI; some have roots in Statistics, others have roots in Machine Learning]
  8. 8. Within Machine Learning, resources impact the decision… • Supervised training: requires time and effort by subject matter expert(s) • Unsupervised training: requires suitable quantities of training material and is computationally expensive
  9. 9. Within Statistics, data drive the decision… • Data signal: requires data with sufficiently strong signal and relatively low noise • Data homogeneity: requires that the records be sufficiently consistent (record to record)
  10. 10. Data Quality can help you make the decision… • [Diagram: a Data Quality axis running from High Noise to High Signal, with Supervised Machine Learning, Unsupervised Machine Learning, and Statistics placed along it]
  11. 11. But as with most things, it is never that easy…
  12. 12. Reality is usually an engineered hybrid approach • [Diagram labels: Impose Order, Self Organize]
  13. 13. But the hybrid approach adds complexity • The hybrid elements make things somewhat confusing but provide capabilities to address issues: known noise can be removed; signal can be amplified; steps can be hard-coded to reduce computational variability • As tool developers, we often hide these tweaks to make tools look simpler than they actually are
  14. 14. A hybrid example… • A core analytical approach in VantagePoint is a modified version of Principal Components Analysis (PCA) • We feed phrases created by a Natural Language Processing (NLP) algorithm into the PCA algorithm to self-organize data • So we are already using a hybrid system • However, a recently developed Topic Modeling (TM) algorithm looked like it would out-perform our PCA/NLP system • So we devised a series of tests pitting our PCA/NLP against TM
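To make the "phrases into PCA" idea concrete, here is a minimal sketch of plain textbook PCA (computed by power iteration) on a tiny document-by-phrase count matrix. It is not VantagePoint's modified PCA, whose details the slides do not give. The phrases, the counts, and the function name `first_component` are all hypothetical and chosen only for illustration.

```python
import math

def first_component(rows, iterations=100):
    """Return the first principal component of `rows` (documents x phrases)."""
    n, d = len(rows), len(rows[0])
    # Center each phrase column on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix between phrases (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration from a fixed, non-uniform starting vector.
    vec = [1.0 + 0.1 * i for i in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # start vector has no weight on any component
            break
        vec = [x / norm for x in nxt]
    return vec

# Hypothetical phrases, standing in for the output of an NLP phrase extractor.
phrases = ["titanium dioxide film", "photoanode",
           "liquid electrolyte", "redox couple"]
# Hypothetical counts: one row per abstract, one column per phrase.
counts = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

loadings = first_component(counts)
# Phrases whose loadings share a sign tend to co-occur in the same records.
group_pos = [p for p, w in zip(phrases, loadings) if w > 0]
group_neg = [p for p, w in zip(phrases, loadings) if w < 0]
```

The sign of a principal component is arbitrary, so only the split into two groups is meaningful, not which group comes out positive. Because the inputs are phrases rather than single words, each group already reads like a candidate cluster label.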
  15. 15. Round 1 • In round one, we compared our PCA/NLP approach to TM (Latent Dirichlet Allocation -- LDA) by analyzing a set of ~4,000 Dye-Sensitized Solar Cell (DSSC) abstracts • The LDA approach ran much faster, required less expertise to run, and gave reasonable results • However, this “bag of words” approach means that labeling the resulting clusters requires significant topical expertise • The PCA/NLP approach required more expertise to run but the results gave clearer answers (and reasonable cluster labels) • Judges’ Decision - Tie
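A small illustration of why a "bag of words" representation makes cluster labeling harder: word order is discarded, so two different word sequences can look identical to the model. This is a sketch with made-up strings and hypothetical helper names (`bag_of_words`, `bigrams`); it is not part of LDA itself, only the input representation that LDA-style topic models commonly assume.

```python
from collections import Counter

def bag_of_words(text):
    """Count words, ignoring their order."""
    return Counter(text.lower().split())

def bigrams(text):
    """Adjacent word pairs, a crude stand-in for NLP-extracted phrases."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

original = "dye sensitized solar cell"
shuffled = "solar dye cell sensitized"

# The two strings are indistinguishable as bags of words...
same_bag = bag_of_words(original) == bag_of_words(shuffled)
# ...but differ once word order (phrases) is kept.
same_phrases = bigrams(original) == bigrams(shuffled)
```

A topic built from such bags is a weighted list of single words, which is why turning it into a readable label tends to need someone who knows the field.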
  16. 16. Round 2 • In round two, we compared our PCA/NLP approach to several different TM approaches by analyzing a mixed set containing searches on 7 different topics • The results were judged on precision and recall • One particular TM approach worked really well • It out-performed our PCA approach and all other TM approaches • Judges’ Decision – TM variant a winner!
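For readers unfamiliar with the two scores, precision is the share of records placed in a cluster that truly belong to the topic, and recall is the share of the topic's records that the cluster captured. The sketch below assumes each cluster is scored against the set of records returned by one of the original searches; the slides do not say exactly how the scoring was done, and the record IDs are invented.

```python
def precision_recall(predicted, relevant):
    """Return (precision, recall) for one cluster against one known topic."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical record IDs.
cluster_records = [1, 2, 3, 4]    # records the algorithm put in the cluster
topic_records = [2, 3, 4, 5, 6]   # records retrieved by the topic's search

precision, recall = precision_recall(cluster_records, topic_records)
# 3 of 4 clustered records are on topic; 3 of 5 topic records were found.
```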
  17. 17. Round 3 • In round three, we tested the round two winner by analyzing a set of search results on similar topics • The results were encouraging but not as clear-cut as round two • Judges’ Decision – TM variant still a winner!
  18. 18. Round 4 • Not to be outdone by the TM team, our PCA team looked at the problem and decided that adding more tuning would be better than changing to TM • They layered multiple “simple” techniques together to create a new, more powerful PCA hybrid • The super hybrid system includes up to 10 different steps embodied in a single process: stopword removal; acronym identification; common word removal; term pruning; association rule based removal; term consolidation; etc…
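Layering simple cleanup steps ahead of the analysis can be sketched as a short pipeline. The version below covers only three of the named steps (stopword removal, term consolidation, term pruning) with deliberately naive rules. The stopword list, the plural-folding rule, the threshold, and all function names are assumptions for illustration, not the actual VantagePoint process.

```python
from collections import Counter

STOPWORDS = {"the", "of", "a", "and", "in", "for"}  # hypothetical, tiny list

def remove_stopwords(terms):
    return [t for t in terms if t not in STOPWORDS]

def consolidate(terms):
    # Naive consolidation: fold a trailing "s" so "cells" merges with "cell".
    # A real tool would need something smarter ("analysis" would break here).
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in terms]

def prune(docs, min_docs=2):
    # Keep only terms that appear in at least `min_docs` documents.
    doc_freq = Counter(t for doc in docs for t in set(doc))
    return [[t for t in doc if doc_freq[t] >= min_docs] for doc in docs]

def pipeline(raw_docs):
    docs = [consolidate(remove_stopwords(d.lower().split())) for d in raw_docs]
    return prune(docs)

cleaned = pipeline([
    "the efficiency of solar cells",
    "a solar cell electrolyte",
    "electrolyte stability in cells",
])
```

Each step is trivial on its own; the effect comes from the order and the combination, which is also why such a process is hard for a user to see into once it is wrapped in a single command.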
  19. 19. The result? • The fight is still ongoing, but the improved PCA looks set to keep pace with TM while maintaining its dominance in cluster naming • The VantagePoint “Cluster Suite + PCA” approach is certainly ahead in usability • We have the next bout scheduled for later this year • Who is ready to Byte?
  20. 20. Why tell you all this? • I wanted to give you a little insight into how tool developers think • The recent explosive growth in algorithms means that we have a lot of different approaches from which to choose • The growth in computing power means we can operate at a scale unheard of a decade ago • We are driven to make the tools more effective and easier to use • However, doing so often makes tools more opaque to the user
  21. 21. What does all this mean to you? • There is no “one size fits all” when it comes to text analytics • Analytical techniques still need to be matched to your data and your problems • The state of the art is rapidly evolving • You need to have a good sense of what is going on “under the hood” of the tools that you use
  22. 22. Why bother? • Understanding a little about how your tools work is critical BEFORE you confound the situation by adding visualization on top of the analysis • Otherwise, you have to take it on faith that what we are doing suits your analytical situation
  23. 23. Questions? Thank you!