Taming Text


Presentation on Taming Text from the March 18th, 2013 Triangle Java User Group meeting. It covers search, question answering, clustering, classification, named entity recognition, etc. See for more.

Published in: Technology

  1. Taming Text
     Grant Ingersoll, CTO, LucidWorks
     @tamingtext, @gsingers
  2. About the Book
     • Goal: An engineer’s guide to search, Natural Language Processing (NLP), and Machine Learning
     • Target Audience: You
     • All examples in Java, but concepts easily ported
     • Covers:
       – Search, fuzzy string matching, human language basics, clustering, classification, Question Answering, intro to advanced topics
  3. Answer Me This!
     • What is trimethylbenzene?
       – http://localhost:8983/solr/answer?q=What+is+trimethylbenzene%3F&defType=qa&qa=true&qa.qf=body
     • Who is ten minute warning?
       – http://localhost:8983/solr/answer?q=who+is+ten+minute+warning%3F&defType=qa&qa=true&qa.qf=body
     • What station serves the A train?
       – http://localhost:8983/solr/answer?q=what+station+serves+the+A+train%3F&defType=qa&qa=true&qa.qf=body
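The request URLs on the slide above all follow one pattern: the question is URL-encoded into `q`, and `defType=qa`, `qa=true`, and `qa.qf=body` are appended. A minimal sketch of building such a URL in plain Java follows. The class and method names (`QaUrlBuilder`, `build`) are hypothetical and not taken from the book's code; only the base URL and parameter names come from the slide.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not from the Taming Text code base) that rebuilds
// the demo URLs shown on the slide from a natural language question.
class QaUrlBuilder {

    // baseUrl is assumed to be the Solr QA handler, e.g. http://localhost:8983/solr/answer
    static String build(String baseUrl, String question) {
        // URLEncoder produces form encoding: spaces become '+', '?' becomes %3F
        String q = URLEncoder.encode(question, StandardCharsets.UTF_8);
        return baseUrl + "?q=" + q + "&defType=qa&qa=true&qa.qf=body";
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8983/solr/answer",
                                 "What is trimethylbenzene?"));
    }
}
```

Running `main` prints the first URL from the slide. `URLEncoder.encode(String, Charset)` needs Java 10 or later; on older JDKs the overload taking the charset name `"UTF-8"` does the same job.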
  4. Fact-based QA Demo
  5. What does it take to build this system?
  6. Agenda
     • Question Answering In Detail
       – Building Blocks
       – Indexing
       – Search/Passage Retrieval
       – Classification
       – Scoring
     • Other Interesting Topics
       – Clustering
       – Fuzzy-Wuzzy Strings
     • What’s next?
     • Resources
  7. A Grain of Salt
     • Text is a strange and magical world filled with…
       – Evil villains
       – Jesters
       – Wizards
       – Unicorns
       – Heroes!
     • In other words, no system will be perfect
  8. The Ugly Truth
     • You will spend most of your time in NLP, search, etc. doing “grunt” work nicely labeled as:
       – Preprocessing
       – Feature Selection
       – Sampling
       – Validation/testing/etc.
       – Content extraction
       – ETL
     • Corollary: Start with simple, tried and true algorithms, then iterate
  9. Getting Started
     • git clone
     • See the README for prerequisites
     • ./bin contains useful scripts to get started
     • You’ll need to download some pretty big dependencies:
       – OpenNLP Models
       – WordNet
       – Wikipedia subset
  10. Question Answering (QA)
  11. What is QA?
      • You’ve seen QA in action already thanks to IBM and Jeopardy!
      • Instead of providing 10 blue links, provide the answer!
      • Exercises many search and NLP features
      • See Ch. 8
  12. Simple QA Workflow
  13. Building Blocks
      • Sentence Detection
      • Part of Speech Tagging
      • Parsing
      • Ch. 2
  14. QA in Taming Text
      • Apache Solr for Passage Retrieval and integration
      • Apache OpenNLP for sentence detection, parsing, POS tagging and answer type classification
      • Custom code for Query Parsing, Scoring
        – See package
      • Wikipedia for “truth”
  15. Demo
      • $TT_HOME/bin/ solr-qa
        – http://localhost:8983/solr/answer
      • Once that is up and running
        – $TT_HOME/bin/ --wikiFile ~/projects/manning/ ase-wex-2011-01-18-articles-first10k.tsv
      • When done, you can ask questions!
  16. Indexing
      • Ingest raw data into the system and make it available for search
      • Garbage In, Garbage Out
        – Need to spend some time understanding and modeling your data, just like you would with a DB
        – Lather, rinse, repeat
      • See $TT_HOME/apache-solr/solr-qa/conf/schema.xml for setup
      • for indexing code
  17. Aside: Named Entity Recognition
      • NER is the process of extracting proper names, etc. from text
      • Plays a vital role in QA and many other NLP systems
      • Often solved using classification approaches
  18. • Custom Query Parser takes in the user’s natural language query, classifies it to find the Answer Type, and generates a Solr query
      • Retrieve candidate passages that match keywords and expected answer type
      • Unlike keyword search, we need to know exactly where matches occur
  19. Answer Type Classification
      • Answer Type examples:
        – Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M)
        – See page 248 for more
      • Train an OpenNLP classifier on a set of previously annotated questions, e.g.:
        – P Which French monarch reinstated the divine right of the monarchy to France and was known as 'The Sun King' because of the splendour of his reign?
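The annotated question on the slide above starts with a one-letter answer type code followed by the question text. As a rough sketch, assuming every training line has exactly that "code, space, question" shape (the real training file format may differ), such a line could be split in plain Java like this. `AnswerTypeSample` is a hypothetical name; this is not the OpenNLP training API, and the code table lists only the six types named on the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical holder for one annotated training question.
// Assumption: each line is "<code> <question>", as in the slide's example.
class AnswerTypeSample {

    // Only the answer types listed on the slide; the book defines more.
    static final Map<String, String> TYPES = new LinkedHashMap<>();
    static {
        TYPES.put("P", "Person");
        TYPES.put("L", "Location");
        TYPES.put("O", "Organization");
        TYPES.put("T", "Time Point");
        TYPES.put("R", "Duration");
        TYPES.put("M", "Money");
    }

    final String code;
    final String question;

    AnswerTypeSample(String code, String question) {
        this.code = code;
        this.question = question;
    }

    // Splits a line at the first space into the type code and the question text.
    static AnswerTypeSample parse(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("Expected '<code> <question>': " + line);
        }
        String code = trimmed.substring(0, space);
        if (!TYPES.containsKey(code)) {
            throw new IllegalArgumentException("Unknown answer type code: " + code);
        }
        return new AnswerTypeSample(code, trimmed.substring(space + 1).trim());
    }

    public static void main(String[] args) {
        AnswerTypeSample s = parse("P Which French monarch was known as 'The Sun King'?");
        System.out.println(TYPES.get(s.code) + " <- " + s.question);
    }
}
```

Pairs of this kind (label plus question text) are what a classifier would be trained on; turning the question text into features is a separate step not shown here.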
  20. Scoring
  21. Other Areas of NLP/Machine Learning
  22. Clustering
      • Group together content based on some notion of similarity
      • Book covers (ch. 6):
        – Search result clustering using Carrot2
        – Whole collection clustering using Mahout
        – Topic Modeling
      • Mahout comes with many different algorithms
  23. Clustering Use Cases
      • Google News
      • Outlier detection in smart grids
      • Recommendations
        – Products
        – People, etc.
  24. In Focus: K-Means
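The K-Means slide survives here only as a title, so as a reminder of the idea: K-Means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points, until assignments stop changing. The following is a minimal plain-Java sketch of that loop (Lloyd's algorithm) over dense `double[]` vectors with caller-supplied starting centroids. It is not Mahout's API, and `KMeansSketch` is a hypothetical name.

```java
import java.util.Arrays;

// Minimal K-Means (Lloyd's algorithm) sketch; not Mahout's implementation.
class KMeansSketch {

    // Returns the cluster index of each point; centroids are updated in place.
    static int[] cluster(double[][] points, double[][] centroids, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int best = nearest(points[i], centroids);
                if (best != assignment[i]) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched cluster
            }
            // Update step: move each centroid to the mean of its points.
            for (int c = 0; c < centroids.length; c++) {
                double[] sum = new double[centroids[c].length];
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] == c) {
                        for (int d = 0; d < sum.length; d++) {
                            sum[d] += points[i][d];
                        }
                        count++;
                    }
                }
                if (count > 0) { // an empty cluster keeps its old centroid
                    for (int d = 0; d < sum.length; d++) {
                        centroids[c][d] = sum[d] / count;
                    }
                }
            }
        }
        return assignment;
    }

    // Index of the centroid with the smallest squared Euclidean distance to p.
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int d = 0; d < p.length; d++) {
                double diff = p[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 2}, {8, 8}, {9, 8}};
        double[][] centroids = {{0, 0}, {10, 10}};
        System.out.println(Arrays.toString(cluster(points, centroids, 10)));
    }
}
```

For text, the points would be document vectors (for example term weights) and the distance measure is often cosine rather than Euclidean; both choices are left out of this sketch.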
  25. Fuzzy-Wuzzy Strings
      • Fuzzy string matching is a common, and difficult, problem
      • Useful for solving problems like:
        – “Did you mean” spell checking
        – Auto-suggest
        – Record linkage
  26. Common Approaches
      • See com.tamingtext.fuzzy package
      • Jaccard
        – Measure character overlap
      • Levenshtein (Edit Distance)
        – Count the number of edits required to transform one word into the other
      • Jaro-Winkler
        – Account for position
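Two of the measures above are short enough to sketch directly. The following is a minimal plain-Java version of Jaccard similarity over character sets and of Levenshtein distance using the standard dynamic-programming recurrence. It is an illustration only and not the code in the com.tamingtext.fuzzy package; `FuzzySketch` is a hypothetical name, and Jaro-Winkler is omitted.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative fuzzy matching measures; not the book's implementation.
class FuzzySketch {

    // Jaccard similarity: |A ∩ B| / |A ∪ B| over the sets of characters.
    static double jaccard(String a, String b) {
        Set<Character> setA = chars(a);
        Set<Character> setB = chars(b);
        if (setA.isEmpty() && setB.isEmpty()) {
            return 1.0; // two empty strings are treated as identical
        }
        Set<Character> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        Set<Character> union = new HashSet<>(setA);
        union.addAll(setB);
        return (double) intersection.size() / union.size();
    }

    private static Set<Character> chars(String s) {
        Set<Character> out = new HashSet<>();
        for (char c : s.toCharArray()) {
            out.add(c);
        }
        return out;
    }

    // Levenshtein distance: minimum number of single-character insertions,
    // deletions, and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // 3
        System.out.println(jaccard("abc", "bcd"));            // 0.5
    }
}
```

Jaccard over character sets ignores order and repetition, which is why position-aware measures such as Jaro-Winkler exist; the two-row Levenshtein keeps memory proportional to the shorter dimension of the table rather than the full matrix.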
  27. Trie
      • The Trie is a very useful data structure for working with strings
      • Find common subsequences
      • Auto-suggest, others
      • Ternary Search Trie
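To make the auto-suggest use case concrete, here is a minimal sketch of a trie with prefix lookup in plain Java. It uses a sorted map of children per node, so it is a plain trie rather than the Ternary Search Trie named on the slide, and `TrieSketch` is a hypothetical name not taken from the book's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal trie for prefix-based auto-suggest (not a ternary search trie).
class TrieSketch {

    private static final class Node {
        // TreeMap keeps children sorted, so suggestions come back in order.
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    // Adds a word by walking or creating one node per character.
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    // Returns every stored word that starts with the given prefix.
    List<String> suggest(String prefix) {
        List<String> results = new ArrayList<>();
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return results; // no stored word has this prefix
            }
        }
        collect(node, new StringBuilder(prefix), results);
        return results;
    }

    // Depth-first walk below a node, emitting each complete word found.
    private void collect(Node node, StringBuilder current, List<String> results) {
        if (node.isWord) {
            results.add(current.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            current.append(e.getKey());
            collect(e.getValue(), current, results);
            current.deleteCharAt(current.length() - 1);
        }
    }

    public static void main(String[] args) {
        TrieSketch trie = new TrieSketch();
        for (String w : new String[] {"solr", "solar", "search", "mahout"}) {
            trie.insert(w);
        }
        System.out.println(trie.suggest("sol")); // [solar, solr]
    }
}
```

A ternary search trie stores the same information with three links per node (lower, equal, higher) instead of a map, which typically trades some lookup speed for a smaller memory footprint.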
  28. What’s Next?
  29. Much Harder Problems
      • Chapter 9
      • Semantics, Pragmatics and beyond
      • Sentiment Analysis
      • Document and collection summarization
      • Relationship Extraction
      • Cross-language Search
      • Importance
  30. Thank You!
      • 3 copies of Taming Text
  31. Resources
      • gersoll – book
      • @tamingtext
      • Me:
        – @gsingers