Presentation from March 18th, 2013 Triangle Java User Group on Taming Text. Presentation covers search, question answering, clustering, classification, named entity recognition, etc. See http://www.manning.com/ingersoll for more.
Taming Text Grant Ingersoll CTO, LucidWorks @tamingtext, @gsingers
About the Book• Goal: An engineer’s guide to search and Natural Language Processing (NLP) and Machine Learning• Target Audience: You• All examples in Java, but concepts easily ported• Covers: – Search, Fuzzy string matching, human language basics, clustering, classification, Question Answering, Intro to advanced topics
Answer Me This!• What is trimethylbenzene? – http://localhost:8983/solr/answer?q=What+is+trimethylbenzene%3F&defType=qa&qa=true&qa.qf=body• who is ten minute warning? – http://localhost:8983/solr/answer?q=who+is+ten+minute+warning%3F&defType=qa&qa=true&qa.qf=body• what station serves the A train? – http://localhost:8983/solr/answer?q=what+station+serves+the+A+train%3F&defType=qa&qa=true&qa.qf=body
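The demo URLs above are ordinary form-encoded Solr requests. A minimal sketch of building one from a raw question, assuming the same host, handler and parameters shown on the slide (the class and method names here are hypothetical and not taken from the book's code):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch only: builds a request URL shaped like the ones on the slide.
final class AnswerUrlSketch {
    static String answerUrl(String question) {
        try {
            // URLEncoder turns spaces into '+' and '?' into "%3F".
            return "http://localhost:8983/solr/answer?q="
                + URLEncoder.encode(question, "UTF-8")
                + "&defType=qa&qa=true&qa.qf=body";
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(answerUrl("What is trimethylbenzene?"));
    }
}
```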
Agenda• Question Answering In Detail – Building Blocks – Indexing – Search/Passage Retrieval – Classification – Scoring• Other Interesting Topics – Clustering – Fuzzy-Wuzzy Strings• What’s next?• Resources
A Grain of Salt • Text is a strange and magical world filled with… – Evil villains – Jesters – Wizards – Unicorns – Heroes! • In other words, no system will be perfect http://images1.wikia.nocookie.net/__cb20121110131756/lotr/images/thumb/e/e7/Gandalf_the_Grey.jpg/220px-Gandalf_the_Grey.jpg
The Ugly Truth• You will spend most of your time in NLP, search, etc. doing “grunt” work nicely labeled as: – Preprocessing – Feature Selection – Sampling – Validation/testing/etc. – Content extraction – ETL• Corollary: Start with simple, tried and true algorithms, then iterate
Getting Started• git clone git@github.com:tamingtext/book.git• See the README for pre-requisites• ./bin contains useful scripts to get started• You’ll need to download some pretty big dependencies: – OpenNLP Models – WordNet – Wikipedia subset
Building Blocks• Sentence Detection• Part of Speech Tagging• Parsing• Ch. 2
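As a rough illustration of the first building block, sentence detection, here is a sketch that uses the JDK's rule-based `java.text.BreakIterator`. This is a stand-in chosen so the example has no dependencies; the book's QA system uses Apache OpenNLP's trained models instead, and the class name below is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only: rule-based sentence splitting with the JDK,
// not the trained OpenNLP sentence detector.
final class SentenceSketch {
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        // Walk consecutive boundaries; each [start, end) span is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : sentences("Solr retrieves passages. OpenNLP tags them. Then we score.")) {
            System.out.println(s);
        }
    }
}
```

Rule-based splitting tends to stumble on abbreviations and unusual punctuation, which is one reason a trained model is often preferred.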
QA in Taming Text• Apache Solr for Passage Retrieval and integration• Apache OpenNLP for sentence detection, parsing, POS tagging and answer type classification• Custom code for Query Parsing, Scoring – See com.tamingtext.qa package• Wikipedia for “truth”
Demo• $TT_HOME/bin/start-solr.sh solr-qa – http://localhost:8983/solr/answer• Once that is up and running – $TT_HOME/bin/indexWikipedia.sh --wikiFile ~/projects/manning/maven.tamingtext.com/freebase-wex-2011-01-18-articles-first10k.tsv• When done, you can ask questions!
Indexing• Ingest raw data into the system and make it available for search• Garbage In, Garbage Out – Need to spend some time understanding and modeling your data just like you would with a DB – Lather, rinse, repeat• See the $TT_HOME/apache-solr/solr-qa/conf/schema.xml for setup• WikipediaWexIndexer.java for indexing code
Aside: Named Entity Recognition• NER is the process of extracting proper names, etc. from text• Plays a vital role in a QA and many other NLP systems• Often solved using classification approaches
• Custom Query Parser takes in user’s natural language query, classifies it to find the Answer Type and generates Solr query• Retrieve candidate passages that match keywords and expected answer type• Unlike keyword search, we need to know exactly where matches occur
Answer Type Classification• Answer Type examples: – Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M) – See page 248 for more• Train an OpenNLP classifier off of a set of previously annotated questions, e.g.: – P Which French monarch reinstated the divine right of the monarchy to France and was known as `The Sun King because of the splendour of his reign?
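The annotated example above suggests a simple layout: an answer type code, a space, then the question text. A minimal sketch of splitting such a line into a label and a question before handing it to a trainer, assuming that layout holds for the whole training file (the class below is hypothetical and is not the book's or OpenNLP's API):

```java
// Sketch only: assumes each training line is "<label> <question>".
final class AnswerTypeExample {
    final String label;
    final String question;

    AnswerTypeExample(String label, String question) {
        this.label = label;
        this.question = question;
    }

    static AnswerTypeExample parse(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("expected '<label> <question>': " + line);
        }
        // Everything before the first space is the answer type code.
        return new AnswerTypeExample(trimmed.substring(0, space),
                                     trimmed.substring(space + 1).trim());
    }

    public static void main(String[] args) {
        AnswerTypeExample ex = parse("P Which French monarch was known as the Sun King?");
        System.out.println(ex.label + " -> " + ex.question);
    }
}
```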
Clustering• Group together content based on some notion of similarity• Book covers (ch. 6): – Search result clustering using Carrot2 – Whole collection clustering using Mahout – Topic Modeling• Mahout comes with many different algorithms
Clustering Use Cases• Google News• Outlier detection in smart grids• Recommendations – Products – People, etc.
In Focus: K-Means http://en.wikipedia.org/wiki/K-means_clustering
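For readers who want to see the mechanics, here is a small in-memory sketch of the standard k-means loop (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat until nothing changes. It is not Mahout's implementation; the class name is hypothetical, and the caller is assumed to supply the initial centroids.

```java
import java.util.Arrays;

// Sketch only: plain in-memory k-means, not the Mahout implementation.
final class KMeansSketch {
    /** Returns the cluster index of each point; centroids are updated in place. */
    static int[] cluster(double[][] points, double[][] centroids, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: attach each point to its nearest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < centroids.length; c++) {
                    double dist = squaredDistance(points[i], centroids[c]);
                    if (dist < bestDist) {
                        bestDist = dist;
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // no point moved, so the clustering has converged
            }
            // Update step: move each centroid to the mean of its points.
            for (int c = 0; c < centroids.length; c++) {
                double[] sum = new double[centroids[c].length];
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] != c) continue;
                    for (int j = 0; j < sum.length; j++) sum[j] += points[i][j];
                    count++;
                }
                if (count == 0) continue; // an empty cluster keeps its old centroid
                for (int j = 0; j < sum.length; j++) centroids[c][j] = sum[j] / count;
            }
        }
        return assignment;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}
```

The result depends on the starting centroids, so real systems usually pick them randomly or with a seeding heuristic and may run the loop several times.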
Fuzzy-Wuzzy Strings• Fuzzy string matching is a common, and difficult, problem• Useful for solving problems like: – “Did you mean” spell checking – Auto-suggest – Record linkage
Common Approaches• See com.tamingtext.fuzzy package• Jaccard – Measure character overlap• Levenshtein (Edit Distance) – Count the number of edits required to transform one word into the other• Jaro-Winkler – Account for position
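Two of the measures above are short enough to sketch directly. The version below is illustrative only and is not the code from the `com.tamingtext.fuzzy` package; the class and method names are hypothetical. Jaccard is computed over the set of distinct characters in each string, and Levenshtein uses the usual dynamic-programming table reduced to two rows.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch only: textbook versions of two fuzzy-matching measures.
final class FuzzySketch {
    /** Jaccard similarity: shared distinct characters over all distinct characters. */
    static double jaccard(String a, String b) {
        Set<Character> setA = chars(a);
        Set<Character> setB = chars(b);
        if (setA.isEmpty() && setB.isEmpty()) {
            return 1.0; // two empty strings are treated as identical
        }
        Set<Character> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        Set<Character> union = new HashSet<>(setA);
        union.addAll(setB);
        return (double) intersection.size() / union.size();
    }

    private static Set<Character> chars(String s) {
        Set<Character> out = new HashSet<>();
        for (char c : s.toCharArray()) out.add(c);
        return out;
    }

    /** Levenshtein distance: minimum insertions, deletions and substitutions. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into b[0..j) takes j insertions.
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For example, `FuzzySketch.levenshtein("kitten", "sitting")` is 3 (two substitutions and one insertion). Character-set Jaccard ignores order and repetition, which is why position-aware measures such as Jaro-Winkler are listed alongside it.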
Trie• The Trie is a very useful data structure for working with strings• Find common subsequences• Auto-suggest, others• Ternary Search Trie
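To make the auto-suggest use case concrete, here is a sketch of a basic trie with one child map per node. Note that this is the simple variant, not the ternary search trie named on the slide, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: a map-per-node trie, not a ternary search trie.
final class TrieSketch {
    private static final class Node {
        // TreeMap keeps children sorted, so suggestions come out alphabetically.
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            Node next = node.children.get(c);
            if (next == null) {
                next = new Node();
                node.children.put(c, next);
            }
            node = next;
        }
        node.isWord = true;
    }

    /** Returns every stored word that starts with the given prefix. */
    List<String> suggest(String prefix) {
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return new ArrayList<>(); // no stored word has this prefix
            }
        }
        List<String> out = new ArrayList<>();
        collect(node, new StringBuilder(prefix), out);
        return out;
    }

    // Depth-first walk below the prefix node, rebuilding each word along the way.
    private void collect(Node node, StringBuilder path, List<String> out) {
        if (node.isWord) out.add(path.toString());
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey().charValue());
            collect(e.getValue(), path, out);
            path.setLength(path.length() - 1);
        }
    }
}
```

After adding "solr", "solar" and "search", `suggest("sol")` returns "solar" and "solr", in that order. A ternary search trie trades the per-node map for three child pointers, which typically uses less memory on large vocabularies.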