Taming Text

Grant Ingersoll
CTO, LucidWorks
@tamingtext, @gsingers

About the Book
• Goal: An engineer’s guide to search and Natural
Language Processing (NLP) and Machine Learning
• Target Audience: You
• All examples in Java, but concepts easily ported
• Covers:
– Search, Fuzzy string matching, human language basics,
clustering, classification, Question Answering, Intro to
advanced topics

Answer Me This!
• What is trimethylbenzene?
– http://localhost:8983/solr/answer?q=What+is+trimethylbenzene%3F&defType=qa&qa=true&qa.qf=body

• who is ten minute warning?
– http://localhost:8983/solr/answer?q=who+is+ten+minute+warning%3F&defType=qa&qa=true&qa.qf=body

• what station serves the A train?
– http://localhost:8983/solr/answer?q=what+station+serves+the+A+train%3F&defType=qa&qa=true&qa.qf=body
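
The request URLs above all share one shape: the question is URL-encoded into q, and the remaining parameters stay fixed. A minimal sketch of building such a URL follows. The class and method names (QaUrlBuilder, build) are hypothetical and not taken from the book's code; the endpoint and parameters are copied from the slide.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class QaUrlBuilder {
    // Endpoint as shown on the slide; adjust host and port for your own setup.
    static final String BASE = "http://localhost:8983/solr/answer";

    // Encodes the question and appends the fixed QA parameters from the slide.
    static String build(String question) {
        try {
            String q = URLEncoder.encode(question, "UTF-8");
            return BASE + "?q=" + q + "&defType=qa&qa=true&qa.qf=body";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling QaUrlBuilder.build("What is trimethylbenzene?") reproduces the first URL on this slide.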

What does it take to build this system?

Agenda
• Question Answering In Detail
– Building Blocks
– Indexing
– Search/Passage Retrieval
– Classification
– Scoring

• Other Interesting Topics
– Clustering
– Fuzzy-Wuzzy Strings
• What’s next?
• Resources

A Grain of Salt
• Text is a strange and magical world filled
with…
– Evil villains
– Jesters
– Wizards
– Unicorns
– Heroes!
• In other words, no system will be perfect

The Ugly Truth
• You will spend most of your time in NLP, search,
etc. doing “grunt” work nicely labeled as:
– Preprocessing
– Feature Selection
– Sampling
– Validation/testing/etc.
– Content extraction
– ETL
• Corollary: Start with simple, tried and true
algorithms, then iterate

Getting Started
• git clone git@github.com:tamingtext/book.git
• See the README for pre-requisites
• ./bin contains useful scripts to get started
• You’ll need to download some pretty big
dependencies:
– OpenNLP Models
– WordNet
– Wikipedia subset

What is QA?
• You’ve seen QA in action
already thanks to IBM and
Jeopardy! 

• Instead of providing 10 blue
links, provide the answer!

• Exercises many search and
NLP features
• See Ch. 8

Building Blocks
• Sentence Detection

• Part of Speech Tagging

• Parsing

• Ch. 2
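
To make the first building block concrete, here is a rough sketch of sentence detection. It uses the JDK's rule-based java.text.BreakIterator as a stand-in, not the trained OpenNLP sentence detector the book relies on, so expect it to behave differently on abbreviations and other hard cases. The class name SentenceSplitter is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentenceSplitter {
    // Splits text into sentences using the JDK's built-in boundary rules.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }
}
```

Part of speech tagging and parsing need trained models, which is why the book delegates them to OpenNLP rather than hand-written rules.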

QA in Taming Text
• Apache Solr for Passage Retrieval and
integration
• Apache OpenNLP for sentence detection,
parsing, POS tagging and answer type
classification
• Custom code for Query Parsing, Scoring
– See com.tamingtext.qa package
• Wikipedia for “truth”

Demo
• $TT_HOME/bin/start-solr.sh solr-qa
– http://localhost:8983/solr/answer
• Once that is up and running
– $TT_HOME/bin/indexWikipedia.sh --wikiFile ~/projects/manning/maven.tamingtext.com/freebase-wex-2011-01-18-articles-first10k.tsv
• When done, you can ask questions!

Indexing
• Ingest raw data into the system and make it
available for search
• Garbage In, Garbage Out
– Need to spend some time understanding and
modeling your data just like you would with a DB
– Lather, rinse, repeat
• See $TT_HOME/apache-solr/solr-qa/conf/schema.xml for setup
• WikipediaWexIndexer.java for indexing code

Aside: Named Entity Recognition

• NER is the process of extracting proper names, etc.
from text
• Plays a vital role in QA and many other NLP systems
• Often solved using classification approaches

Search/Passage Retrieval
• Custom query parser takes in the user’s natural
language query, classifies it to find the Answer
Type, and generates a Solr query
• Retrieve candidate passages that match
keywords and expected answer type
• Unlike keyword search, we need to know
exactly where matches occur
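
The last point is worth illustrating: a QA system needs token positions, not just a yes/no match, so that candidate answers near the matched keywords can be scored. The sketch below records the token offsets at which query terms occur in a passage. It is a simplified illustration with a naive regex tokenizer and a hypothetical class name (MatchPositions); it is not how the com.tamingtext.qa code or Solr obtains positions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchPositions {
    private static final Pattern TOKEN = Pattern.compile("\\w+");

    // Returns the 0-based token positions in the passage that match any query term.
    static List<Integer> find(String passage, String... queryTerms) {
        Set<String> terms = new HashSet<>();
        for (String term : queryTerms) {
            terms.add(term.toLowerCase(Locale.ROOT));
        }
        List<Integer> positions = new ArrayList<>();
        Matcher m = TOKEN.matcher(passage);
        int index = 0;
        while (m.find()) {
            if (terms.contains(m.group().toLowerCase(Locale.ROOT))) {
                positions.add(index);
            }
            index++;
        }
        return positions;
    }
}
```

With positions in hand, a scorer can favor passages where the matched terms sit close to an entity of the expected answer type.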

Answer Type Classification
• Answer Type examples:
– Person (P), Location (L), Organization (O), Time
Point (T), Duration (R), Money (M)
– See page 248 for more
• Train an OpenNLP classifier off of a set of
previously annotated questions, e.g.:
– P Which French monarch reinstated the divine
right of the monarchy to France and was known as
`The Sun King' because of the splendour of his
reign?
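
The annotated example above pairs an answer-type code with a question. A small sketch of reading such a line into a (label, question) pair is shown below. It assumes the label is the first whitespace-delimited token, as on the slide; the actual training file format used by the book may differ, and the class name LabeledQuestion is hypothetical.

```java
final class LabeledQuestion {
    final String answerType; // e.g. "P" for Person
    final String question;

    LabeledQuestion(String answerType, String question) {
        this.answerType = answerType;
        this.question = question;
    }

    // Assumed format: "<answer type code> <question text>".
    static LabeledQuestion parse(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("Expected '<type> <question>': " + line);
        }
        return new LabeledQuestion(trimmed.substring(0, space),
                                   trimmed.substring(space + 1).trim());
    }
}
```

Pairs like these are what a classifier trainer consumes: features come from the question text, and the answer type is the outcome to predict.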

Other Areas of NLP/Machine Learning

Clustering
• Group together content based
on some notion of similarity
• Book covers (ch. 6):
– Search result clustering using
Carrot2
– Whole collection clustering using
Mahout
– Topic Modeling
• Mahout comes with many
different algorithms

Clustering Use Cases
• Google News

• Outlier detection in smart grids

• Recommendations
– Products
– People, etc.

In Focus: K-Means

http://en.wikipedia.org/wiki/K-means_clustering
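
For readers who want to see the loop itself, here is a minimal sketch of the standard K-Means iteration (Lloyd's algorithm): assign each point to its nearest centroid, recompute each centroid as the mean of its points, and stop when assignments no longer change. This is an in-memory illustration with caller-supplied starting centroids, not Mahout's implementation; the class name KMeansSketch is hypothetical.

```java
import java.util.Arrays;

final class KMeansSketch {
    // Index of the centroid closest to p by squared Euclidean distance.
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int i = 0; i < p.length; i++) {
                double diff = p[i] - centroids[c][i];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // Runs K-Means; centroids are updated in place and assignments are returned.
    static int[] cluster(double[][] points, double[][] centroids, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int c = nearest(points[p], centroids);
                if (c != assignment[p]) {
                    assignment[p] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged
            }
            for (int c = 0; c < centroids.length; c++) {
                double[] sum = new double[centroids[c].length];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] != c) continue;
                    count++;
                    for (int i = 0; i < sum.length; i++) sum[i] += points[p][i];
                }
                if (count == 0) continue; // leave an empty cluster's centroid where it is
                for (int i = 0; i < sum.length; i++) centroids[c][i] = sum[i] / count;
            }
        }
        return assignment;
    }
}
```

For text, the points would be document vectors (for example TF-IDF weights), and the choice of starting centroids and distance measure matters a great deal in practice.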

Fuzzy-Wuzzy Strings

• Fuzzy string matching is a common, and difficult,
problem
• Useful for solving problems like:
– “Did you mean” spell checking
– Auto-suggest
– Record linkage

Common Approaches
• See com.tamingtext.fuzzy package
• Jaccard
– Measure character overlap
• Levenshtein (Edit Distance)
– Count the number of edits required to transform
one word into the other
• Jaro-Winkler
– Account for position
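
Two of these measures are short enough to sketch directly. The code below is an illustrative version of character-set Jaccard similarity and Levenshtein edit distance, written for this summary; it is not the code from the com.tamingtext.fuzzy package, and the class name FuzzySketch is hypothetical. Jaro-Winkler is omitted for brevity.

```java
import java.util.HashSet;
import java.util.Set;

final class FuzzySketch {
    // Jaccard similarity over the sets of characters in each string:
    // |intersection| / |union|, ignoring order and repetition.
    static double jaccard(String a, String b) {
        Set<Character> sa = chars(a);
        Set<Character> sb = chars(b);
        if (sa.isEmpty() && sb.isEmpty()) return 1.0;
        Set<Character> intersection = new HashSet<>(sa);
        intersection.retainAll(sb);
        Set<Character> union = new HashSet<>(sa);
        union.addAll(sb);
        return (double) intersection.size() / union.size();
    }

    private static Set<Character> chars(String s) {
        Set<Character> set = new HashSet<>();
        for (char c : s.toCharArray()) set.add(c);
        return set;
    }

    // Levenshtein distance: minimum number of single-character insertions,
    // deletions, and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Jaccard is cheap but blind to character order, while edit distance respects order at quadratic cost, which is one reason systems often use a cheap measure to pick candidates and a costlier one to rank them.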

Trie
• The Trie is a very useful
data structure for working
with strings
• Find common
subsequences
• Auto-suggest, others

• Ternary Search Trie
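
A small sketch shows why a trie suits auto-suggest: walking the prefix takes time proportional to the prefix length, and everything below that node is a completion. This is a plain map-based trie rather than the ternary search trie mentioned above, and the class name TrieSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TrieSketch {
    private static final class Node {
        // TreeMap keeps children sorted so suggestions come back in alphabetical order.
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    // Returns every inserted word that starts with the given prefix.
    List<String> suggest(String prefix) {
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return new ArrayList<>();
        }
        List<String> results = new ArrayList<>();
        collect(node, new StringBuilder(prefix), results);
        return results;
    }

    private static void collect(Node node, StringBuilder current, List<String> results) {
        if (node.isWord) results.add(current.toString());
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            current.append(e.getKey());
            collect(e.getValue(), current, results);
            current.setLength(current.length() - 1);
        }
    }
}
```

After inserting "tame", "taming", "text", and "train", suggest("ta") returns "tame" and "taming".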

Much Harder Problems
• Chapter 9
• Semantics, Pragmatics and beyond
• Sentiment Analysis
• Document and collection summarization
• Relationship Extraction
• Cross-language Search
• Importance

Thank You!

• 3 copies of Taming Text

Resources
• http://www.manning.com/ingersoll
– http://github.com/tamingtext/book
• http://www.tamingtext.com
• @tamingtext
• Me:
– @gsingers
– grant@lucidworks.com
