Taming Text

Grant Ingersoll
CTO, LucidWorks
@tamingtext, @gsingers

About the Book
• Goal: An engineer’s guide to search and Natural
Language Processing (NLP) and Machine Learning
• Target Audience: You
• All examples in Java, but concepts easily ported
• Covers:
– Search, Fuzzy string matching, human language basics,
clustering, classification, Question Answering, Intro to
advanced topics

Answer Me This!
• What is trimethylbenzene?
– http://localhost:8983/solr/answer?q=What+is+trimethylbenzene%3F&defType=qa&qa=true&qa.qf=body

• who is ten minute warning?
– http://localhost:8983/solr/answer?q=who+is+ten+minute+warning%3F&defType=qa&qa=true&qa.qf=body

• what station serves the A train?
– http://localhost:8983/solr/answer?q=what+station+serves+the+A+train%3F&defType=qa&qa=true&qa.qf=body

What does it take to build this system?
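As a small first piece, the request URLs above can be assembled programmatically. The sketch below is only an illustration: the endpoint and the defType, qa, and qa.qf parameters are copied from the slide, while the class and method names are hypothetical and not part of the book's code.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds a question URL for the QA handler shown on the slide.
final class QaUrlBuilder {
    // Endpoint taken from the slide; adjust host and port for your own Solr instance.
    static final String BASE = "http://localhost:8983/solr/answer";

    static String build(String question) {
        // Form encoding: spaces become '+', and '?' becomes %3F.
        String q = URLEncoder.encode(question, StandardCharsets.UTF_8);
        return BASE + "?q=" + q + "&defType=qa&qa=true&qa.qf=body";
    }

    public static void main(String[] args) {
        System.out.println(build("What is trimethylbenzene?"));
    }
}
```

Encoding the question rather than concatenating it raw keeps the "?" at the end of the question from being read as part of the URL syntax.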

Agenda
• Question Answering In Detail
– Building Blocks
– Indexing
– Search/Passage Retrieval
– Classification
– Scoring

• Other Interesting Topics
– Clustering
– Fuzzy-Wuzzy Strings
• What’s next?
• Resources

A Grain of Salt
• Text is a strange and magical world filled
with…
– Evil villains
– Jesters
– Wizards
– Unicorns
– Heroes!
• In other words, no system will be perfect

The Ugly Truth
• You will spend most of your time in NLP, search,
etc. doing “grunt” work nicely labeled as:
– Preprocessing
– Feature Selection
– Sampling
– Validation/testing/etc.
– Content extraction
– ETL
• Corollary: Start with simple, tried and true
algorithms, then iterate

Getting Started
• git clone git@github.com:tamingtext/book.git
• See the README for prerequisites
• ./bin contains useful scripts to get started
• You’ll need to download some pretty big
dependencies:
– OpenNLP Models
– WordNet
– Wikipedia subset

What is QA?
• You’ve seen QA in action
already thanks to IBM and
Jeopardy! 

• Instead of providing 10 blue
links, provide the answer!

• Exercises many search and
NLP features
• See Ch. 8

Building Blocks
• Sentence Detection

• Part of Speech Tagging

• Parsing

• Ch. 2
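To give a feel for the first building block, here is a minimal sentence detection sketch. Note the substitution: the book uses Apache OpenNLP's trained models for this step, whereas this example swaps in the JDK's rule-based java.text.BreakIterator so that it runs without downloading any models. The class name SentenceSplitter is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Rule-based stand-in for a trained sentence detector (not the OpenNLP API).
final class SentenceSplitter {
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk the boundaries; each [start, end) span is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }
}
```

A rule-based splitter like this tends to stumble on abbreviations and unusual punctuation, which is one reason a trained model is often preferred for real text.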

QA in Taming Text
• Apache Solr for Passage Retrieval and
integration
• Apache OpenNLP for sentence detection,
parsing, POS tagging and answer type
classification
• Custom code for Query Parsing, Scoring
– See com.tamingtext.qa package
• Wikipedia for “truth”

Demo
• $TT_HOME/bin/start-solr.sh solr-qa
– http://localhost:8983/solr/answer
• Once that is up and running
– $TT_HOME/bin/indexWikipedia.sh --wikiFile ~/projects/manning/maven.tamingtext.com/freebase-wex-2011-01-18-articles-first10k.tsv
• When done, you can ask questions!

Indexing
• Ingest raw data into the system and make it
available for search
• Garbage In, Garbage Out
– Need to spend some time understanding and
modeling your data just like you would with a DB
– Lather, rinse, repeat
• See $TT_HOME/apache-solr/solr-qa/conf/schema.xml for setup
• WikipediaWexIndexer.java for indexing code

Aside: Named Entity Recognition

• NER is the process of extracting proper names, etc.
from text
• Plays a vital role in QA and many other NLP systems
• Often solved using classification approaches

Search/Passage Retrieval
• Custom Query Parser takes in the user’s natural
language query, classifies it to find the Answer
Type, and generates a Solr query
• Retrieve candidate passages that match
keywords and expected answer type
• Unlike keyword search, we need to know
exactly where matches occur
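The point about match positions can be illustrated with a plain Java sketch. This is not how the book's Solr-based code does it; it only shows what "knowing where matches occur" means at the token level. The class name, method name, and the simple \W+ tokenization are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration: report token positions of a query term in a passage.
final class MatchPositions {
    /** Returns the zero-based token indexes where the term occurs, ignoring case. */
    static List<Integer> positions(String passage, String term) {
        // Naive tokenization on runs of non-word characters.
        String[] tokens = passage.toLowerCase(Locale.ROOT).split("\\W+");
        String needle = term.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(needle)) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

With positions in hand, a scorer can look at the words around each hit, which a plain document-level keyword match cannot do.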

Answer Type Classification
• Answer Type examples:
– Person (P), Location (L), Organization (O), Time
Point (T), Duration (R), Money (M)
– See page 248 for more
• Train an OpenNLP classifier off of a set of
previously annotated questions, e.g.:
– P Which French monarch reinstated the divine
right of the monarchy to France and was known as
`The Sun King' because of the splendour of his
reign?
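A sketch of reading one annotated question follows. It assumes, based only on the example above, that each training line is an answer type code, a space, and then the question text; the real training file format may differ. The code-to-name table comes from the slide, and the AnnotatedQuestion class is hypothetical.

```java
import java.util.Map;

// Hypothetical holder for one line of annotated training data.
final class AnnotatedQuestion {
    // Answer type codes as listed on the slide.
    static final Map<String, String> ANSWER_TYPES = Map.of(
            "P", "Person", "L", "Location", "O", "Organization",
            "T", "Time Point", "R", "Duration", "M", "Money");

    final String typeCode;
    final String question;

    AnnotatedQuestion(String typeCode, String question) {
        this.typeCode = typeCode;
        this.question = question;
    }

    /** Splits "<code> <question>" at the first space (assumed format). */
    static AnnotatedQuestion parse(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("Expected '<type> <question>': " + line);
        }
        return new AnnotatedQuestion(
                trimmed.substring(0, space),
                trimmed.substring(space + 1).trim());
    }
}
```

Pairs like these are the input a classifier is trained on: the code is the label and the question text is where the features come from.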

Other Areas of NLP/Machine Learning

Clustering
• Group together content based
on some notion of similarity
• Book covers (ch. 6):
– Search result clustering using
Carrot2
– Whole collection clustering using
Mahout
– Topic Modeling
• Mahout comes with many
different algorithms

Clustering Use Cases
• Google News

• Outlier detection in smart grids

• Recommendations
– Products
– People, etc.

In Focus: K-Means

http://en.wikipedia.org/wiki/K-means_clustering
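For readers who want to see the loop itself, below is a minimal k-means sketch (Lloyd's algorithm) in plain Java. It is not Mahout's implementation: it runs in memory, uses squared Euclidean distance, and takes the starting centroids as an argument so the result is repeatable. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Minimal in-memory k-means (Lloyd's algorithm); an illustration, not Mahout.
final class KMeansSketch {
    /** Returns the cluster index of each point; centroids are updated in place. */
    static int[] cluster(double[][] points, double[][] centroids, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: attach each point to its nearest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < centroids.length; c++) {
                    double d = squaredDistance(points[i], centroids[c]);
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            if (!changed) break; // converged: no point switched cluster
            // Update step: move each centroid to the mean of its points.
            for (int c = 0; c < centroids.length; c++) {
                double[] sum = new double[centroids[c].length];
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] != c) continue;
                    for (int d = 0; d < sum.length; d++) sum[d] += points[i][d];
                    count++;
                }
                if (count == 0) continue; // leave an empty cluster's centroid as is
                for (int d = 0; d < sum.length; d++) centroids[c][d] = sum[d] / count;
            }
        }
        return assignment;
    }

    static double squaredDistance(double[] a, double[] b) {
        double total = 0;
        for (int d = 0; d < a.length; d++) total += (a[d] - b[d]) * (a[d] - b[d]);
        return total;
    }
}
```

For text, the points would be document vectors (for example term weights), and the choice of k and of the starting centroids has a large effect on the clusters you get.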

Fuzzy-Wuzzy Strings

• Fuzzy string matching is a common, and difficult,
problem
• Useful for solving problems like:
– “Did you mean” spell checking
– Auto-suggest
– Record linkage

Common Approaches
• See com.tamingtext.fuzzy package
• Jaccard
– Measure character overlap
• Levenshtein (Edit Distance)
– Count the number of edits required to transform
one word into the other
• Jaro-Winkler
– Account for position
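Two of these measures are short enough to sketch directly. The code below is an independent illustration, not the contents of the com.tamingtext.fuzzy package: Jaccard is computed over the sets of distinct characters in each string, and Levenshtein uses the standard two-row dynamic programming table with unit costs. The class name FuzzySketch is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative implementations of two fuzzy matching measures.
final class FuzzySketch {
    /** Jaccard similarity over distinct characters: |A ∩ B| / |A ∪ B|. */
    static double jaccard(String a, String b) {
        Set<Character> setA = chars(a);
        Set<Character> setB = chars(b);
        if (setA.isEmpty() && setB.isEmpty()) return 1.0; // two empty strings match
        Set<Character> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        Set<Character> union = new HashSet<>(setA);
        union.addAll(setB);
        return (double) intersection.size() / union.size();
    }

    /** Levenshtein distance: minimum insertions, deletions, and substitutions. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // build b from empty
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }

    private static Set<Character> chars(String s) {
        Set<Character> set = new HashSet<>();
        for (char ch : s.toCharArray()) set.add(ch);
        return set;
    }
}
```

Character-set Jaccard ignores order entirely (anagrams score 1.0), which is why position-aware measures such as Levenshtein or Jaro-Winkler are often used alongside it.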

Trie
• The Trie is a very useful
data structure for working
with strings
• Find common
prefixes
• Auto-suggest, others

• Ternary Search Trie
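A minimal trie with prefix-based auto-suggest might look like the sketch below. It is a plain map-based trie, not the ternary search trie mentioned above, and the class and method names are hypothetical. Children are kept in a TreeMap so that suggestions come back in alphabetical order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Map-based trie for prefix lookups (illustration only, not a ternary search trie).
final class TrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord; // true if a stored word ends at this node
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char ch : word.toCharArray()) {
            node = node.children.computeIfAbsent(ch, k -> new Node());
        }
        node.isWord = true;
    }

    /** Returns all stored words that start with the given prefix, sorted. */
    List<String> suggest(String prefix) {
        List<String> results = new ArrayList<>();
        Node node = root;
        for (char ch : prefix.toCharArray()) {
            node = node.children.get(ch);
            if (node == null) return results; // no stored word has this prefix
        }
        collect(node, new StringBuilder(prefix), results);
        return results;
    }

    // Depth-first walk that rebuilds each word from the path taken.
    private void collect(Node node, StringBuilder current, List<String> out) {
        if (node.isWord) out.add(current.toString());
        for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
            current.append(entry.getKey());
            collect(entry.getValue(), current, out);
            current.setLength(current.length() - 1);
        }
    }
}
```

Lookup cost depends on the length of the prefix rather than on the number of stored words, which is what makes the structure attractive for auto-suggest.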

Much Harder Problems
• Chapter 9
• Semantics, Pragmatics and beyond
• Sentiment Analysis
• Document and collection summarization
• Relationship Extraction
• Cross-language Search
• Importance

Thank You!

• 3 copies of Taming Text

Resources
• http://www.manning.com/ingersoll
– http://github.com/tamingtext/book
• http://www.tamingtext.com
• @tamingtext
• Me:
– @gsingers
– grant@lucidworks.com
