Taming Text

    Grant Ingersoll
   CTO, LucidWorks
@tamingtext, @gsingers
About the Book
• Goal: An engineer’s guide to search, Natural
  Language Processing (NLP), and Machine Learning
• Target Audience: You
• All examples in Java, but concepts easily ported
• Covers:
  – Search, Fuzzy string matching, human language basics,
    clustering, classification, Question Answering, Intro to
    advanced topics
Answer Me This!
• What is trimethylbenzene?
  – http://localhost:8983/solr/answer?q=What+is+trimethylbenzene%3F&defType=qa&qa=true&qa.qf=body

• who is ten minute warning?
  – http://localhost:8983/solr/answer?q=who+is+ten+minute+warning%3F&defType=qa&qa=true&qa.qf=body

• what station serves the A train?
  – http://localhost:8983/solr/answer?q=what+station+serves+the+A+train%3F&defType=qa&qa=true&qa.qf=body
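
The request URLs on this slide can also be assembled in code. The following is a minimal sketch, assuming the local Solr instance and the parameters exactly as they appear in the URLs above; the class and method names are hypothetical and not part of the book's code.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class QaUrlSketch {
    // Base URL copied from the slide; adjust host and port for your own setup.
    static final String BASE = "http://localhost:8983/solr/answer";

    // Builds a request URL for one natural language question.
    static String buildUrl(String question) {
        // Form encoding: spaces become '+', and '?' becomes %3F.
        String q = URLEncoder.encode(question, StandardCharsets.UTF_8);
        // The remaining parameters are the ones shown in the slide's URLs.
        return BASE + "?q=" + q + "&defType=qa&qa=true&qa.qf=body";
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("What is trimethylbenzene?"));
    }
}
```

URLEncoder.encode(String, Charset) requires Java 10 or later.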
Fact-based QA Demo
What does it take to build this system?
Agenda
• Question Answering In Detail
   –   Building Blocks
   –   Indexing
   –   Search/Passage Retrieval
   –   Classification
   –   Scoring

• Other Interesting Topics
   – Clustering
   – Fuzzy-Wuzzy Strings
• What’s next?
• Resources
A Grain of Salt
   • Text is a strange and magical world filled
     with…
          – Evil villains
          – Jesters
          – Wizards
          – Unicorns
          – Heroes!
   • In other words, no system will be perfect

http://images1.wikia.nocookie.net/__cb20121110131756/lotr/images/thumb/e/e7/Gandalf_the_Grey.jpg/220px-Gandalf_the_Grey.jpg
The Ugly Truth
• You will spend most of your time in NLP, search,
  etc. doing “grunt” work nicely labeled as:
   –   Preprocessing
   –   Feature Selection
   –   Sampling
   –   Validation/testing/etc.
   –   Content extraction
   –   ETL
• Corollary: Start with simple, tried and true
  algorithms, then iterate
Getting Started
•   git clone git@github.com:tamingtext/book.git
•   See the README for pre-requisites
•   ./bin contains useful scripts to get started
•   You’ll need to download some pretty big
    dependencies:
    – OpenNLP Models
    – WordNet
    – Wikipedia subset
Question Answering (QA)
What is QA?
• You’ve seen QA in action
  already thanks to IBM and
  Jeopardy! 

• Instead of providing 10 blue
  links, provide the answer!

• Exercises many search and
  NLP features
• See Ch. 8
Simple QA Workflow
Building Blocks
• Sentence Detection

• Part of Speech Tagging

• Parsing

• Ch. 2
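
To give a feel for the first building block, here is a minimal sentence detection sketch. It uses the JDK's rule-based java.text.BreakIterator as a stand-in so that it runs without downloading models; it is not the OpenNLP sentence detector used in the book, and the class name is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSketch {
    // Splits text into sentences using the JDK's built-in boundary rules.
    // Assumption: English text; a trained model would usually handle
    // abbreviations and unusual punctuation better.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : sentences("This is one. This is two.")) {
            System.out.println(s);
        }
    }
}
```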
QA in Taming Text
• Apache Solr for Passage Retrieval and
  integration
• Apache OpenNLP for sentence detection,
  parsing, POS tagging and answer type
  classification
• Custom code for Query Parsing, Scoring
  – See com.tamingtext.qa package
• Wikipedia for “truth”
Demo
• $TT_HOME/bin/start-solr.sh solr-qa
  – http://localhost:8983/solr/answer
• Once that is up and running
  – $TT_HOME/bin/indexWikipedia.sh --wikiFile ~/projects/manning/maven.tamingtext.com/freebase-wex-2011-01-18-articles-first10k.tsv
• When done, you can ask questions!
Indexing
• Ingest raw data into the system and make it
  available for search
• Garbage In, Garbage Out
  – Need to spend some time understanding and
    modeling your data just like you would with a DB
  – Lather, rinse, repeat
• See $TT_HOME/apache-solr/solr-qa/conf/schema.xml for setup
• WikipediaWexIndexer.java for indexing code
Aside: Named Entity Recognition




• NER is the process of extracting proper names, etc.
  from text
• Plays a vital role in QA and many other NLP systems
• Often solved using classification approaches
Search/Passage Retrieval
• Custom Query Parser takes in user’s natural
  language query, classifies it to find the Answer
  Type and generates Solr query
• Retrieve candidate passages that match
  keywords and expected answer type
• Unlike keyword search, we need to know
  exactly where matches occur
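
The last point is the key difference from ordinary keyword search: the position of each match matters, not just whether a document matched. A minimal sketch of that idea in plain Java follows. It is not the Solr mechanism used by the book's query parser; the class name and the whitespace/punctuation tokenization are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class MatchPositionsSketch {
    // Returns the 0-based token positions in the passage where a query
    // keyword occurs. Keywords are expected in lower case.
    static List<Integer> matchPositions(String passage, Set<String> keywords) {
        // Naive tokenization: lower-case, then split on runs of non-word characters.
        String[] tokens = passage.toLowerCase(Locale.ROOT).split("\\W+");
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (keywords.contains(tokens[i])) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String passage = "The A train stops at Inwood station";
        System.out.println(matchPositions(passage, Set.of("train", "station")));
    }
}
```

With positions in hand, a scorer can look at the words around each match, for example to check whether a candidate of the expected answer type appears close to the keywords.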
Answer Type Classification
• Answer Type examples:
  – Person (P), Location (L), Organization (O), Time
    Point (T), Duration (R), Money (M)
  – See page 248 for more
• Train an OpenNLP classifier on a set of
  previously annotated questions, e.g.:
  – P Which French monarch reinstated the divine
    right of the monarchy to France and was known as
    `The Sun King' because of the splendour of his
    reign?
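
The annotated example above pairs an answer type code with a question. Below is a minimal sketch of reading one such line, assuming each line is a single code, a space, and then the question text, as the example suggests; the real training file format may differ. The class name is hypothetical, and the code table only covers the types listed on this slide.

```java
import java.util.Map;

class AnswerTypeLineSketch {
    // Answer type codes as listed on the slide (not the full set from the book).
    static final Map<String, String> TYPES = Map.of(
            "P", "Person",
            "L", "Location",
            "O", "Organization",
            "T", "Time Point",
            "R", "Duration",
            "M", "Money");

    // Splits "<code> <question>" into a two-element array {code, question}.
    static String[] parse(String line) {
        int space = line.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("Expected '<code> <question>': " + line);
        }
        return new String[] { line.substring(0, space), line.substring(space + 1).trim() };
    }

    public static void main(String[] args) {
        String[] parsed = parse("P Which French monarch was known as `The Sun King'?");
        System.out.println(TYPES.get(parsed[0]) + " <- " + parsed[1]);
    }
}
```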
Scoring
Other Areas of NLP/Machine Learning
Clustering
• Group together content based
  on some notion of similarity
• Book covers (ch. 6):
  – Search result clustering using
    Carrot2
  – Whole collection clustering using
    Mahout
  – Topic Modeling
• Mahout comes with many
  different algorithms
Clustering Use Cases
• Google News

• Outlier detection in smart grids

• Recommendations
  – Products
  – People, etc.
In Focus: K-Means




http://en.wikipedia.org/wiki/K-means_clustering
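
K-Means alternates two steps until nothing changes: assign every point to its nearest centroid, then move each centroid to the mean of its assigned points. Here is a minimal sketch in plain Java, not the Mahout implementation the book uses; the class name, the caller-supplied initial centroids, and the sample data are illustrative assumptions.

```java
import java.util.Arrays;

class KMeansSketch {
    // Runs Lloyd's algorithm and returns the cluster index of each point.
    // 'centroids' holds the initial centers and is updated in place.
    static int[] cluster(double[][] points, double[][] centroids, int maxIterations) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        int dim = points[0].length;
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: nearest centroid by squared Euclidean distance.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < centroids.length; c++) {
                    double d = squaredDistance(points[i], centroids[c]);
                    if (d < bestDist) {
                        bestDist = d;
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched clusters
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[centroids.length][dim];
            int[] counts = new int[centroids.length];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] == 0) {
                    continue; // leave an empty cluster's centroid where it is
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] points = { {1, 1}, {1.5, 2}, {8, 8}, {9, 9.5} };
        double[][] centroids = { {0, 0}, {10, 10} };
        System.out.println(Arrays.toString(cluster(points, centroids, 10)));
    }
}
```

For text, the points would typically be document term vectors rather than 2-D coordinates, and the choice of initial centroids and distance measure affects the result.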
Fuzzy-Wuzzy Strings




• Fuzzy string matching is a common, and difficult,
  problem
• Useful for solving problems like:
  – “Did you mean” spell checking
  – Auto-suggest
  – Record linkage
Common Approaches
• See com.tamingtext.fuzzy package
• Jaccard
  – Measure character overlap
• Levenshtein (Edit Distance)
  – Count the number of edits required to transform
    one word into the other
• Jaro-Winkler
  – Account for position
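
Two of the measures above are short enough to sketch directly. The following is a minimal illustration of Jaccard over character sets and Levenshtein edit distance; it is not the code from the com.tamingtext.fuzzy package, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class FuzzySketch {
    // Jaccard similarity over the sets of characters in each string:
    // |intersection| / |union|, ranging from 0.0 to 1.0.
    static double jaccard(String a, String b) {
        Set<Character> sa = chars(a);
        Set<Character> sb = chars(b);
        if (sa.isEmpty() && sb.isEmpty()) {
            return 1.0; // assumption: two empty strings count as identical
        }
        Set<Character> intersection = new HashSet<>(sa);
        intersection.retainAll(sb);
        Set<Character> union = new HashSet<>(sa);
        union.addAll(sb);
        return (double) intersection.size() / union.size();
    }

    private static Set<Character> chars(String s) {
        Set<Character> set = new HashSet<>();
        for (char c : s.toCharArray()) {
            set.add(c);
        }
        return set;
    }

    // Levenshtein distance: minimum number of single-character insertions,
    // deletions, and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(jaccard("abc", "bcd"));            // 0.5
        System.out.println(levenshtein("kitten", "sitting")); // 3
    }
}
```

Character-set Jaccard ignores order and repetition, which is why position-aware measures such as Jaro-Winkler exist.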
Trie
• The Trie is a very useful
  data structure for working
  with strings
• Find common
  subsequences
• Auto-suggest, others

• Ternary Search Trie
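
A minimal sketch of a trie used for auto-suggest follows. This is a plain trie with one child map per node, not the ternary search trie named above, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TrieSketch {
    private static final class Node {
        // TreeMap keeps children sorted, so suggestions come back in order.
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    // Adds a word, creating one node per character as needed.
    void insert(String word) {
        Node node = root;
        for (char ch : word.toCharArray()) {
            node = node.children.computeIfAbsent(ch, k -> new Node());
        }
        node.isWord = true;
    }

    // Returns every inserted word that starts with the given prefix.
    List<String> suggest(String prefix) {
        Node node = root;
        for (char ch : prefix.toCharArray()) {
            node = node.children.get(ch);
            if (node == null) {
                return new ArrayList<>(); // no word has this prefix
            }
        }
        List<String> out = new ArrayList<>();
        collect(node, new StringBuilder(prefix), out);
        return out;
    }

    // Depth-first walk below a node, emitting each complete word.
    private void collect(Node node, StringBuilder current, List<String> out) {
        if (node.isWord) {
            out.add(current.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            current.append(e.getKey());
            collect(e.getValue(), current, out);
            current.deleteCharAt(current.length() - 1);
        }
    }

    public static void main(String[] args) {
        TrieSketch trie = new TrieSketch();
        for (String w : new String[] { "solr", "solar", "search", "trie" }) {
            trie.insert(w);
        }
        System.out.println(trie.suggest("so")); // [solar, solr]
    }
}
```

Words that share a prefix share nodes, so a prefix lookup costs one step per character of the prefix regardless of how many words are stored.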
What’s Next?
Much Harder Problems
•   Chapter 9
•   Semantics, Pragmatics and beyond
•   Sentiment Analysis
•   Document and collection summarization
•   Relationship Extraction
•   Cross-language Search
•   Importance
Thank You!


• 3 copies of Taming Text
Resources
• http://www.manning.com/ingersoll
  – http://github.com/tamingtext/book
• http://www.tamingtext.com
• @tamingtext
• Me:
  – @gsingers
  – grant@lucidworks.com
