Taming Text

Presentation from March 18th, 2013 Triangle Java User Group on Taming Text. Presentation covers search, question answering, clustering, classification, named entity recognition, etc. See http://www.manning.com/ingersoll for more.

Published in: Technology

Transcript

  • 1. Taming Text
      • Grant Ingersoll, CTO, LucidWorks
      • @tamingtext, @gsingers
  • 2. About the Book
      • Goal: An engineer’s guide to search, Natural Language Processing (NLP), and Machine Learning
      • Target Audience: You
      • All examples in Java, but concepts easily ported
      • Covers:
          – Search, fuzzy string matching, human language basics, clustering, classification, Question Answering, intro to advanced topics
  • 3. Answer Me This!
      • What is trimethylbenzene?
          – http://localhost:8983/solr/answer?q=What+is+trimethylbenzene%3F&defType=qa&qa=true&qa.qf=body
      • who is ten minute warning?
          – http://localhost:8983/solr/answer?q=who+is+ten+minute+warning%3F&defType=qa&qa=true&qa.qf=body
      • what station serves the A train?
          – http://localhost:8983/solr/answer?q=what+station+serves+the+A+train%3F&defType=qa&qa=true&qa.qf=body
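The demo URLs on slide 3 all follow one pattern: the question is URL-encoded into `q`, and the remaining parameters (`defType=qa`, `qa=true`, `qa.qf=body`) stay fixed. A minimal sketch of building such a URL, assuming Java 10 or later for the `Charset` overload of `URLEncoder.encode`; the class and method names are hypothetical and not taken from the book's code:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Taming Text code base.
final class QaUrlSketch {
    // Host, port, and path are copied from the slide's demo URLs.
    private static final String BASE = "http://localhost:8983/solr/answer";

    // Form-encodes the question (spaces become '+', '?' becomes %3F)
    // and appends the fixed parameters shown on the slide.
    static String build(String question) {
        String q = URLEncoder.encode(question, StandardCharsets.UTF_8);
        return BASE + "?q=" + q + "&defType=qa&qa=true&qa.qf=body";
    }

    public static void main(String[] args) {
        System.out.println(build("What is trimethylbenzene?"));
    }
}
```

The returned string can be pasted into a browser once the demo Solr instance from slide 15 is running.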
  • 4. Fact-based QA Demo
  • 5. What does it take to build this system?
  • 6. Agenda
      • Question Answering In Detail
          – Building Blocks
          – Indexing
          – Search/Passage Retrieval
          – Classification
          – Scoring
      • Other Interesting Topics
          – Clustering
          – Fuzzy-Wuzzy Strings
      • What’s next?
      • Resources
  • 7. A Grain of Salt
      • Text is a strange and magical world filled with…
          – Evil villains
          – Jesters
          – Wizards
          – Unicorns
          – Heroes!
      • In other words, no system will be perfect
      • Image: http://images1.wikia.nocookie.net/__cb20121110131756/lotr/images/thumb/e/e7/Gandalf_the_Grey.jpg/220px-Gandalf_the_Grey.jpg
  • 8. The Ugly Truth
      • You will spend most of your time in NLP, search, etc. doing “grunt” work nicely labeled as:
          – Preprocessing
          – Feature Selection
          – Sampling
          – Validation/testing/etc.
          – Content extraction
          – ETL
      • Corollary: Start with simple, tried and true algorithms, then iterate
  • 9. Getting Started
      • git clone git@github.com:tamingtext/book.git
      • See the README for prerequisites
      • ./bin contains useful scripts to get started
      • You’ll need to download some pretty big dependencies:
          – OpenNLP Models
          – WordNet
          – Wikipedia subset
  • 10. Question Answering (QA)
  • 11. What is QA?
      • You’ve seen QA in action already thanks to IBM and Jeopardy!
      • Instead of providing 10 blue links, provide the answer!
      • Exercises many search and NLP features
      • See Ch. 8
  • 12. Simple QA Workflow
  • 13. Building Blocks
      • Sentence Detection
      • Part of Speech Tagging
      • Parsing
      • Ch. 2
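Sentence detection is the first building block on slide 13. The book uses Apache OpenNLP for this step (see slide 14), which needs a downloaded model. The sketch below swaps in the JDK's rule-based `java.text.BreakIterator` purely to illustrate what the step produces; it is not the book's approach, and the class name is hypothetical:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-in for a model-based sentence detector: uses the JDK's
// rule-based BreakIterator, not OpenNLP.
final class SentenceSplitSketch {
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        // Walk consecutive sentence boundaries until the iterator is exhausted.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : sentences("Text is strange. No system will be perfect.")) {
            System.out.println(s);
        }
    }
}
```

A rule-based splitter like this tends to stumble on abbreviations and unusual punctuation, which is one reason trained models are commonly preferred for real text.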
  • 14. QA in Taming Text
      • Apache Solr for Passage Retrieval and integration
      • Apache OpenNLP for sentence detection, parsing, POS tagging and answer type classification
      • Custom code for Query Parsing, Scoring
          – See com.tamingtext.qa package
      • Wikipedia for “truth”
  • 15. Demo
      • $TT_HOME/bin/start-solr.sh solr-qa
          – http://localhost:8983/solr/answer
      • Once that is up and running
          – $TT_HOME/bin/indexWikipedia.sh --wikiFile ~/projects/manning/maven.tamingtext.com/freebase-wex-2011-01-18-articles-first10k.tsv
      • When done, you can ask questions!
  • 16. Indexing
      • Ingest raw data into the system and make it available for search
      • Garbage In, Garbage Out
          – Need to spend some time understanding and modeling your data just like you would with a DB
          – Lather, rinse, repeat
      • See $TT_HOME/apache-solr/solr-qa/conf/schema.xml for setup
      • WikipediaWexIndexer.java for indexing code
  • 17. Aside: Named Entity Recognition
      • NER is the process of extracting proper names, etc. from text
      • Plays a vital role in QA and many other NLP systems
      • Often solved using classification approaches
  • 18.
      • Custom Query Parser takes in the user’s natural language query, classifies it to find the Answer Type, and generates a Solr query
      • Retrieve candidate passages that match keywords and expected answer type
      • Unlike keyword search, we need to know exactly where matches occur
  • 19. Answer Type Classification
      • Answer Type examples:
          – Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M)
          – See page 248 for more
      • Train an OpenNLP classifier off of a set of previously annotated questions, e.g.:
          – P Which French monarch reinstated the divine right of the monarchy to France and was known as 'The Sun King' because of the splendour of his reign?
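The annotated question on slide 19 pairs a one-letter answer type code with the question text. A minimal sketch of reading one such line, assuming the format is exactly "code, whitespace, question" as shown on the slide (the book's actual training files may differ). The class name is hypothetical, and the code-to-name table contains only the six types the slide lists:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for one annotated training question.
final class AnswerTypeExample {
    // Only the codes listed on the slide; the book defines more.
    private static final Map<String, String> TYPE_NAMES = new HashMap<>();
    static {
        TYPE_NAMES.put("P", "Person");
        TYPE_NAMES.put("L", "Location");
        TYPE_NAMES.put("O", "Organization");
        TYPE_NAMES.put("T", "Time Point");
        TYPE_NAMES.put("R", "Duration");
        TYPE_NAMES.put("M", "Money");
    }

    final String label;
    final String question;

    private AnswerTypeExample(String label, String question) {
        this.label = label;
        this.question = question;
    }

    // Splits at the first run of whitespace: label first, question second.
    static AnswerTypeExample parse(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Expected '<label> <question>': " + line);
        }
        return new AnswerTypeExample(parts[0], parts[1]);
    }

    String typeName() {
        return TYPE_NAMES.getOrDefault(label, "Unknown");
    }

    public static void main(String[] args) {
        AnswerTypeExample ex = parse("P Which French monarch reinstated the divine right of the monarchy to France?");
        System.out.println(ex.typeName() + ": " + ex.question);
    }
}
```

Pairs like these (label plus question text) are the kind of input a text classifier is trained on; the classifier itself is out of scope for this sketch.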
  • 20. Scoring
  • 21. Other Areas of NLP/Machine Learning
  • 22. Clustering
      • Group together content based on some notion of similarity
      • Book covers (ch. 6):
          – Search result clustering using Carrot2
          – Whole collection clustering using Mahout
          – Topic Modeling
      • Mahout comes with many different algorithms
  • 23. Clustering Use Cases
      • Google News
      • Outlier detection in smart grids
      • Recommendations
          – Products
          – People, etc.
  • 24. In Focus: K-Means
      • http://en.wikipedia.org/wiki/K-means_clustering
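Slide 24 only names K-Means and links to Wikipedia. As a rough illustration, here is a sketch of the standard Lloyd's iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points). It assumes dense `double[]` vectors and caller-supplied starting centroids; it is not the Mahout implementation the book uses, and the class name is hypothetical:

```java
import java.util.Arrays;

// Plain Lloyd's algorithm on dense vectors; not Mahout's implementation.
final class KMeansSketch {
    static double[][] cluster(double[][] points, double[][] initialCentroids, int maxIterations) {
        int k = initialCentroids.length;
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initialCentroids[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = squaredDistance(points[p], centroids[c]);
                    if (d < bestDist) {
                        bestDist = d;
                        best = c;
                    }
                }
                if (assignment[p] != best) {
                    assignment[p] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched clusters
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // leave an empty cluster's centroid where it is
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return centroids;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}
```

For text, the points would typically be term-weight vectors, and the result depends on the starting centroids, so real systems usually try several initializations.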
  • 25. Fuzzy-Wuzzy Strings
      • Fuzzy string matching is a common, and difficult, problem
      • Useful for solving problems like:
          – “Did you mean” spell checking
          – Auto-suggest
          – Record linkage
  • 26. Common Approaches
      • See com.tamingtext.fuzzy package
      • Jaccard
          – Measure character overlap
      • Levenshtein (Edit Distance)
          – Count the number of edits required to transform one word into the other
      • Jaro-Winkler
          – Account for position
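Two of the measures on slide 26 are short enough to sketch directly. The code below is an illustrative version, not the code in the `com.tamingtext.fuzzy` package: Jaccard is computed over the sets of distinct characters in each string, and Levenshtein uses the usual dynamic-programming recurrence with unit cost for insert, delete, and substitute. The class name is hypothetical:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative implementations; not the book's com.tamingtext.fuzzy code.
final class FuzzySketch {
    // Jaccard coefficient over distinct characters: |A ∩ B| / |A ∪ B|.
    static double jaccard(String a, String b) {
        Set<Character> sa = chars(a);
        Set<Character> sb = chars(b);
        if (sa.isEmpty() && sb.isEmpty()) {
            return 1.0; // convention chosen here: two empty strings are identical
        }
        Set<Character> intersection = new HashSet<>(sa);
        intersection.retainAll(sb);
        Set<Character> union = new HashSet<>(sa);
        union.addAll(sb);
        return (double) intersection.size() / union.size();
    }

    private static Set<Character> chars(String s) {
        Set<Character> set = new HashSet<>();
        for (char c : s.toCharArray()) {
            set.add(c);
        }
        return set;
    }

    // Edit distance, keeping only two rows of the DP table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // 3
        System.out.println(jaccard("night", "nacht"));
    }
}
```

Character-set Jaccard ignores order and repetition, so it is cheap but coarse; edit distance is order-sensitive at the cost of time proportional to the product of the two string lengths.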
  • 27. Trie
      • The Trie is a very useful data structure for working with strings
      • Find common subsequences
      • Auto-suggest, others
      • Ternary Search Trie
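The auto-suggest use case from slide 27 can be sketched with a plain character trie: walk down the prefix, then collect every word stored beneath that node. This is a simplified illustration and not the ternary search trie the slide mentions; the class name is hypothetical, and a `TreeMap` is used for children only so that suggestions come back in alphabetical order:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain character trie for prefix lookup; not a ternary search trie.
final class TrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, key -> new Node());
        }
        node.isWord = true;
    }

    // Returns every stored word that starts with the given prefix.
    List<String> suggest(String prefix) {
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return new ArrayList<>(); // no stored word has this prefix
            }
        }
        List<String> out = new ArrayList<>();
        collect(node, new StringBuilder(prefix), out);
        return out;
    }

    // Depth-first walk that rebuilds each word from the path taken.
    private static void collect(Node node, StringBuilder path, List<String> out) {
        if (node.isWord) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey().charValue());
            collect(e.getValue(), path, out);
            path.setLength(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        TrieSketch trie = new TrieSketch();
        for (String w : new String[] {"taming", "text", "tame", "search"}) {
            trie.insert(w);
        }
        System.out.println(trie.suggest("ta")); // [tame, taming]
    }
}
```

Lookup cost depends on the prefix length and the number of matches rather than on the size of the vocabulary, which is what makes the structure attractive for type-ahead suggestions.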
  • 28. What’s Next?
  • 29. Much Harder Problems
      • Chapter 9
      • Semantics, Pragmatics and beyond
      • Sentiment Analysis
      • Document and collection summarization
      • Relationship Extraction
      • Cross-language Search
      • Importance
  • 30. Thank You!
      • 3 copies of Taming Text
  • 31. Resources
      • http://www.manning.com/ingersoll
          – http://github.com/tamingtext/book
      • http://www.tamingtext.com
      • @tamingtext
      • Me:
          – @gsingers
          – grant@lucidworks.com
