Intelligent Apps with
Apache Lucene, Mahout and
friends
Grant Ingersoll
Lucid Imagination, Inc.
Topics
What is an Intelligent Application?
Examples
I’ve heard of Lucene/Solr, but what else can I use?
Mahout
OpenNLP
Others? UIMA, Weka, Mallet, MinorThird, etc.
Building Blocks
Tying it all together
Lucid Imagination, Inc.
What is an Intelligent Application?
I favor a loose definition
Evolving as techniques get better
General Characteristics:
Embraces fuzziness and uncertainty by:
• Learning from past behavior and adapting
• Leveraging the masses while incorporating the personal
Provide Content Insight
• Organize vast quantities of data into consumable chunks
• Encourage Serendipity
Do what users want even if they don’t know it yet, but don’t turn
them off either
Lucid Imagination, Inc.
Caveats
I’m mostly interested in applications where:
Unstructured text is a component
• i.e. I’m not building a next-gen video game
Users interact via text, clicks, etc.
• Typing in queries
• Browsing links, reading ads/content, etc.
Some of these tools are useful for other applications too
Consider the topics here to be a toolkit, not all apps need all
features
Lucid Imagination, Inc.
Examples
http://www.netflix.com
Amazon
http://www.fancast.com
Yahoo
Apache Open Source Players
Lucene/Solr
http://lucene.apache.org
Mahout
http://mahout.apache.org
UIMA
http://uima.apache.org
Nutch
http://nutch.apache.org
Tika
http://tika.apache.org
Hadoop
http://hadoop.apache.org
ManifoldCF
http://incubator.apache.org/c
onnectors
Lucid Imagination, Inc.
Other Open Source Players
OpenNLP (ASL)
http://opennlp.sourceforge.net
-> Incubator?
Carrot2 (BSD)
http://project.carrot2.org/
MALLET (CPL)
http://mallet.cs.umass.edu/
Weka (GPL)
http://www.cs.waikato.ac.nz/~ml/weka/index.html
Lucid Imagination, Inc.
Aggregating Analysis
User History
Discovery/Guides/Organizatio
n
Language
Analysis
Building Blocks
Content Users
Acquisition
Relationships
Search
Domain
Knowledge
Extraction
User Profile/Model Context
Adaptation
Lucid Imagination, Inc.
Building Blocks: Acquisition and Extraction
Garbage In Garbage Out
Acquisition:
Nutch
Solr Data Import Handler
ManifoldCF
Extraction
Tika (PDFBox, POI, etc.)
Lucid Imagination, Inc.
Building Blocks: Language Analysis
Basics:
Morphology, Tokenization, Stemming/Lemmatization, Language
Detection…
Lucene has extensive support, plus pluggable
Intermediate:
Phrases, Part of Speech, Collocations, Shallow Parsing…
Lucene, Mahout, OpenNLP
Advanced:
Concepts, Sentiment, Relationships, Deep Parsing…
Machine Learning tools like Mahout
Lucid Imagination, Inc.
Building Blocks: Domain Knowledge
You, Your Business, Your Requirements
Focus groups
Examples:
Synonyms, taxonomies
Genre (sublanguage: jargon, abbreviations, etc.)
Content relationships (explicit and implicit links)
Metadata: location, time, authorship, content type
Tools:
Tika, Machine Learning tools like Mahout
Lucid Imagination, Inc.
Building Blocks: Search
Search is often the interface through which users interact
with a system
Doesn’t require explicit typing in of keywords
Sometimes a search need not be a search
Less frequently used capabilities become more important:
Pluggable Query Parsing
Spans/Payloads
Terms, TermVectors
Lucene/Solr can actually stand-in for many of the higher
layers (organizational)
Building Blocks: Organization/Discovery
Organization
Classification
• Named Entity Extraction
Clustering
• Collection
• Search Results
Topic Modeling
Summarization
• Document
• Collection
Discovery/Guidance
Faceting/Clusters
Auto-suggest
Did you mean?
Related Searches
More Like This
Lucid Imagination, Inc.
Building Blocks: Relationships
Harness multilevel relationships
Within documents: phrases/collocations, co-reference resolution,
anaphora, even sentences, paragraphs have relationships
Doc <-> Doc:
• Explicit: links, citations, etc.
• Implicit: shared concepts/topics
User <-> Doc:
• Read/Rated/Reviewed/Shared…
User <-> User
• Explicit: Friend, Colleague, Reports to, friend of friend
• Implicit: email, Instant Msg, asked/answered question
Lucid Imagination, Inc.
Building Blocks: Users
History
Saved Searches -> Deeper analysis -> Alerts
Profile
Likes/Dislikes
Location
Roles
Enhance/Restrict Queries, personalize
scoring/ranking/recommendations
Lucid Imagination, Inc.
Building Blocks: Aggregating Analysis
You’re an Engineer, do you know what’s in your production
logs?
Log analysis
Who, what, when, where, why?
Hadoop, Pig, Mahout etc.
Classification/Clustering
Label/Group users based on their actions
• Power users, new users, etc.
• Mahout and other Machine Learning techniques
Lucid Imagination, Inc.
Adaptation
Automated
Retrain models based on user interactions on a regular basis
Manual
Lessons learned incorporated over time
Tying it Together
Key Extension Points
Analyzer Chain
UpdateProcessor
Request Handler
SearchComponent
Qparser(Plugin)
Event Listeners
Lucid Imagination, Inc.
Example
http://github.com/gsingers/ApacheCon2010
Work-in-Progress Proof of Concept
Wikipedia dataset
http://people.apache.org/~gsingers/wikipedia/enwiki-20070527-
pages-articles.xml.bz2
Index, classify, cluster, recommend
Lucid Imagination, Inc.
Indexing
Document
•Request Handler
Update
Proc. Chain
•Bayes Update
Request
Processor
•UIMA (SOLR-
2129)
Update
Handler
IndexWriter
Analysis
•NameFilter
•Payloads
•Sentence Det.
•Parsing
New
Searcher
Event
•Cluster
Collection
Lucid Imagination, Inc.
Searching
Query
• Request Handler
Query Comp
• QParser (SOLR-
1337)
• Analysis
• Spans
• DocList/Set
• Spatial
Clustering
Comp.
• Carrot2
• Mahout
Suggestions
• Spell Checking
• Auto Suggest
• Related Searches
(SOLR-2080)
Recommendations
• Item-Item
Results
Lucid Imagination, Inc.
Resources
Handles
@gsingers
grant@lucidimagination.com
http://blog.lucidimagination.com
http://lucene.grantingersoll.com
Taming Text by Grant Ingersoll, Thomas Morton and Drew
Farris
http://lucene.li/1c
Code: apachecon2010

Intelligent Apps with Apache Lucene, Mahout and Friends

  • 1.
    Intelligent Apps with ApacheLucene, Mahout and friends Grant Ingersoll
  • 2.
    Lucid Imagination, Inc. Topics Whatis an Intelligent Application? Examples I’ve heard of Lucene/Solr, but what else can I use? Mahout OpenNLP Others? UIMA, Weka, Mallet, MinorThird, etc. Building Blocks Tying it all together
  • 3.
    Lucid Imagination, Inc. Whatis an Intelligent Application? I favor a loose definition Evolving as techniques get better General Characteristics: Embraces fuzziness and uncertainty by: • Learning from past behavior and adapting • Leveraging the masses while incorporating the personal Provide Content Insight • Organize vast quantities of data into consumable chunks • Encourage Serendipity Do what users want even if they don’t know it yet, but don’t turn them off either
  • 4.
    Lucid Imagination, Inc. Caveats I’mmostly interested in applications where: Unstructured text is a component • i.e. I’m not building a next-gen video game Users interact via text, clicks, etc. • Typing in queries • Browsing links, reading ads/content, etc. Some of these tools are useful for other applications too Consider the topics here to be a toolkit, not all apps need all features
  • 5.
  • 6.
    Apache Open SourcePlayers Lucene/Solr http://lucene.apache.org Mahout http://mahout.apache.org UIMA http://uima.apache.org Nutch http://nutch.apache.org Tika http://tika.apache.org Hadoop http://hadoop.apache.org ManifoldCF http://incubator.apache.org/c onnectors
  • 7.
    Lucid Imagination, Inc. OtherOpen Source Players OpenNLP (ASL) http://opennlp.sourceforge.net -> Incubator? Carrot2 (BSD) http://project.carrot2.org/ MALLET (CPL) http://mallet.cs.umass.edu/ Weka (GPL) http://www.cs.waikato.ac.nz/~ml/weka/index.html
  • 8.
    Lucid Imagination, Inc. AggregatingAnalysis User History Discovery/Guides/Organizatio n Language Analysis Building Blocks Content Users Acquisition Relationships Search Domain Knowledge Extraction User Profile/Model Context Adaptation
  • 9.
    Lucid Imagination, Inc. BuildingBlocks: Acquisition and Extraction Garbage In Garbage Out Acquisition: Nutch Solr Data Import Handler ManifoldCF Extraction Tika (PDFBox, POI, etc.)
  • 10.
    Lucid Imagination, Inc. BuildingBlocks: Language Analysis Basics: Morphology, Tokenization, Stemming/Lemmatization, Language Detection… Lucene has extensive support, plus pluggable Intermediate: Phrases, Part of Speech, Collocations, Shallow Parsing… Lucene, Mahout, OpenNLP Advanced: Concepts, Sentiment, Relationships, Deep Parsing… Machine Learning tools like Mahout
  • 11.
    Lucid Imagination, Inc. BuildingBlocks: Domain Knowledge You, Your Business, Your Requirements Focus groups Examples: Synonyms, taxonomies Genre (sublanguage: jargon, abbreviations, etc.) Content relationships (explicit and implicit links) Metadata: location, time, authorship, content type Tools: Tika, Machine Learning tools like Mahout
  • 12.
    Lucid Imagination, Inc. BuildingBlocks: Search Search is often the interface through which users interact with a system Doesn’t require explicit typing in of keywords Sometimes a search need not be a search Less frequently used capabilities become more important: Pluggable Query Parsing Spans/Payloads Terms, TermVectors Lucene/Solr can actually stand-in for many of the higher layers (organizational)
  • 13.
    Building Blocks: Organization/Discovery Organization Classification •Named Entity Extraction Clustering • Collection • Search Results Topic Modeling Summarization • Document • Collection Discovery/Guidance Faceting/Clusters Auto-suggest Did you mean? Related Searches More Like This
  • 14.
    Lucid Imagination, Inc. BuildingBlocks: Relationships Harness multilevel relationships Within documents: phrases/collocations, co-reference resolution, anaphora, even sentences, paragraphs have relationships Doc <-> Doc: • Explicit: links, citations, etc. • Implicit: shared concepts/topics User <-> Doc: • Read/Rated/Reviewed/Shared… User <-> User • Explicit: Friend, Colleague, Reports to, friend of friend • Implicit: email, Instant Msg, asked/answered question
  • 15.
    Lucid Imagination, Inc. BuildingBlocks: Users History Saved Searches -> Deeper analysis -> Alerts Profile Likes/Dislikes Location Roles Enhance/Restrict Queries, personalize scoring/ranking/recommendations
  • 16.
    Lucid Imagination, Inc. BuildingBlocks: Aggregating Analysis You’re an Engineer, do you know what’s in your production logs? Log analysis Who, what, when, where, why? Hadoop, Pig, Mahout etc. Classification/Clustering Label/Group users based on their actions • Power users, new users, etc. • Mahout and other Machine Learning techniques
  • 17.
    Lucid Imagination, Inc. Adaptation Automated Retrainmodels based on user interactions on a regular basis Manual Lessons learned incorporated over time
  • 18.
    Tying it Together KeyExtension Points Analyzer Chain UpdateProcessor Request Handler SearchComponent Qparser(Plugin) Event Listeners
  • 19.
    Lucid Imagination, Inc. Example http://github.com/gsingers/ApacheCon2010 Work-in-ProgressProof of Concept Wikipedia dataset http://people.apache.org/~gsingers/wikipedia/enwiki-20070527- pages-articles.xml.bz2 Index, classify, cluster, recommend
  • 20.
    Lucid Imagination, Inc. Indexing Document •RequestHandler Update Proc. Chain •Bayes Update Request Processor •UIMA (SOLR- 2129) Update Handler IndexWriter Analysis •NameFilter •Payloads •Sentence Det. •Parsing New Searcher Event •Cluster Collection
  • 21.
    Lucid Imagination, Inc. Searching Query •Request Handler Query Comp • QParser (SOLR- 1337) • Analysis • Spans • DocList/Set • Spatial Clustering Comp. • Carrot2 • Mahout Suggestions • Spell Checking • Auto Suggest • Related Searches (SOLR-2080) Recommendations • Item-Item Results
  • 22.

Editor's Notes

  • #4 Do what users expect – Go beyond just UI design Early days of Amazon recommender
  • #6 In fact, the Gmail shows 3 examples
  • #9 On the users side, I won’t go into too much detail about things like history, profile, modeling
  • #10 In my day to day experience, this seemingly mundane task is where you will spend a good amount of time
  • #13 You can build recommenders, classifiers, clustering, etc. on L/S