Your SlideShare is downloading. ×
0
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Intelligent Apps with Apache Lucene, Mahout and Friends

7,583

Published on

Talks about building blocks of intelligent applications using Lucene, Solr, Mahout, OpenNLP, etc.

Talks about building blocks of intelligent applications using Lucene, Solr, Mahout, OpenNLP, etc.

Published in: Technology, Education
0 Comments
9 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
7,583
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
229
Comments
0
Likes
9
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide
  • Do what users expect – Go beyond just UI design

    Early days of Amazon recommender
  • In fact, the Gmail shows 3 examples
  • On the users side, I won’t go into too much detail about things like history, profile, modeling
  • In my day to day experience, this seemingly mundane task is where you will spend a good amount of time
  • You can build recommenders, classifiers, clustering, etc. on L/S
  • Transcript

    • 1. Intelligent Apps with Apache Lucene, Mahout and friends Grant Ingersoll
    • 2. Lucid Imagination, Inc. Topics What is an Intelligent Application? Examples I’ve heard of Lucene/Solr, but what else can I use? Mahout OpenNLP Others? UIMA, Weka, Mallet, MinorThird, etc. Building Blocks Tying it all together
    • 3. Lucid Imagination, Inc. What is an Intelligent Application? I favor a loose definition Evolving as techniques get better General Characteristics: Embraces fuzziness and uncertainty by: • Learning from past behavior and adapting • Leveraging the masses while incorporating the personal Provide Content Insight • Organize vast quantities of data into consumable chunks • Encourage Serendipity Do what users want even if they don’t know it yet, but don’t turn them off either
    • 4. Lucid Imagination, Inc. Caveats I’m mostly interested in applications where: Unstructured text is a component • i.e. I’m not building a next-gen video game Users interact via text, clicks, etc. • Typing in queries • Browsing links, reading ads/content, etc. Some of these tools are useful for other applications too Consider the topics here to be a toolkit, not all apps need all features
    • 5. Lucid Imagination, Inc. Examples http://www.netflix.com Amazon http://www.fancast.com Yahoo
    • 6. Apache Open Source Players Lucene/Solr http://lucene.apache.org Mahout http://mahout.apache.org UIMA http://uima.apache.org Nutch http://nutch.apache.org Tika http://tika.apache.org Hadoop http://hadoop.apache.org ManifoldCF http://incubator.apache.org/c onnectors
    • 7. Lucid Imagination, Inc. Other Open Source Players OpenNLP (ASL) http://opennlp.sourceforge.net -> Incubator? Carrot2 (BSD) http://project.carrot2.org/ MALLET (CPL) http://mallet.cs.umass.edu/ Weka (GPL) http://www.cs.waikato.ac.nz/~ml/weka/index.html
    • 8. Lucid Imagination, Inc. Aggregating Analysis User History Discovery/Guides/Organizatio n Language Analysis Building Blocks Content Users Acquisition Relationships Search Domain Knowledge Extraction User Profile/Model Context Adaptation
    • 9. Lucid Imagination, Inc. Building Blocks: Acquisition and Extraction Garbage In Garbage Out Acquisition: Nutch Solr Data Import Handler ManifoldCF Extraction Tika (PDFBox, POI, etc.)
    • 10. Lucid Imagination, Inc. Building Blocks: Language Analysis Basics: Morphology, Tokenization, Stemming/Lemmatization, Language Detection… Lucene has extensive support, plus pluggable Intermediate: Phrases, Part of Speech, Collocations, Shallow Parsing… Lucene, Mahout, OpenNLP Advanced: Concepts, Sentiment, Relationships, Deep Parsing… Machine Learning tools like Mahout
    • 11. Lucid Imagination, Inc. Building Blocks: Domain Knowledge You, Your Business, Your Requirements Focus groups Examples: Synonyms, taxonomies Genre (sublanguage: jargon, abbreviations, etc.) Content relationships (explicit and implicit links) Metadata: location, time, authorship, content type Tools: Tika, Machine Learning tools like Mahout
    • 12. Lucid Imagination, Inc. Building Blocks: Search Search is often the interface through which users interact with a system Doesn’t require explicit typing in of keywords Sometimes a search need not be a search Less frequently used capabilities become more important: Pluggable Query Parsing Spans/Payloads Terms, TermVectors Lucene/Solr can actually stand-in for many of the higher layers (organizational)
    • 13. Building Blocks: Organization/Discovery Organization Classification • Named Entity Extraction Clustering • Collection • Search Results Topic Modeling Summarization • Document • Collection Discovery/Guidance Faceting/Clusters Auto-suggest Did you mean? Related Searches More Like This
    • 14. Lucid Imagination, Inc. Building Blocks: Relationships Harness multilevel relationships Within documents: phrases/collocations, co-reference resolution, anaphora, even sentences, paragraphs have relationships Doc <-> Doc: • Explicit: links, citations, etc. • Implicit: shared concepts/topics User <-> Doc: • Read/Rated/Reviewed/Shared… User <-> User • Explicit: Friend, Colleague, Reports to, friend of friend • Implicit: email, Instant Msg, asked/answered question
    • 15. Lucid Imagination, Inc. Building Blocks: Users History Saved Searches -> Deeper analysis -> Alerts Profile Likes/Dislikes Location Roles Enhance/Restrict Queries, personalize scoring/ranking/recommendations
    • 16. Lucid Imagination, Inc. Building Blocks: Aggregating Analysis You’re an Engineer, do you know what’s in your production logs? Log analysis Who, what, when, where, why? Hadoop, Pig, Mahout etc. Classification/Clustering Label/Group users based on their actions • Power users, new users, etc. • Mahout and other Machine Learning techniques
    • 17. Lucid Imagination, Inc. Adaptation Automated Retrain models based on user interactions on a regular basis Manual Lessons learned incorporated over time
    • 18. Tying it Together Key Extension Points Analyzer Chain UpdateProcessor Request Handler SearchComponent Qparser(Plugin) Event Listeners
    • 19. Lucid Imagination, Inc. Example http://github.com/gsingers/ApacheCon2010 Work-in-Progress Proof of Concept Wikipedia dataset http://people.apache.org/~gsingers/wikipedia/enwiki-20070527- pages-articles.xml.bz2 Index, classify, cluster, recommend
    • 20. Lucid Imagination, Inc. Indexing Document •Request Handler Update Proc. Chain •Bayes Update Request Processor •UIMA (SOLR- 2129) Update Handler IndexWriter Analysis •NameFilter •Payloads •Sentence Det. •Parsing New Searcher Event •Cluster Collection
    • 21. Lucid Imagination, Inc. Searching Query • Request Handler Query Comp • QParser (SOLR- 1337) • Analysis • Spans • DocList/Set • Spatial Clustering Comp. • Carrot2 • Mahout Suggestions • Spell Checking • Auto Suggest • Related Searches (SOLR-2080) Recommendations • Item-Item Results
    • 22. Lucid Imagination, Inc. Resources Handles @gsingers grant@lucidimagination.com http://blog.lucidimagination.com http://lucene.grantingersoll.com Taming Text by Grant Ingersoll, Thomas Morton and Drew Farris http://lucene.li/1c Code: apachecon2010

    ×