Intelligent Apps with Apache Lucene, Mahout and Friends
Grant Ingersoll
Lucid Imagination, Inc.
Topics
What is an Intelligent Application?
Examples
I’ve heard of Lucene/Solr, but what else can I use?
Mahout
OpenNLP
Others? UIMA, Weka, Mallet, MinorThird, etc.
Building Blocks
Tying it all together
What is an Intelligent Application?
I favor a loose definition
Evolving as techniques get better
General Characteristics:
Embraces fuzziness and uncertainty by:
• Learning from past behavior and adapting
• Leveraging the masses while incorporating the personal
Provide Content Insight
• Organize vast quantities of data into consumable chunks
• Encourage Serendipity
Do what users want even if they don’t know it yet, but don’t turn
them off either
Caveats
I’m mostly interested in applications where:
Unstructured text is a component
• i.e. I’m not building a next-gen video game
Users interact via text, clicks, etc.
• Typing in queries
• Browsing links, reading ads/content, etc.
Some of these tools are useful for other applications too
Consider the topics here a toolkit; not all apps need all
features
Examples
http://www.netflix.com
Amazon
http://www.fancast.com
Yahoo
Apache Open Source Players
Lucene/Solr
http://lucene.apache.org
Mahout
http://mahout.apache.org
UIMA
http://uima.apache.org
Nutch
http://nutch.apache.org
Tika
http://tika.apache.org
Hadoop
http://hadoop.apache.org
ManifoldCF
http://incubator.apache.org/connectors
Other Open Source Players
OpenNLP (ASL)
http://opennlp.sourceforge.net
-> Incubator?
Carrot2 (BSD)
http://project.carrot2.org/
MALLET (CPL)
http://mallet.cs.umass.edu/
Weka (GPL)
http://www.cs.waikato.ac.nz/~ml/weka/index.html
Building Blocks
[Diagram labels: Content, Users, Acquisition, Extraction, Language Analysis, Domain Knowledge, Search, Relationships, Discovery/Guides/Organization, User History, User Profile/Model, Context, Aggregating Analysis, Adaptation]
Building Blocks: Acquisition and Extraction
Garbage in, garbage out
Acquisition:
Nutch
Solr Data Import Handler
ManifoldCF
Extraction
Tika (PDFBox, POI, etc.)
Building Blocks: Language Analysis
Basics:
Morphology, Tokenization, Stemming/Lemmatization, Language
Detection…
Lucene has extensive support and is pluggable
Intermediate:
Phrases, Part of Speech, Collocations, Shallow Parsing…
Lucene, Mahout, OpenNLP
Advanced:
Concepts, Sentiment, Relationships, Deep Parsing…
Machine Learning tools like Mahout
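A minimal sketch of the "Basics" and "Intermediate" levels above: tokenize text, then count adjacent token pairs as collocation candidates. It uses only the Python standard library and is not the Lucene, Mahout, or OpenNLP API; the function names and sample sentences are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bigram_counts(docs):
    """Count adjacent token pairs across documents as collocation candidates."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        # zip pairs each token with its successor, yielding the bigrams.
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Hypothetical sample documents.
docs = [
    "Apache Lucene is a search library",
    "Apache Lucene powers Solr",
    "Solr is a search server",
]
counts = bigram_counts(docs)
```

Pairs that recur, such as ("apache", "lucene") here, are candidates for phrases; a real system would typically weigh raw counts with a statistical test rather than use them directly.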
Building Blocks: Domain Knowledge
You, Your Business, Your Requirements
Focus groups
Examples:
Synonyms, taxonomies
Genre (sublanguage: jargon, abbreviations, etc.)
Content relationships (explicit and implicit links)
Metadata: location, time, authorship, content type
Tools:
Tika, Machine Learning tools like Mahout
Building Blocks: Search
Search is often the interface through which users interact
with a system
Doesn’t require explicitly typing in keywords
Sometimes a search need not be a search
Less frequently used capabilities become more important:
Pluggable Query Parsing
Spans/Payloads
Terms, TermVectors
Lucene/Solr can actually stand in for many of the higher
layers (organizational)
Building Blocks: Organization/Discovery
Organization
Classification
• Named Entity Extraction
Clustering
• Collection
• Search Results
Topic Modeling
Summarization
• Document
• Collection
Discovery/Guidance
Faceting/Clusters
Auto-suggest
Did you mean?
Related Searches
More Like This
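As one illustration of the Discovery/Guidance items, a "Did you mean?" suggestion can be sketched by comparing a query term against the indexed vocabulary. This is a sketch using Python's standard difflib module, not Solr's spell checker; the vocabulary and the 0.8 cutoff are assumptions.

```python
import difflib


def did_you_mean(query_term, vocabulary, cutoff=0.8):
    """Suggest the closest known term, or None if the term is known or nothing is close."""
    if query_term in vocabulary:
        return None
    matches = difflib.get_close_matches(query_term, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical vocabulary; in practice it would come from the index's terms.
vocabulary = ["lucene", "solr", "mahout", "hadoop"]
suggestion = did_you_mean("lucne", vocabulary)
```

Drawing the vocabulary from the index itself means suggestions always lead to terms that return results.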
Building Blocks: Relationships
Harness multilevel relationships
Within documents: phrases/collocations, co-reference resolution,
anaphora, even sentences, paragraphs have relationships
Doc <-> Doc:
• Explicit: links, citations, etc.
• Implicit: shared concepts/topics
User <-> Doc:
• Read/Rated/Reviewed/Shared…
User <-> User
• Explicit: Friend, Colleague, Reports to, friend of friend
• Implicit: email, Instant Msg, asked/answered question
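The implicit Doc <-> Doc relationship (shared concepts/topics) can be approximated, as a rough sketch, by cosine similarity between term-frequency vectors. The code below is standard-library Python with hypothetical sample documents; it stands in for what Lucene TermVectors or Mahout would provide and is not their API.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a term-frequency vector (term -> count) from raw text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents.
doc_a = term_vector("Lucene indexes text for search")
doc_b = term_vector("Solr builds search on Lucene")
doc_c = term_vector("Hadoop runs batch jobs")
```

Here doc_a and doc_b share two of their five terms, so they score higher against each other than either does against doc_c, which shares none.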
Building Blocks: Users
History
Saved Searches -> Deeper analysis -> Alerts
Profile
Likes/Dislikes
Location
Roles
Enhance/Restrict Queries, personalize
scoring/ranking/recommendations
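A minimal sketch of personalized scoring: re-rank search results by boosting documents whose category matches the user's likes. The result tuples, the profile, and the multiplicative boost of 1.5 are all assumptions for illustration, not tuned values or a Solr feature.

```python
def personalize(results, liked_categories, boost=1.5):
    """Re-rank (doc_id, score, category) tuples, boosting liked categories.

    The multiplicative boost is an illustrative assumption.
    """
    rescored = [
        (doc_id, score * boost if category in liked_categories else score)
        for doc_id, score, category in results
    ]
    # Highest adjusted score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical search results and user profile.
results = [("d1", 2.0, "sports"), ("d2", 1.8, "search"), ("d3", 1.0, "search")]
ranked = personalize(results, liked_categories={"search"})
ranked_ids = [doc_id for doc_id, _ in ranked]
```

The same profile could instead restrict results, for example by filtering on role, rather than only adjusting scores.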
Building Blocks: Aggregating Analysis
You’re an engineer: do you know what’s in your production
logs?
Log analysis
Who, what, when, where, why?
Hadoop, Pig, Mahout etc.
Classification/Clustering
Label/Group users based on their actions
• Power users, new users, etc.
• Mahout and other Machine Learning techniques
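A toy sketch of labeling users from logs: count actions per user and apply a threshold. The log line format ("<user> <action> <target>") and the threshold are hypothetical; at scale this would run on Hadoop or Pig, and a learned model from Mahout could replace the fixed rule.

```python
from collections import Counter


def label_users(log_lines, power_threshold=3):
    """Count actions per user and assign a coarse label.

    Assumes each line looks like "<user> <action> <target>" (hypothetical format).
    """
    actions = Counter(line.split()[0] for line in log_lines if line.strip())
    return {
        user: "power user" if count >= power_threshold else "new user"
        for user, count in actions.items()
    }


# Hypothetical log excerpt.
logs = [
    "alice search lucene",
    "alice click doc42",
    "alice search mahout",
    "bob search solr",
]
labels = label_users(logs)
```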
Adaptation
Automated
Retrain models based on user interactions on a regular basis
Manual
Lessons learned incorporated over time
Tying it Together
Key Extension Points
Analyzer Chain
UpdateProcessor
Request Handler
SearchComponent
QParser(Plugin)
Event Listeners
Example
http://github.com/gsingers/ApacheCon2010
Work-in-Progress Proof of Concept
Wikipedia dataset
http://people.apache.org/~gsingers/wikipedia/enwiki-20070527-pages-articles.xml.bz2
Index, classify, cluster, recommend
Indexing
Document
• Request Handler
Update Proc. Chain
• Bayes Update Request Processor
• UIMA (SOLR-2129)
Update Handler
IndexWriter
Analysis
• NameFilter
• Payloads
• Sentence Det.
• Parsing
New Searcher Event
• Cluster Collection
Searching
Query
• Request Handler
Query Comp
• QParser (SOLR-1337)
• Analysis
• Spans
• DocList/Set
• Spatial
Clustering Comp.
• Carrot2
• Mahout
Suggestions
• Spell Checking
• Auto Suggest
• Related Searches (SOLR-2080)
Recommendations
• Item-Item
Results
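The Item-Item recommendation step can be sketched, under simplifying assumptions, as co-occurrence counting: items that often appear together in user histories are recommended to users who have one but not the other. This is plain standard-library Python with hypothetical data, not Mahout's recommender API, and real systems usually use a weighted similarity rather than raw counts.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence(user_items):
    """Count how often each ordered pair of items shares a user's history."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def recommend(user, user_items, top_n=2):
    """Score unseen items by their co-occurrence with the user's items."""
    counts = cooccurrence(user_items)
    seen = set(user_items[user])
    scores = defaultdict(int)
    for item in seen:
        for other, count in counts[item].items():
            if other not in seen:
                scores[other] += count
    # Sort by descending score, breaking ties alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical user histories (user -> items read or rated).
history = {
    "u1": ["lucene", "solr"],
    "u2": ["lucene", "solr", "mahout"],
    "u3": ["lucene", "hadoop"],
    "u4": ["solr", "mahout"],
}
suggestions = recommend("u1", history)
```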
Resources
Handles
@gsingers
grant@lucidimagination.com
http://blog.lucidimagination.com
http://lucene.grantingersoll.com
Taming Text by Grant Ingersoll, Thomas Morton and Drew
Farris
http://lucene.li/1c
Code: apachecon2010

Intelligent Apps with Apache Lucene, Mahout and Friends

  • 1. Intelligent Apps with Apache Lucene, Mahout and friends Grant Ingersoll
  • 2. Lucid Imagination, Inc. Topics What is an Intelligent Application? Examples I’ve heard of Lucene/Solr, but what else can I use? Mahout OpenNLP Others? UIMA, Weka, Mallet, MinorThird, etc. Building Blocks Tying it all together
  • 3. Lucid Imagination, Inc. What is an Intelligent Application? I favor a loose definition Evolving as techniques get better General Characteristics: Embraces fuzziness and uncertainty by: • Learning from past behavior and adapting • Leveraging the masses while incorporating the personal Provide Content Insight • Organize vast quantities of data into consumable chunks • Encourage Serendipity Do what users want even if they don’t know it yet, but don’t turn them off either
  • 4. Lucid Imagination, Inc. Caveats I’m mostly interested in applications where: Unstructured text is a component • i.e. I’m not building a next-gen video game Users interact via text, clicks, etc. • Typing in queries • Browsing links, reading ads/content, etc. Some of these tools are useful for other applications too Consider the topics here to be a toolkit, not all apps need all features
  • 6. Apache Open Source Players Lucene/Solr http://lucene.apache.org Mahout http://mahout.apache.org UIMA http://uima.apache.org Nutch http://nutch.apache.org Tika http://tika.apache.org Hadoop http://hadoop.apache.org ManifoldCF http://incubator.apache.org/c onnectors
  • 7. Lucid Imagination, Inc. Other Open Source Players OpenNLP (ASL) http://opennlp.sourceforge.net -> Incubator? Carrot2 (BSD) http://project.carrot2.org/ MALLET (CPL) http://mallet.cs.umass.edu/ Weka (GPL) http://www.cs.waikato.ac.nz/~ml/weka/index.html
  • 8. Lucid Imagination, Inc. Aggregating Analysis User History Discovery/Guides/Organizatio n Language Analysis Building Blocks Content Users Acquisition Relationships Search Domain Knowledge Extraction User Profile/Model Context Adaptation
  • 9. Lucid Imagination, Inc. Building Blocks: Acquisition and Extraction Garbage In Garbage Out Acquisition: Nutch Solr Data Import Handler ManifoldCF Extraction Tika (PDFBox, POI, etc.)
  • 10. Building Blocks: Language Analysis. Basics: Morphology, Tokenization, Stemming/Lemmatization, Language Detection… (Lucene has extensive support and is pluggable). Intermediate: Phrases, Part of Speech, Collocations, Shallow Parsing… (Lucene, Mahout, OpenNLP). Advanced: Concepts, Sentiment, Relationships, Deep Parsing… (Machine Learning tools like Mahout)
  • 11. Building Blocks: Domain Knowledge. You, Your Business, Your Requirements. Focus groups. Examples: Synonyms, taxonomies; Genre (sublanguage: jargon, abbreviations, etc.); Content relationships (explicit and implicit links); Metadata: location, time, authorship, content type. Tools: Tika, Machine Learning tools like Mahout
  • 12. Building Blocks: Search. Search is often the interface through which users interact with a system. It doesn’t require explicit typing of keywords; sometimes a search need not be a search. Less frequently used capabilities become more important: Pluggable Query Parsing; Spans/Payloads; Terms, TermVectors. Lucene/Solr can actually stand in for many of the higher (organizational) layers.
  • 13. Building Blocks: Organization/Discovery. Organization: Classification (Named Entity Extraction); Clustering (Collection, Search Results); Topic Modeling; Summarization (Document, Collection). Discovery/Guidance: Faceting/Clusters; Auto-suggest; Did you mean?; Related Searches; More Like This
  • 14. Building Blocks: Relationships. Harness multilevel relationships. Within documents: phrases/collocations, co-reference resolution, anaphora; even sentences and paragraphs have relationships. Doc <-> Doc: • Explicit: links, citations, etc. • Implicit: shared concepts/topics. User <-> Doc: • Read/Rated/Reviewed/Shared… User <-> User: • Explicit: Friend, Colleague, Reports to, friend of friend • Implicit: email, Instant Msg, asked/answered question
  • 15. Building Blocks: Users. History: Saved Searches -> Deeper analysis -> Alerts. Profile: Likes/Dislikes, Location, Roles. Enhance/Restrict Queries, personalize scoring/ranking/recommendations
  • 16. Building Blocks: Aggregating Analysis. You’re an Engineer, do you know what’s in your production logs? Log analysis: Who, what, when, where, why? Hadoop, Pig, Mahout, etc. Classification/Clustering: Label/Group users based on their actions • Power users, new users, etc. • Mahout and other Machine Learning techniques
  • 17. Adaptation. Automated: Retrain models based on user interactions on a regular basis. Manual: Lessons learned incorporated over time
  • 18. Tying it Together. Key Extension Points: Analyzer Chain; UpdateProcessor; Request Handler; SearchComponent; QParser(Plugin); Event Listeners
  • 19. Example: http://github.com/gsingers/ApacheCon2010 (Work-in-Progress, Proof of Concept). Wikipedia dataset: http://people.apache.org/~gsingers/wikipedia/enwiki-20070527-pages-articles.xml.bz2 Index, classify, cluster, recommend
  • 20. Indexing (diagram). Document: Request Handler. Update Proc. Chain: Bayes Update Request Processor, UIMA (SOLR-2129). Update Handler. IndexWriter. Analysis: NameFilter, Payloads, Sentence Det., Parsing. New Searcher Event: Cluster Collection
  • 21. Searching (diagram). Query: Request Handler. Query Comp: QParser (SOLR-1337), Analysis, Spans, DocList/Set, Spatial. Clustering Comp.: Carrot2, Mahout. Suggestions: Spell Checking, Auto Suggest, Related Searches (SOLR-2080). Recommendations: Item-Item. Results
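
The "Recommendations: Item-Item" box on the searching slide can be illustrated with a small, self-contained sketch. This is not Mahout's implementation or the code from the talk's GitHub project; it is plain Python over made-up interaction data, and the names `cooccurrence_counts`, `recommend`, and `HISTORY` are hypothetical. It assumes the simplest possible item-item signal: two items are related if the same user interacted with both.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(user_items):
    """Count how often each pair of items appears in the same user's history."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        # De-duplicate so repeated interactions by one user count once.
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user, user_items, counts, top_n=3):
    """Score unseen items by their co-occurrence with items the user has seen."""
    seen = set(user_items.get(user, []))
    scores = defaultdict(int)
    for item in seen:
        for other, count in counts[item].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


# Hypothetical interaction data (user -> items read/rated/shared).
HISTORY = {
    "alice": ["lucene", "solr", "mahout"],
    "bob": ["lucene", "solr"],
    "carol": ["solr", "tika"],
    "dave": ["lucene"],
}
COUNTS = cooccurrence_counts(HISTORY)

print(recommend("dave", HISTORY, COUNTS))
```

Raw co-occurrence favors popular items; a production recommender would normally swap in a normalized similarity measure, which is the kind of choice a library such as Mahout exposes.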

Editor's Notes

  1. Do what users expect – go beyond just UI design. Early days of Amazon recommender
  2. In fact, Gmail shows 3 examples
  3. On the users side, I won’t go into too much detail about things like history, profile, modeling
  4. In my day to day experience, this seemingly mundane task is where you will spend a good amount of time
  5. You can build recommenders, classifiers, clustering, etc. on Lucene/Solr
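
The last note's point, that term-level data from a search index is enough to build discovery features, can be sketched in the spirit of the "More Like This" item on slide 13. The code below is a hedged illustration in plain Python, not Lucene's MoreLikeThis scoring: it assumes a TF-IDF weighting with cosine similarity, and the documents and the names `tfidf_vectors`, `cosine`, and `more_like_this` are invented for the example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase, split on non-alphanumerics, and count term frequencies."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def tfidf_vectors(docs):
    """Weight each term frequency by a smoothed inverse document frequency."""
    tfs = {doc_id: term_vector(text) for doc_id, text in docs.items()}
    df = Counter(term for tf in tfs.values() for term in tf)
    n = len(docs)
    return {
        doc_id: {t: f * math.log(1 + n / df[t]) for t, f in tf.items()}
        for doc_id, tf in tfs.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def more_like_this(doc_id, vectors, top_n=2):
    """Return ids of the documents most similar to doc_id, best first."""
    target = vectors[doc_id]
    scored = [(cosine(target, v), other)
              for other, v in vectors.items() if other != doc_id]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for score, other in scored[:top_n] if score > 0]


# Hypothetical mini-corpus standing in for indexed documents.
DOCS = {
    "d1": "lucene search index query",
    "d2": "solr search server built on lucene",
    "d3": "mahout clustering and classification",
    "d4": "hadoop distributed processing",
}
VECTORS = tfidf_vectors(DOCS)

print(more_like_this("d1", VECTORS))
```

In a real deployment the term statistics would come from the index itself (for example stored term vectors) rather than being recomputed from raw text.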