Faceted Search

   New York CTO Club
   December 9, 2009



 Daniel Tunkelang, Google
 Otis Gospodnetić, Sematext
Agenda
Daniel:
•   What is faceted search?
•   Why use faceted search?
•   Thoughts about design and user experience.

Otis:
•   What are Lucene and Solr?
•   Why use an open-source search library?
•   Thoughts about implementation.
“Regular” Search
Interface:
•   User expresses information need as short query.
•   Search engine returns ranked, pageable result set.

User happy when...
•   Top-ranked result satisfies information need.
•   At least some result on first page is relevant.

User unhappy when...
•   No result on first page satisfies information need.
•   Results misleadingly appear relevant (bait and switch).
Relevance Is Subjective

Relevance is defined as a measure of
information conveyed by a document relative to
a query.

It is shown that the relationship between the
document and the query, though necessary, is
not sufficient to determine relevance.


William Goffman, On relevance as a measure, 1964.
Regular Search Experience
Assumptions Are Dangerous
tf-idf, PageRank
•   self-awareness
•   self-expression
•   model knows best
•   answer is a document
•   one-shot query
What is Faceted Search?
•   Best understood through examples.
       -   See the following slides.
       -   Or shop on almost any ecommerce site.
•   Facets = multiple ways to organize information.
       -   Often based on available structured information.
       -   But not always, e.g., facets obtained via text mining.
•   Typical interaction:
       -   User starts with a full-text search.
       -   Facets guide query refinement process.
Faceted Search for News
Faceted Search for People
Faceted Search for Breakfast
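
The typical interaction from the "What is Faceted Search?" slide (a full-text search first, then facet-guided refinement) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the catalog, the field names `brand` and `color`, and the naive title matching are all hypothetical, and a real engine would use an index rather than a list scan.

```python
from collections import Counter

# Hypothetical mini-catalog; field names and values are illustrative only.
DOCS = [
    {"title": "red running shoes", "brand": "Acme", "color": "red"},
    {"title": "blue running shoes", "brand": "Acme", "color": "blue"},
    {"title": "red rain boots", "brand": "Bolt", "color": "red"},
]

def search(docs, query):
    """Naive full-text match: keep docs whose title contains every query term."""
    terms = query.lower().split()
    return [d for d in docs if all(t in d["title"].lower().split() for t in terms)]

def facet_counts(results, fields):
    """Count how many results carry each value of each facet field."""
    return {f: Counter(d[f] for d in results if f in d) for f in fields}

def refine(results, field, value):
    """Apply one facet selection to narrow the current result set."""
    return [d for d in results if d.get(field) == value]

results = search(DOCS, "shoes")                  # step 1: full-text search
counts = facet_counts(results, ["brand", "color"])  # step 2: facets for the results
narrowed = refine(results, "color", "red")       # step 3: user picks a facet value
```

The facet counts are computed over the current result set, not the whole catalog, which is what lets them guide refinement.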
But Facets are Not a Silver Bullet...
•   Screen real estate is finite.
       -   Choose facets wisely.
       -   Choose facet values wisely for monster facets.
•   Multiple selection within a facet is powerful, but...
       -   Has to be intuitive, especially AND vs. OR.
       -   Even trickier for hierarchical facets.
•   Search relevance still matters!
       -   Most faceted search applications rank results.
       -   Irrelevant results ==> irrelevant facet refinements.
Exploring Information Science
Deliver Precision and Recall




Easier said than done!

Ranking of facet values is an open research topic.
Be Careful with Faceted Search!



     Cameras have artists?!
Clarify, Then Refine
Take-Aways
•   Faceted search addresses the subjectivity of
    relevance and information overload.
•   But deploying faceted search effectively
    requires that you think about user experience.
•   Recommended reading:
       -   My thin book entitled Faceted Search
       -   Marti Hearst's book on Search User Interfaces
       -   Peter Morville's upcoming book on Search Patterns
Faceted Search with Lucene & Solr




         Otis Gospodnetić, Sematext
What is / isn't Lucene
•   Free, ASL, Java IR library, Jar
•   Doug Cutting, ASF, 2001
•   Application agnostic: Indexing & Searching
•   High performance, scalable
•   No dependencies
•   Heavily ported
•   No: crawler, rich doc parser, turn-key solution
•   No: out-of-the-box faceted search capability... but...
What is/isn't Solr
•   Indexing/Search server with HTTP API built on top of Lucene
•   Fast & scalable (distributed search, index replication)
•   XML, JSON, Ruby, Perl, PHP, javabin
•   No: crawler (but Nutch ==> Solr works)
•   Yes: rich text parser
•   Yes: Faceted Search out of the box!
Solr and Faceted Search
•   3 types of facets: Field Values (text), Dates, Queries.
•   “Text”: return counts for all/top terms in a field
    for a result set - e.g. categories a la Amazon
•   Dates: return counts for docs in specified date ranges
•   Queries: return counts for docs that also match
    a given query - handy for number ranges (think prices!)
Facet Field Requirements
•   Must be indexed
•   Often not tokenized
•   Often not altered (lowercase, punctuation)
•   Storing not required
•   Multivalued fields OK
Turn It On
•   0 facets:
       -   http://host:80/solr/select?q=foo
•   1 facet:
       -   http://host:80/solr/select?q=foo&facet=true&facet.field=category
•   N facets:
       -   http://host:80/solr/select?q=foo&facet=true&facet.field=category&facet.field=inStock
•   facet=true or facet=on
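
A small helper can build these URLs so that `facet.field` is repeated once per field. This is a sketch rather than an official client: the function name `facet_url` is made up for illustration, and `http://host:80/solr/select` is the placeholder host from the slide.

```python
from urllib.parse import urlencode

def facet_url(base, query, facet_fields=()):
    """Build a Solr select URL; facet.field is repeated once per field."""
    params = [("q", query)]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", f) for f in facet_fields)
    # A list of pairs (not a dict) keeps repeated keys and their order.
    return base + "?" + urlencode(params)

BASE = "http://host:80/solr/select"  # placeholder host taken from the slide
url_0 = facet_url(BASE, "foo")
url_n = facet_url(BASE, "foo", ["category", "inStock"])
```

Passing a list of pairs to `urlencode` matters here, because a plain dict cannot hold the same parameter name twice.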
Text Facet Response
<result numFound="4" start="0"/>
<lst name="facet_counts">
<lst name="facet_fields">
 <lst name="category">
     <int name="electronics">3</int>
     <int name="copier">0</int>
 </lst>
 <lst name="inStock">
     <int name="false">3</int>
     <int name="true">1</int>
 </lst>
</lst>
</lst>

•   facet.mincount=1 to avoid 0-count facet values
•   facet.limit=N to limit to top N facet values
•   facet.missing=true to catch uncategorized
•   lots of other options!
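
A response like the one above can be turned into a plain dictionary for the UI layer. The sketch below uses Python's standard `xml.etree.ElementTree`; it assumes the fragment is wrapped in a single `<response>` root element (the slide shows only the inner part), and the function name `parse_field_facets` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, wrapped in an assumed <response> root so it parses.
RESPONSE = """<response>
<result numFound="4" start="0"/>
<lst name="facet_counts">
 <lst name="facet_fields">
  <lst name="category">
   <int name="electronics">3</int>
   <int name="copier">0</int>
  </lst>
  <lst name="inStock">
   <int name="false">3</int>
   <int name="true">1</int>
  </lst>
 </lst>
</lst>
</response>"""

def parse_field_facets(xml_text):
    """Return {field: {value: count}} from the facet_fields section."""
    root = ET.fromstring(xml_text)
    fields = root.find("./lst[@name='facet_counts']/lst[@name='facet_fields']")
    if fields is None:
        return {}  # faceting was not enabled for this request
    return {
        field.get("name"): {v.get("name"): int(v.text) for v in field.findall("int")}
        for field in fields.findall("lst")
    }

facets = parse_field_facets(RESPONSE)
```

Note the `copier` entry with a count of 0: this is the kind of value that `facet.mincount=1` would suppress on the server side.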
Date Facets
•   http://.../solr/select/?q=*:*&rows=0&facet=true&facet.date=timestamp&facet.date.start=NOW/DAY-5DAYS&facet.date.end=NOW/DAY%2B1DAY&facet.date.gap=%2B1DAY
•   (%2B is the URL-encoded +, so %2B1DAY ==> +1DAY)
•   Solr Date Math Parser syntax: /HOUR, +2YEARS, -1DAY,
    /DAY+6MONTHS+3DAYS, +6MONTHS+3DAYS/DAY
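
The `%2B` detail is easy to get wrong by hand, because a raw `+` in a query string decodes to a space. One way to avoid it is to let a URL library do the encoding. The sketch below assumes the placeholder host from the earlier slides and a hypothetical helper name `date_facet_url`; the parameter names are the ones shown on this slide.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def date_facet_url(base, field, start, end, gap):
    """Build a date-facet URL; urlencode escapes '+' as %2B automatically."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),
        ("facet", "true"),
        ("facet.date", field),
        ("facet.date.start", start),
        ("facet.date.end", end),
        ("facet.date.gap", gap),
    ]
    return base + "?" + urlencode(params)

url = date_facet_url(
    "http://host:80/solr/select",  # placeholder host, as on the earlier slides
    "timestamp",
    "NOW/DAY-5DAYS",
    "NOW/DAY+1DAY",
    "+1DAY",
)

# Round-trip check: decoding the query string restores the literal '+'.
decoded = parse_qs(urlsplit(url).query)
```

`urlencode` also escapes `/`, `:` and `*`, so the generated URL looks noisier than the hand-written one while decoding to the same parameter values.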
Date Facet Response
<result name="response" numFound="42" start="0"/>
<lst name="facet_counts">
<lst name="facet_dates">
 <lst name="timestamp">
     <int name="2007-08-11T00:00:00.000Z">1</int>
     <int name="2007-08-12T00:00:00.000Z">5</int>
     <int name="2007-08-13T00:00:00.000Z">3</int>
     <int name="2007-08-14T00:00:00.000Z">7</int>
     <int name="2007-08-15T00:00:00.000Z">2</int>
     <int name="2007-08-16T00:00:00.000Z">16</int>
     <str name="gap">+1DAY</str>
     <date name="end">2007-08-17T00:00:00Z</date>
 </lst>
</lst>
</lst>
Query Facets
•   http://.../solr/select?q=shoes&rows=0&facet=true&facet.field=inStock&facet.query=price:[*+TO+500]&facet.query=price:[500+TO+*]
•   Avoids the bucket-at-index-time work-around
•   Keep queries disjoint (square brackets are inclusive, so a
    price of exactly 500 falls in both ranges above)
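
One way to keep range queries disjoint is to generate them from a list of boundaries instead of typing them by hand. The sketch below assumes integer prices, so that each bucket can end one below the next boundary while using only inclusive square brackets; the function name `price_facet_queries` and the field name default are illustrative.

```python
def price_facet_queries(boundaries, field="price"):
    """Turn increasing integer boundaries into disjoint inclusive ranges.

    Assumes integer-valued prices: each bucket ends at (next boundary - 1),
    so no value can be counted in two buckets.
    """
    if list(boundaries) != sorted(set(boundaries)):
        raise ValueError("boundaries must be strictly increasing")
    queries = []
    lower = "*"
    for b in boundaries:
        queries.append(f"{field}:[{lower} TO {b - 1}]")
        lower = b
    queries.append(f"{field}:[{lower} TO *]")
    return queries

queries = price_facet_queries([100, 500])
# Each entry would be sent as its own facet.query parameter.
params = [("facet.query", q) for q in queries]
```

For non-integer prices this trick does not apply as written, and the buckets would need a different boundary convention.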
Query Facet Response
<result numFound="3" start="0"/>
<lst name="facet_counts">
<lst name="facet_queries">
 <int name="price:[* TO 500]">3</int>
 <int name="price:[500 TO *]">1</int>
</lst>
<lst name="facet_fields">
 <lst name="inStock">
     <int name="false">3</int>
     <int name="true">1</int>
 </lst>
</lst>
</lst>
UI Integration
•   Use Filter Queries via fq
•   http://.../solr/select?q=shoes&facet=true&facet.field=category&fq=price:[0+TO+300]
•   http://.../solr/select?q=shoes&facet=true&facet.field=category&fq=price:[0+TO+300]&fq=inStock:true
•   Important: a single request returns both the filtered results
    and the updated facet counts
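
In a UI, each facet value the user clicks typically becomes one more `fq` parameter on the next request. The following sketch shows that mapping under simple assumptions: selections arrive as `(field, value)` pairs, values are already valid Solr query syntax (no escaping or quoting is attempted), and the helper name `refine_url` and the host are placeholders.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def refine_url(base, query, facet_fields, selections):
    """Build the next request: original query plus one fq per selection.

    selections: list of (field, value) pairs the user has clicked so far.
    Values are assumed to be safe Solr syntax; real code should escape them.
    """
    params = [("q", query), ("facet", "true")]
    params.extend(("facet.field", f) for f in facet_fields)
    params.extend(("fq", f"{field}:{value}") for field, value in selections)
    return base + "?" + urlencode(params)

url = refine_url(
    "http://host:80/solr/select",  # placeholder host
    "shoes",
    ["category"],
    [("price", "[0 TO 300]"), ("inStock", "true")],
)

# Decode the query string to confirm both filters survive encoding.
decoded = parse_qs(urlsplit(url).query)
```

Keeping the user's text query in `q` and the facet selections in `fq` means that removing a selection is just a matter of dropping one pair from the list.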
State of Lucene & Solr
•   Super healthy community, exploding development
•   Lucene 3.0 – 2009-11-25:
       -   Performance, faster range queries, clean API, better
           Unicode support, more non-English support
•   Solr 1.4 – 2009-11-10:
       -   Performance, new replication, DB indexing, rich-doc
           indexing, results clustering, faster response protocol,
           deduplication...
Lucene, Solr, Enterprise
•   Free: Community
       -   Lucene ~600 emails/month (dev: 2000/month)
       -   Solr ~1300 emails/month (dev: 800/month)
•   Commercial: Support Subscriptions
       -   Sematext
       -   Lucid Imagination
