An Introduction to Basics of Search and Relevancy with Apache Solr

The open source Apache Solr search engine provides powerful, versatile search application development technology so you can take full control of your search needs. Solr’s rich interfaces, its convenient server packaging of the underlying Apache Lucene search libraries into web service interfaces, and its near-limitless customizability let you take control of your search. From e-commerce to content management and endless variations in between, Solr is the right tool at the right time to turn the ever-growing volume and variety of data and documents to the advantage of your business.
http://www.lucidimagination.com/blog/2009/12/01/webinar-an-introduction-to-basics-of-search-and-relevancy-with-apache-solr/

Usage Rights

© All Rights Reserved

Presentation Transcript

  • Introduction to basics of Search and Relevancy with Apache Solr FEATURING: Mark Bennett, CTO
  • Agenda • Prerequisites: Browser Tricks • Web “Command Line” • The DisMax Parser • Boosting Formula • Explaining “Explain” • Check Your Index! • Q&A • Resources / About NIE 12/2/2009 Lucid Imagination, Inc. 2
  • Prerequisite: Some Browser Tricks
  • Browsers Matter – install them all!
      • Firefox:
          • Default XML rendering (also some versions of IE)
          • Lots of plugins
      • IE and Safari:
          • Better “Explain” copy & paste: maintains line breaks
          • Better table copy and paste
  • Larger Firefox “Command Line”: customize the Firefox URL box as a command line in 3 easy steps
      1. Toolbar: right-click
      2. Customize… Add New Toolbar
      3. URL bar -> click and drag
  • Turn off Solr HTTP Caching
      • Change in solrconfig.xml
      • Disable HTTP 304 responses in the <httpCaching> section (never304="true")
      • Turn it back on before you deploy!
  • Understanding Solr’s “Web Command Line”
  • The “Web Command Line”
      • CLI concept -> Solr equivalent
          • Command prompt -> URL bar
          • -o or --foo bar -> ? or & and =
          • (spaces) -> +
          • Some punctuation -> %nn
          • Output -> XML or HTML
      • Command line “adapter”: curl
          • Script files can call URLs
          • Not built into Windows – try Cygwin
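The CLI-to-URL mapping on this slide can be tried out in a few lines of Python. This is a minimal sketch using only the standard library; the sample query values are made up for illustration and do not come from the slides.

```python
from urllib.parse import quote_plus, urlencode

# Spaces become "+", and reserved punctuation becomes %nn escapes.
print(quote_plus("apache solr"))     # apache+solr
print(quote_plus('cat:"search"'))    # cat%3A%22search%22

# name=value pairs are joined with "&", much like options on a command line.
params = {"q": "apache solr", "debugQuery": "true"}  # hypothetical values
query_string = urlencode(params)
print(query_string)                  # q=apache+solr&debugQuery=true
```

The resulting string is what goes after the "?" in the URL bar, or in a curl call from a script.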
  • Solr “Command Line”
      • Typical base URL: http://localhost:8983/solr/select?...
      • Basic input (not counting dismax)
          • q = query, fq = filter query
          • df = default field
          • qt = query type (standard / dismax)
      • Controlling output (lots more!)
          • debugQuery = true
          • wt = “what type” (actually “writer type”): standard/XML, xslt (with tr=), javabin, json…
          • fl = *,score (which fields)
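As a rough illustration of how these parameters combine, the sketch below assembles a select URL from them. The helper name `build_select_url` and its defaults are assumptions made for this example, not part of Solr; only the parameter names and the base URL come from the slide.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/select"  # typical base URL from the slide

def build_select_url(q, fq=None, df=None, fl="*,score", wt=None, debug=False):
    """Hypothetical helper: assemble a select URL from the slide's parameters."""
    params = [("q", q)]
    if fq:
        params.append(("fq", fq))      # filter query
    if df:
        params.append(("df", df))      # default field
    if fl:
        params.append(("fl", fl))      # which fields to return
    if wt:
        params.append(("wt", wt))      # writer type, e.g. json
    if debug:
        params.append(("debugQuery", "true"))
    # safe="*," leaves fl=*,score unescaped so the URL stays readable
    return BASE_URL + "?" + urlencode(params, safe="*,")

url = build_select_url("solr", debug=True)
print(url)
```

Pasting the printed URL into the browser is equivalent to typing the parameters by hand.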
  • Example: search for “solr”
      • http://localhost:8983/solr/select?q=solr&debugQuery=true
      • With Firefox you get XML output you can expand and collapse
      • With MSIE* and Safari, not so much (* some versions)
  • Detailed Debug & Explain Output
      http://localhost:8983/solr/select?q=solr&debugQuery=true
      <str name="parsedquery">text:solr</str>
      …
      <lst name="explain">
        <str name="SOLR1000">
          0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
            1.4142135 = tf(termFreq(text:solr)=2)
            3.6026897 = idf(docFreq=1, numDocs=26)
            0.125 = fieldNorm(field=text, doc=13)
        </str>
      </lst>
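The "product of" line in this explain output can be checked by hand. The sketch below multiplies the three factors; it assumes tf is the square root of the term frequency, which the numbers suggest (sqrt(2) ≈ 1.4142135), and it copies the idf value from the output rather than recomputing it.

```python
import math

term_freq = 2        # termFreq(text:solr)=2
idf = 3.6026897      # copied from the explain output, not recomputed here
field_norm = 0.125   # fieldNorm(field=text, doc=13)

tf = math.sqrt(term_freq)             # assumed square-root tf, about 1.4142135
field_weight = tf * idf * field_norm  # fieldWeight = tf * idf * fieldNorm
print(round(field_weight, 7))         # close to the 0.6368716 shown above
```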
  • A look at the DisMax query parser
  • Solr DisMax: Defined
      • What is it?
          • Dis-junction: the query text is matched across multiple fields
          • Max-imum: the best match (score) wins
      • How do you get it?
          • Configured in: solrconfig.xml and schema.xml
          • Called with: qt=dismax
          • Adjusted with: mm, bf, qf, pf, qs, ps, tie
  • Solr DisMax: Pros and Cons
      • General benefits
          • Multiple fields
          • Multiple relevancy rules
          • Great for freshness / popularity
      • Issues to be aware of
          • Tie-in between schema.xml & solrconfig.xml
          • Trouble with some CJK (Chinese, Japanese, Korean)
          • Limited wildcard / field / range support
          • Difficult to customize and debug
          • Trouble with shingles
          • Understand mm!
  • About the “dis” and the “max”
      • Distributed across multiple fields
          • Break up the query into words
          • Each part becomes a field clause
          • Like an OR but with extra credit
      • Takes the maximum of each set
          • Word 1 had the highest score in Title
          • Word 2 is very dense in the doc body
          • Adds in a tie breaker if the word matches in multiple fields
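The "maximum plus tie breaker" idea can be sketched as a small function. This is an illustration of the scoring rule as described on the slide, not Solr's implementation; the function name and the per-field scores are hypothetical.

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score, plus `tie` times the sum of the other field scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others

# Hypothetical per-field scores for one query word
title_score, body_score = 0.8, 0.3
score = dismax_score([title_score, body_score], tie=0.1)
print(score)  # roughly 0.8 + 0.1 * 0.3
```

With tie=0 only the best field counts; with tie=1 the rule degenerates into a plain sum, which is the "OR with extra credit" end of the range.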
  • Coming soon: Extended DisMax
      • Improvements
          • Flexible case Boolean ops: AND/and, OR/or
          • Auto-escape punctuation & -> &, etc.
          • Improved proximity boosting (via word bigrams)
          • Other changes in stop words, relevancy calc, URL arguments
      • How to get it
          • Post 1.4 patch, planned for 1.5
          • Details + patch in JIRA: SOLR-1553 http://issues.apache.org/jira/browse/SOLR-1553
          • TBD: change URL option qt=edismax (or qt=dismax)
  • Boosting Formulas
  • Boost Functions in Dismax
      • High level feature
          • Numeric functions for scoring
          • sum(), product(), sqrt(), log(), etc.
          • Boost on recent dates, user popularity
      • Good combination: reverse-ordinal & reciprocal
          • Position in index: ord(), reverse is: rord()
          • Larger y for smaller x: recip()
      • How to get it
          • URL parameter bf = “boost function”
          • Configured in solrconfig.xml
          • See http://wiki.apache.org/solr/FunctionQuery
  • “Freshness”: Boosting Recent Dates
      • Wiki example: recip( rord(creationDate), 1, 1000, 1000 )
      • Parameters: slope m = 1, numerator a = 1000, intercept c = 1000 (aka "b")
      • Columns: Position = ord(); N-Position = rord(); Linear(x,m,c) = mx+c; recip(x,m,a,c) = a / (mx+c)

      Date        ord()  rord()  mx+c  recip
      1/1/2000        1     120  1120  0.89286
      2/1/2000        2     119  1119  0.89366
      3/1/2000        3     118  1118  0.89445
      …               …       …     …  …
      1/1/2005       61      60  1060  0.94340
      …               …       …     …  …
      1/1/2009      109      12  1012  0.98814
      2/1/2009      110      11  1011  0.98912
      3/1/2009      111      10  1010  0.99010
      4/1/2009      112       9  1009  0.99108
      5/1/2009      113       8  1008  0.99206
      6/1/2009      114       7  1007  0.99305
      7/1/2009      115       6  1006  0.99404
      8/1/2009      116       5  1005  0.99502
      9/1/2009      117       4  1004  0.99602
      10/1/2009     118       3  1003  0.99701
      11/1/2009     119       2  1002  0.99800
      12/1/2009     120       1  1001  0.99900

      • [Chart: recip value by date; y-axis from 0.880 to 1.000]
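The recip column can be reproduced directly from the formula a / (mx+c). The sketch below is a plain Python stand-in for the function, using the wiki example's parameters (m=1, a=1000, c=1000) and three rord values taken from the table.

```python
def recip(x, m, a, c):
    """Reciprocal boost: a / (m*x + c). Smaller x gives a larger result."""
    return a / (m * x + c)

# (date, rord) pairs taken from the table; rord counts back from the newest document
rows = [("1/1/2000", 120), ("1/1/2005", 60), ("12/1/2009", 1)]
boosts = {date: round(recip(rord, 1, 1000, 1000), 5) for date, rord in rows}

for date, boost in boosts.items():
    print(date, boost)
```

Because c is large compared with the rord values, the boost stays in a narrow band just below 1.0, so freshness nudges the ranking rather than dominating it.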
  • Sifting through Solr’s “Explain” output
  • DisMax Example for “solr”
      INPUT:
        http://localhost:8983/solr/select?q=solr&debugQuery=true&qt=dismax
      DEBUG OUTPUT (1 of 2):
        <str name="parsedquery">
          +DisjunctionMaxQuery((id:solr^10.0 | text:solr^0.5 | cat:solr^1.4 | manu:solr^1.1 | name:solr^1.2 | features:solr | sku:solr^1.5)~0.01)
          DisjunctionMaxQuery((manu_exact:solr^1.9 | features:solr^1.1 | text:solr^0.2 | manu:solr^1.4 | name:solr^1.5)~0.01)
          FunctionQuery((top(ord(popularity)))^0.5)
          FunctionQuery((1000.0/(1.0*float(top(rord(price)))+1000.0))^0.3)
        </str>
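The field boosts visible in the parsed query can also be passed explicitly on the URL. The sketch below formats them as field^boost lists; reading the first DisjunctionMaxQuery as qf, the second as pf, and the "~0.01" as tie is an inference from the output, and the 1.0 for features is an assumed default since no boost is printed for it.

```python
from urllib.parse import urlencode

# Boosts read off the parsedquery; the qf/pf split is an inference, not stated on the slide
qf_boosts = {"id": 10.0, "text": 0.5, "cat": 1.4, "manu": 1.1,
             "name": 1.2, "features": 1.0, "sku": 1.5}   # features: assumed 1.0
pf_boosts = {"manu_exact": 1.9, "features": 1.1, "text": 0.2,
             "manu": 1.4, "name": 1.5}

def boost_list(boosts):
    """Format a {field: boost} mapping as a space-separated field^boost list."""
    return " ".join(f"{field}^{boost}" for field, boost in boosts.items())

params = {
    "q": "solr",
    "qt": "dismax",
    "qf": boost_list(qf_boosts),
    "pf": boost_list(pf_boosts),
    "tie": "0.01",
    "debugQuery": "true",
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```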
  • DisMax explain output for a single word query
      <lst name="explain">
        <str name="SOLR1000">
          0.74609417 = (MATCH) sum of:
            0.4476144 = (MATCH) max plus 0.01 times others of:
              0.026233677 = (MATCH) weight(text:solr^0.5 in 13), product of:
                0.04119147 = queryWeight(text:solr^0.5), product of:
                  0.5 = boost
                  3.6026897 = idf(docFreq=1, numDocs=26)
                  0.022867065 = queryNorm
                0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
                  1.4142135 = tf(termFreq(text:solr)=2)
                  3.6026897 = idf(docFreq=1, numDocs=26)
                  0.125 = fieldNorm(field=text, doc=13)
              0.17808011 = (MATCH) weight(name:solr^1.2 in 13), product of:
                0.09885953 = queryWeight(name:solr^1.2), product of:
                  1.2 = boost
                  3.6026897 = idf(docFreq=1, numDocs=26)
                  0.022867065 = queryNorm
                1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
                  1.0 = tf(termFreq(name:solr)=1)
                  3.6026897 = idf(docFreq=1, numDocs=26)
                  0.5 = fieldNorm(field=name, doc=13)
              0.03710002 = (MATCH) weight(features:solr in 13), product of:
                0.08238294 = queryWeight(features:solr), product of:
                  3.6026897 = idf(docFreq=1, numDocs=26)
                  0.022867065 = queryNorm
                0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
                  1.0 = tf(termFreq(features:solr)=1)
                  3.6026897 = idf(docFreq=1, numDocs=26)
                  0.125 = fieldNorm(field=features, doc=13)
              0.44520026 = (MATCH) weight(sku:solr^1.5 in 13), product of:
                0.12357441 = queryWeight(sku:solr^1.5), product of:
                  1.5 = boost
                  3.6026897 = idf(docFreq=1, numDocs=26)
                  0.022867065 = queryNorm
                3.6026897 = (MATCH) fieldWeight(sku:solr in 13), product of:
                  1.0 = tf(termFreq(sku:solr)=1)
                  3.6026897 = idf(docFreq=1, numDocs=26)
                  1.0 = fieldNorm(field=sku, doc=13)
            0.22311316 = (MATCH) max plus 0.01 times others of:
              0.040810023 = (MATCH) weight(features:solr^1.1 in 13), product of:
                0.09062123 = queryWeight(features:solr^1.1), product of:
                  1.1 = boost
                  3.6026897 = idf(docFreq=1, numDocs=26)
                  0.022867065 = queryNorm
                0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
                  1.0 = tf(termFreq(features:solr)=1)
                  3.6026897 = idf(docFreq=1, numDocs=26)
                  0.125 = fieldNorm(field=features, doc=13)
              0.01049347 = (MATCH) weight(text:solr^0.2 in 13), product of:
                0.016476588 = queryWeight(text:solr^0.2), product of:
                  0.2 = boost
                  3.6026897 = idf(docFreq=1, numDocs=26)
                  0.022867065 = queryNorm
                0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
                  1.4142135 = tf(termFreq(text:solr)=2)
                  3.6026897 = idf(docFreq=1, numDocs=26)
                  0.125 = fieldNorm(field=text, doc=13)
              0.22260013 = (MATCH) weight(name:solr^1.5 in 13), product of:
                0.12357441 = queryWeight(name:solr^1.5), product of:
                  1.5 = boost
                  3.6026897 = idf(docFreq=1, numDocs=26)
                  0.022867065 = queryNorm
                1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
                  1.0 = tf(termFreq(name:solr)=1)
                  3.6026897 = idf(docFreq=1, numDocs=26)
                  0.5 = fieldNorm(field=name, doc=13)
            0.06860119 = (MATCH) FunctionQuery(top(ord(popularity))), product of:
              6.0 = ord(popularity)=6
              0.5 = boost
              0.022867065 = queryNorm
            0.0067654043 = (MATCH) FunctionQuery(1000.0/(1.0*float(top(rord(price)))+1000.0)), product of:
              0.9861933 = 1000.0/(1.0*float(rord(price)=14)+1000.0)
              0.3 = boost
              0.022867065 = queryNorm
        </str>
      </lst>
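The top-level score in this explain output can be rebuilt from its four parts. The sketch below copies the per-field weights and function-query inputs from the output and applies "max plus 0.01 times others" to each field group; which field weight belongs to which group is read from the boosts in the clause names, so treat the grouping as an interpretation.

```python
TIE = 0.01  # from "max plus 0.01 times others of"

def max_plus_tie(scores, tie=TIE):
    """Best score plus tie times the sum of the remaining scores."""
    best = max(scores)
    return best + tie * (sum(scores) - best)

# Per-field weights copied from the explain output for doc 13
group1 = [0.026233677, 0.17808011, 0.03710002, 0.44520026]  # text^0.5, name^1.2, features, sku^1.5
group2 = [0.040810023, 0.22260013, 0.01049347]              # features^1.1, name^1.5, text^0.2

query_norm = 0.022867065
popularity_boost = 6.0 * 0.5 * query_norm                       # ord(popularity)=6, boost 0.5
price_boost = (1000.0 / (1.0 * 14 + 1000.0)) * 0.3 * query_norm  # rord(price)=14, boost 0.3

total = max_plus_tie(group1) + max_plus_tie(group2) + popularity_boost + price_boost
print(round(total, 7))  # should land near the reported 0.74609417
```

Most of the score comes from the single best field in each group (sku and name here); the other fields and the two boost functions contribute only small amounts.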
  • “Explain” example:
      ...
      0.026233677 = (MATCH) weight(text:solr^0.5 in 13), product of:
        0.04119147 = queryWeight(text:solr^0.5), product of:
          0.5 = boost
          3.6026897 = idf(docFreq=1, numDocs=26)
          0.022867065 = queryNorm
        0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
          1.4142135 = tf(termFreq(text:solr)=2)
          3.6026897 = idf(docFreq=1, numDocs=26)
          0.125 = fieldNorm(field=text, doc=13)
      0.17808011 = (MATCH) weight(name:solr^1.2 in 13), product of:
        0.09885953 = queryWeight(name:solr^1.2), product of:
          1.2 = boost
          3.6026897 = idf(docFreq=1, numDocs=26)
          0.022867065 = queryNorm
        1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
          1.0 = tf(termFreq(name:solr)=1)
          3.6026897 = idf(docFreq=1, numDocs=26)
          0.5 = fieldNorm(field=name, doc=13)
      0.03710002 = (MATCH) weight(features:solr in 13), product of:
        0.08238294 = queryWeight(features:solr), product of:
          3.6026897 = idf(docFreq=1, numDocs=26)
          0.022867065 = queryNorm
        0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
          1.0 = tf(termFreq(features:solr)=1)
          3.6026897 = idf(docFreq=1, numDocs=26)
          0.125 = fieldNorm(field=features, doc=13)
      ...
      • Callouts on the slide: tf(termFreq(text:solr)=2) and idf(docFreq=1, numDocs=26)
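Each clause in this excerpt is the product of a query-side weight and a field-side weight. The sketch below recomputes the text:solr^0.5 clause from its leaf values, all copied from the excerpt; only the variable names are invented.

```python
# Leaf values copied from the text:solr^0.5 clause
boost = 0.5
idf = 3.6026897
query_norm = 0.022867065
tf = 1.4142135
field_norm = 0.125

query_weight = boost * idf * query_norm   # queryWeight: depends only on the query
field_weight = tf * idf * field_norm      # fieldWeight: depends on the document
clause_weight = query_weight * field_weight

print(query_weight, field_weight, clause_weight)
```

Note that idf enters twice, once on each side, so a rare term is rewarded quadratically in this formula.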
  • Solr’s XSLT “debugger”
      http://localhost:8983/solr/select?q=solr&debugQuery=true&wt=xslt&tr=example.xsl&fl=*,score&qt=dismax
  • Another way to view Explain data
      • Solr 1.4 has Solritas
      • Various features, including toggle explain display
      • “Some assembly required…”
      • http://www.lucidimagination.com/blog/2009/11/04/solritas-solr-1-4s-hidden-gem/
  • Checking your Index and IDF
  • Checking what got Indexed
      • Bad Index = Bad Search
          • Check upper / lower case and punctuation
          • Bad fields / metadata = bad facets, filters, sorting
      • Use the built-in Schema Browser:
          • Check each field
          • Common words: check their IDF (“Inverse Document Frequency”)
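To see why common words matter, here is a sketch of inverse document frequency. It assumes the classic Lucene default formula, 1 + ln(numDocs / (docFreq + 1)); the slides do not state the formula. The document count of 27 is a guess that happens to reproduce the 3.6026897 seen in the explain output, which itself prints numDocs=26.

```python
import math

def classic_idf(doc_freq, num_docs):
    """Assumed classic Lucene idf: 1 + ln(num_docs / (doc_freq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

rare = classic_idf(1, 27)     # term found in a single document
common = classic_idf(20, 27)  # hypothetical term found in most documents

print(round(rare, 7), round(common, 7))
```

A term that appears in nearly every document gets an idf close to 1 and barely separates one result from another, which is what the Schema Browser's term counts help you spot.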
  • Check IDF w/ the Schema Browser
      • Start at the Admin Screen: http://localhost:8983/solr/admin
      • Schema Browser
          • select a field
          • change # to see more
  • About NIE: New Idea Engineering
  • NIE Resources
      • Newsletter & Whitepapers: www.ideaeng.com/current
      • Search Dev Newsgroup: www.SearchDev.org
      • Blogs: EnterpriseSearchBlog.com, SearchComponentsOnline.com
  • Finish Line / Q & A
      • Review & Questions
      • Mark Bennett, mbennett@ideaeng.com, main 408-446-3460, cell 408-829-6513
  • Q&A: These slides and a recorded presentation are available at bit.ly/SolrRelevancy