"Best Practices" in Museum Search .. in my  (r esearched)  opinion Nate Solas #mcn2011 #search @homebrewer [emai...
Best practices in museum search

Slides from my workshop at MCN2011

  1. 1. &quot;Best Practices&quot; in Museum Search ... in my (researched) opinion Nate Solas #mcn2011 #search @homebrewer [email_address] http://bit.ly/mcn2011search
  2. 2. Search is hard... ... shouldn't we just leave this to Google?
  3. 3. &quot;Leave it to Google&quot; IS a best practice! <ul><li>For them, it's a solved problem. They have absolutely solved searching for content on websites, especially a finite domain like a museum website. </li></ul><ul><li>http://www.powerhousemuseum.com/search/index.php?cx=018242116655519399236%3A4srvv8yns7w&q=blue&sa=&cof=FORID%3A11&siteurl=www.powerhousemuseum.com%2Fvisit%2F </li></ul><ul><li>http://www.tate.org.uk/search/default.jsp?q=blue </li></ul><ul><li>http://www.brooklynmuseum.org/ </li></ul><ul><li>http://www.amnh.org/ </li></ul><ul><li>http://si.edu/  (GSA) </li></ul>
  4. 4. <title>We can do more to help</title> <ul><li><article> </li></ul><ul><li>Mark the content! Google indexes ALL the words, so all of our nav, advertising, footer... If we don't indicate what's the &quot;content&quot;, it's all fair game (sort of. They're actually smarter than that.) </li></ul><ul><li><sidebar> </li></ul><ul><li>Meta tags (OG), RDFa, valid HTML5 markup, etc. </li></ul><ul><li></sidebar> </li></ul><ul><li></article> </li></ul>
  5. 5. Internal search: yes <ul><ul><li>We ( should ) know the most about our content, so we know: </li></ul></ul><ul><ul><ul><li>how to suggest things </li></ul></ul></ul><ul><ul><ul><li>how to interpret queries in context (run the search) </li></ul></ul></ul><ul><ul><ul><li>how to present things to make sense </li></ul></ul></ul><ul><ul><ul><ul><li>It's no longer just a 'web page'! </li></ul></ul></ul></ul><ul><ul><li>  We ( should ) have the content as discrete pieces of metadata: title, date, body, author, etc. </li></ul></ul><ul><ul><ul><li>We can therefore index  just  the content, none of the other chrome on the page. </li></ul></ul></ul><ul><ul><ul><li>Facets: we can use this metadata to drill down. </li></ul></ul></ul>
  6. 6. Phases of search: <ul><li>... let's just look at three parts: </li></ul><ul><li>the query , </li></ul><ul><li>results , </li></ul><ul><li>& </li></ul><ul><li>dead ends </li></ul>
  7. 7. The Query <ul><li>Search box, top right. Done. </li></ul><ul><li>(Powerhouse Museum has it bottom left, but they're in Australia so this makes sense. ;) </li></ul><ul><ul><li>If there's text in the box (&quot;search&quot;), clear it when they click in! </li></ul></ul><ul><ul><li>Autocomplete / suggest isn't really common (yet), but seems very useful where it shows up. </li></ul></ul><ul><ul><ul><li>Three strategies I see: </li></ul></ul></ul><ul><ul><ul><ul><li>Suggest page (&quot;live search taxonomy&quot;) ( http://www.imamuseum.org/ ) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Suggest tag/title ( http://www.vam.ac.uk/ ) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Suggest phrase from full corpus ( http://beta.walkerart.org/  (beta)) </li></ul></ul></ul></ul>
  8. 8. Suggest / Autocomplete <ul><li>Full text autocomplete is sort of the holy grail, IMO, but we can't be as smart as Google. </li></ul><ul><li>IMA does &quot;live search&quot; (auto-suggest) instead of autocomplete, very useful but it doesn't help me spell Lichtenstein. </li></ul><ul><li>The real point is to eliminate dead ends. </li></ul>
  9. 9. Results <ul><li>Questions your result page should answer immediately: </li></ul><ul><ul><li>What are these things? </li></ul></ul><ul><ul><li>Why did they match (and why in that order?) </li></ul></ul><ul><ul><li>Was I understood / can I try again easily? </li></ul></ul><ul><li>Finally: </li></ul><ul><ul><li>What's next? </li></ul></ul><ul><ul><ul><li>try some results or </li></ul></ul></ul><ul><ul><ul><li>narrow (refine) search or </li></ul></ul></ul><ul><ul><ul><li>broaden search </li></ul></ul></ul>
  10. 10. WHAT are these things? <ul><li>Mixed results (&quot;All&quot;) </li></ul><ul><li>MOMA gets it: </li></ul><ul><ul><li>http://www.moma.org/search?query=blue </li></ul></ul><ul><ul><ul><li>Full breadcrumb, excerpt, title, media if they have it. </li></ul></ul></ul><ul><li>This is confusing at first: </li></ul><ul><ul><li>http://www.metmuseum.org/search-results?ft=blue </li></ul></ul><ul><li>Separate results </li></ul><ul><li>V&A splits into sections </li></ul><ul><ul><li>http://www.vam.ac.uk/contentapi/search/?q=blue&search-submit=Go </li></ul></ul><ul><ul><ul><li>.. but some of the &quot;articles&quot; aren't articles. </li></ul></ul></ul><ul><li>MFA sections and staggers </li></ul><ul><ul><li>http://www.mfa.org/search/mfa/blue </li></ul></ul><ul><li>Careful. This sort of assumes people know what they're looking for. </li></ul>
  11. 11. ...um
  12. 12. Why did they match (in this order)? <ul><ul><li>Highlight the match, if possible </li></ul></ul><ul><ul><li>Sort by relevance </li></ul></ul><ul><ul><ul><li>(But see section on &quot;boosting&quot;...) </li></ul></ul></ul><ul><ul><li>If you're splitting up content, it's hard to explain. </li></ul></ul><ul><ul><ul><li>...best result could be at the bottom of the page </li></ul></ul></ul><ul><ul><ul><ul><li>... so ... don't. Let user do this. </li></ul></ul></ul></ul>
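A minimal sketch of the "highlight the match" idea from the slide above, written in Python. It assumes a plain-text excerpt and a whitespace-separated query; the function name `highlight` and the use of `<mark>` tags are illustrative choices, not anything from the workshop. A real Solr setup would more likely rely on the engine's own hit highlighting.

```python
import html
import re


def highlight(text, query):
    """Wrap whole-word query matches in <mark> tags, escaping everything else."""
    words = [re.escape(word) for word in query.split()]
    if not words:
        return html.escape(text)
    # One capturing group, so re.split returns non-match / match pieces alternately.
    pattern = re.compile(r"\b(" + "|".join(words) + r")\b", re.IGNORECASE)
    pieces = pattern.split(text)
    out = []
    for index, piece in enumerate(pieces):
        escaped = html.escape(piece)
        # Odd positions hold the captured matches.
        out.append(f"<mark>{escaped}</mark>" if index % 2 == 1 else escaped)
    return "".join(out)


print(highlight("Blue Phase & other works", "blue"))
```

Escaping each piece separately keeps the query from matching inside an HTML entity such as `&amp;`.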
  13. 13. Was I understood / Can I try again? <ul><li>MFA site:  http://www.mfa.org/search/mfa/blue </li></ul><ul><ul><li>Without the URL hint, can you even tell what was searched for? </li></ul></ul><ul><ul><ul><li>And what if you want to add a single word? </li></ul></ul></ul><ul><ul><ul><ul><li>(WAC site is guilty of this. Blame the designer. ;-) </li></ul></ul></ul></ul><ul><li>A few &quot;not like this&quot; examples: </li></ul><ul><ul><li>&quot;blue phase&quot; </li></ul></ul><ul><ul><ul><li>http://www.vam.ac.uk/contentapi/search/?q=%22blue+phase%22&search-submit=Go </li></ul></ul></ul><ul><ul><ul><li>http://www.imamuseum.org/search/ima/%22blue%20phase%22 </li></ul></ul></ul><ul><li>(People are going to use quotes!) </li></ul>
  14. 14. Was I  really understood? <ul><li>We know what you want: &quot;Hours&quot; </li></ul><ul><ul><li>http://www.britishmuseum.org/search_results.aspx?searchText=hours </li></ul></ul><ul><ul><li>http://www.moma.org/search?query=hours&page=1 </li></ul></ul><ul><ul><li>http://beta.walkerart.org/search/?q=hours </li></ul></ul><ul><li>&quot;We have a special “live search” taxonomy for explicitly boosting content pages we know people are searching for. E.g. “jobs” on our employment page; “love” is our Love sculpture, not the hundreds of other works, “wedding” is for facility rentals, not our hundred wedding dresses in the collection.&quot; </li></ul><ul><li>-- Charlie Moad, IMA </li></ul><ul><li>Do me a favor: </li></ul><ul><ul><li>http://beta.walkerart.org/search/?q=articel </li></ul></ul><ul><ul><li>http://www.vam.ac.uk/contentapi/search/?suggest=article&q=articel </li></ul></ul><ul><ul><ul><li>(again, a bit confusing but right) </li></ul></ul></ul>
  15. 15. Narrow results with facets <ul><li>Awesome: </li></ul><ul><ul><li>si.edu collections </li></ul></ul><ul><ul><ul><li>  http://collections.si.edu/search/results.jsp?q=blue </li></ul></ul></ul><ul><li>Good: </li></ul><ul><ul><li>IMA </li></ul></ul><ul><ul><ul><li>http://www.imamuseum.org/search/ima/blue </li></ul></ul></ul><ul><ul><li>WAC (I'm biased) </li></ul></ul><ul><ul><ul><li>http://beta.walkerart.org/magazine/type/articles/genre/film </li></ul></ul></ul><ul><li>Less awesome: </li></ul><ul><ul><li>British Museum </li></ul></ul><ul><ul><ul><li>http://www.britishmuseum.org/search_results.aspx?searchText=blue&searchPrevious=blue&itemsPerPage=10 </li></ul></ul></ul>
  16. 16. Broaden results <ul><ul><li>Similar searches / More Like This </li></ul></ul><ul><ul><ul><li>  http://beta.walkerart.org/search/?q=absent+landlord </li></ul></ul></ul><ul><ul><ul><li>  http://www.powerhousemuseum.com/collection/database/search_tags.php?tag=blue </li></ul></ul></ul><ul><ul><ul><li>http://www.vam.ac.uk/contentapi/search/?q=%22blue+phase%22&search-submit=Go </li></ul></ul></ul><ul><ul><ul><ul><li>Sort of weird, though. </li></ul></ul></ul></ul><ul><ul><li>More Like This </li></ul></ul><ul><ul><ul><li>  We're trying it on detail pages: </li></ul></ul></ul><ul><ul><ul><ul><li>  http://beta.walkerart.org/calendar/2011/merce-cunningham-dance-company </li></ul></ul></ul></ul>
  17. 17. Dead ends / spell check <ul><li>&quot;Did you mean?&quot; </li></ul><ul><ul><li>http://beta.walkerart.org/search/?q=absent+landlord </li></ul></ul><ul><ul><li>http://www.vam.ac.uk/contentapi/search/?q=blu&search-submit=Go </li></ul></ul><ul><li>This is really just spellcheck. But it's apparently really hard, since nobody's doing it. </li></ul>
  18. 18. Final thoughts <ul><li>Can we just spider our own pages like Google? </li></ul><ul><ul><li>Sure. Lots of tools to do this, and it looks like that's how MOMA does it. </li></ul></ul><ul><ul><ul><li>However...  http://www.moma.org/search?query=%22ad+reinhardt%22+%22sum+of+days%22&page=1 </li></ul></ul></ul><ul><ul><ul><li>http://www.moma.org/search?query=blu&page=1  (look at the mp4!) </li></ul></ul></ul><ul><li>Boosting </li></ul><ul><ul><li>what kind of boosting makes sense? </li></ul></ul><ul><ul><ul><li>weight towards recent content </li></ul></ul></ul><ul><ul><ul><ul><li>push down past events, maybe </li></ul></ul></ul></ul><ul><ul><ul><li>&quot;we know what you want&quot; </li></ul></ul></ul><ul><ul><ul><ul><li>look at logs to see what people are searching for </li></ul></ul></ul></ul>
  19. 19. So... &quot;best practices&quot; <ul><ul><li>Unified search across all content </li></ul></ul><ul><ul><ul><li>full-text search with stemming, phrases, etc. </li></ul></ul></ul><ul><ul><li>Coherent, user-centric divisions of content for faceting </li></ul></ul><ul><ul><li>Prevent dead ends </li></ul></ul><ul><ul><ul><li>show #s for facets </li></ul></ul></ul><ul><ul><ul><li>autocomplete query </li></ul></ul></ul><ul><ul><li>Help the user </li></ul></ul><ul><ul><ul><li>&quot; Did you mean? &quot; </li></ul></ul></ul><ul><ul><ul><ul><li>Or just give it to them, don't ask </li></ul></ul></ul></ul>
  20. 20. Let's build that! <ul><li>&quot;Solr is an open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable.&quot;--  http://en.wikipedia.org/wiki/Solr </li></ul>
  21. 21. There's a tool for you... <ul><li>http://wiki.apache.org/solr/IntegratingSolr </li></ul><ul><ul><li>ColdFusion  - ColdFusion 9 now includes Apache Solr </li></ul></ul><ul><ul><li>Django  -  Haystack </li></ul></ul><ul><ul><li>Drupal  - A Drupal module that integrates Apache Solr in Drupal. </li></ul></ul><ul><ul><li>eZ Find  - eZ Find, a solid solr integration to the open source CMS eZ Publish </li></ul></ul><ul><ul><li>Forrest/Cocoon  -  SolrForrest </li></ul></ul><ul><ul><li>Foswiki  - A Foswiki plugin that integrates Apache Solr in Foswiki. </li></ul></ul><ul><ul><li>Plone  -  collective.solr </li></ul></ul><ul><ul><li>SVN  -  reposearch </li></ul></ul><ul><ul><li>TYPO3 </li></ul></ul><ul><ul><li>Various Library Catalog Applications -  Solr4Lib </li></ul></ul><ul><ul><li>Woltlab Community Framework  - A WCF package working with the burning board, the blog and all other WCF components. </li></ul></ul><ul><ul><li>WordPress  -  solr-for-wordpress  A  WordPress  plugin that replaces the default  WordPress  search with Solr. </li></ul></ul><ul><ul><li>ZooKeeperIntegration </li></ul></ul><ul><ul><li>OpenCms  -  opencms-solr </li></ul></ul>
  22. 22. Hurry, hurry! <ul><ul><li>introducing Solr </li></ul></ul><ul><ul><li>build fulltext search & introduce dismax </li></ul></ul><ul><ul><li>facets </li></ul></ul><ul><ul><li>build autocomplete </li></ul></ul><ul><ul><li>did you mean? </li></ul></ul>
  23. 23. Installation, fast test <ul><li>user:~solr$ ls </li></ul><ul><li>solr-nightly.zip </li></ul><ul><li>user:~solr$ unzip -q solr-nightly.zip </li></ul><ul><li>user:~solr$ cd solr-nightly/example/ </li></ul><ul><li>user:~/solr/example$ java -jar start.jar </li></ul><ul><li>That's it! You can actually do local development against that sort of setup and it works fine. </li></ul>
  24. 24. Installation, f'realz (Ubuntu) <ul><li>apt-get install build-essential jetty </li></ul><ul><li>    libjetty-extra openjdk-6-jdk </li></ul><ul><li>cp dist/apache-solr-3.4.0.war </li></ul><ul><li>    /usr/share/jetty/webapps/solr.war </li></ul><ul><li>cp -r example/solr /usr/share/jetty/ </li></ul><ul><li>edit /usr/share/jetty/solr/conf/schema.xml and solrconfig.xml </li></ul><ul><li>edit /etc/default/jetty: turn off no-start, make it bind to all ips, and set the java opts: </li></ul><ul><li>JAVA_OPTIONS=&quot;-Dsolr.solr.home=/usr/share/jetty/solr -Dsolr.data.dir=/usr/share/jetty/solr/data $JAVA_OPTIONS&quot; </li></ul><ul><li>/etc/init.d/jetty start </li></ul>
  25. 25. For today: <ul><li>http://172.16.0.67/ </li></ul>
  26. 26. Explore the fieldtypes: core0 <ul><li>Get the sample text on </li></ul><ul><li>your clipboard. </li></ul><ul><li>In core0, click Admin, </li></ul><ul><li>then Analysis </li></ul><ul><li>Field Names: </li></ul><ul><li>id (string) </li></ul><ul><li>text_ws </li></ul><ul><li>text_general </li></ul><ul><li>text_en </li></ul><ul><li>phonetic </li></ul><ul><li>text_general_rev </li></ul><ul><li>alphaonlysort </li></ul>
  27. 27. core1: fulltext search engine <ul><li>Click search on core1. Try it out. </li></ul><ul><li>(dataset is Walker Art Center events) </li></ul><ul><li>Click &quot;edit&quot; on core1. Discuss. </li></ul>
  28. 28. core1: dismax query parser <ul><li>DisMax  is an abbreviation of Disjunction Max, and is a popular query mode with Solr. </li></ul><ul><li>Disjunction  refers to the fact that your search is executed across multiple fields, e.g. title, body and keywords, with different relevance weights. </li></ul><ul><li>Max  means that if your word &quot;foo&quot; matches both title and body, the max score of these two (probably title match) is added to the score, not the sum of the two as a simple OR query would do. This gives more control over your ranking. </li></ul>
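The "max, not sum" behaviour described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Solr's actual scoring code: the per-field scores and boosts are made-up numbers, and the function names are hypothetical. The `tie` argument mirrors the tie-breaker idea in DisMax, where a fraction of the non-winning field scores is added back.

```python
def dismax_score(field_scores, boosts, tie=0.0):
    """Score one term: best boosted field wins, others count only via `tie`."""
    weighted = [score * boosts.get(field, 1.0) for field, score in field_scores.items()]
    if not weighted:
        return 0.0
    best = max(weighted)
    return best + tie * (sum(weighted) - best)


def sum_score(field_scores, boosts):
    """What a simple OR across fields would do: add every boosted field score."""
    return sum(score * boosts.get(field, 1.0) for field, score in field_scores.items())


# Hypothetical raw scores for the word "foo" in two fields.
scores = {"title": 0.5, "body": 0.25}
boosts = {"title": 2.0, "body": 1.0}

print(dismax_score(scores, boosts))           # title match alone decides
print(dismax_score(scores, boosts, tie=0.5))  # body contributes half its score
print(sum_score(scores, boosts))              # simple OR adds both
```

With `tie=0.0` a document cannot outrank another just by repeating the same word in many fields, which is the extra ranking control the slide refers to.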
  29. 29. core1: dismax in practice <ul><li>The DisMaxQParserPlugin is designed to process simple user entered phrases (without heavy syntax) and search for the individual words across several fields using different weighting (boosts) based on the significance of each field. </li></ul><ul><li>In English: it does a really good job helping you figure out what the user meant to look for. </li></ul>
  30. 30. Try some quotes <ul><li>chuck close </li></ul><ul><li>vs. </li></ul><ul><li>chuck &quot;close&quot; </li></ul><ul><li>Debug: what's going on? </li></ul>
  31. 31. core2: facets <ul><li>99% chance your Solr library will abstract this for you, but it's good to know what's under the hood. </li></ul><ul><li>... we won't do it today, but you can facet by  queries , not just field names. </li></ul><ul><li>So you can do things like this in one call: </li></ul><ul><ul><li>Give me all events matching the query </li></ul></ul><ul><ul><li>Show how many by type (like we're doing) </li></ul></ul><ul><ul><li>Show how many are happening today </li></ul></ul><ul><ul><li>Show how many are happening &quot;this weekend&quot; </li></ul></ul><ul><ul><li>... etc. </li></ul></ul><ul><ul><li>http://beta.walkerart.org/calendar/type/free-events </li></ul></ul>
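As a rough illustration of the "one call" idea above, here is a Python sketch that builds a single Solr request combining a field facet with two query facets. `facet`, `facet.field` and `facet.query` are standard Solr parameters; the field names `type` and `start_date`, the `/select` path, and the seven-day range standing in for "this weekend" are assumptions made for the example.

```python
from urllib.parse import urlencode


def facet_params(user_query):
    """Return request parameters as (name, value) pairs; repeated names are allowed."""
    return [
        ("q", user_query),
        ("facet", "true"),
        # Count matching events per type (hypothetical field name).
        ("facet.field", "type"),
        # Count matching events that start today (hypothetical field name).
        ("facet.query", "start_date:[NOW/DAY TO NOW/DAY+1DAY]"),
        # Count matching events in the next seven days, a stand-in for "this weekend".
        ("facet.query", "start_date:[NOW/DAY TO NOW/DAY+7DAYS]"),
    ]


def facet_url(base, user_query):
    """Encode the parameters onto a base path such as /select."""
    return base + "?" + urlencode(facet_params(user_query))


print(facet_url("/select", "dance"))
```

A list of pairs is used instead of a dict because `facet.query` appears more than once in the same request.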
  32. 32. core3: Autocomplete (a) <ul><li>Read this later: </li></ul><ul><li>http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ </li></ul><ul><li>This is a very popular and decent solution. It really only works the way he suggests, though, by seeding with popular queries (since it starts at character 0). If you have this data, go for it, but our top queries actually aren't very interesting: &quot;jobs&quot;, &quot;staff&quot;, &quot;hours&quot;, etc. </li></ul><ul><li>We want something that can complete any phrase that occurs in our corpus (a), ideally in the middle of the phrase (b). </li></ul>
  33. 33. Key technologies <ul><li>ShingleFilterFactory </li></ul><ul><li>Make tokens out of phrases. </li></ul><ul><li>TermsComponent </li></ul><ul><li>&quot;return terms and document frequency of those terms&quot; </li></ul><ul><li>Post-processing for stopwords </li></ul><ul><li>Index them in phrases, but remove from suggestions in certain scenarios </li></ul>
  34. 34. ShingleFilterFactory <ul><li>    <fieldType name=&quot;shingle_text&quot; class=&quot;solr.TextField&quot; positionIncrementGap=&quot;100&quot;> </li></ul><ul><li>        <analyzer type=&quot;index&quot;> </li></ul><ul><li>            <charFilter class=&quot;solr.HTMLStripCharFilterFactory&quot; /> </li></ul><ul><li>            <tokenizer class=&quot;solr.StandardTokenizerFactory&quot;/> </li></ul><ul><li>            <filter class=&quot;solr.LowerCaseFilterFactory&quot;/> </li></ul><ul><li>            <filter class=&quot;solr.ASCIIFoldingFilterFactory&quot;/> </li></ul><ul><li>            <!--<filter class=&quot;solr.StopFilterFactory&quot; ignoreCase=&quot;true&quot; words=&quot;stopwords.txt&quot; enablePositionIncrements=&quot;true&quot; />--> </li></ul><ul><li>            <filter class=&quot;solr.ShingleFilterFactory&quot; maxShingleSize=&quot;5&quot; /> </li></ul><ul><li>        </analyzer> </li></ul><ul><li>        <analyzer type=&quot;query&quot;> </li></ul><ul><li>            <charFilter class=&quot;solr.HTMLStripCharFilterFactory&quot; /> </li></ul><ul><li>            <tokenizer class=&quot;solr.StandardTokenizerFactory&quot;/> </li></ul><ul><li>            <filter class=&quot;solr.LowerCaseFilterFactory&quot;/> </li></ul><ul><li>            <filter class=&quot;solr.ASCIIFoldingFilterFactory&quot;/> </li></ul><ul><li>        </analyzer> </li></ul><ul><li>    </fieldType> </li></ul>
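To make "tokens out of phrases" concrete, here is a small Python imitation of what a shingle filter produces. It is a sketch of the idea, not the Lucene implementation: it assumes the input is already tokenized and lowercased, and the parameter names are chosen to echo `maxShingleSize` rather than taken from any API.

```python
def shingles(tokens, max_size=5, min_size=2, output_unigrams=True):
    """Return word n-grams (shingles) starting at each token position."""
    out = []
    for start in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[start])
        for size in range(min_size, max_size + 1):
            # Stop once the shingle would run past the end of the token list.
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


print(shingles(["walker", "art", "center"], max_size=3))
# ['walker', 'walker art', 'walker art center', 'art', 'art center', 'center']
```

Each shingle becomes its own indexed term, which is why a terms lookup can later complete whole phrases rather than single words.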
  35. 35. TermsComponent <ul><li><!-- in solrconfig.xml --> </li></ul><ul><li>        <arr name=&quot;last-components&quot;> </li></ul><ul><li>            <str>terms</str> </li></ul><ul><li>        </arr> </li></ul><ul><li># strict &quot;starts with&quot; </li></ul><ul><li>/select?terms=true&terms.fl=auto_text&terms.prefix=term </li></ul><ul><li>OR </li></ul><ul><li># attempt at &quot;infix&quot; (sloooow on big corpus) </li></ul><ul><li>/select?terms=true&terms.fl=auto_text&terms.regex=(^|.* +)term.* </li></ul>
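The difference between the prefix and infix lookups can be tried out locally with a Python sketch. It applies the same two matching rules to an in-memory list of terms; the sample terms and function names are invented for illustration, and Python's `re` module is only assumed to behave like Solr's regex matching for this simple pattern.

```python
import re


def prefix_matches(terms, typed):
    """Strict "starts with": the term must begin with what the user typed."""
    return [term for term in terms if term.startswith(typed)]


def infix_matches(terms, typed):
    """Match at the start of the term or right after a space inside it."""
    pattern = re.compile(r"(^|.* +)" + re.escape(typed) + r".*")
    # fullmatch: the pattern has to cover the whole term.
    return [term for term in terms if pattern.fullmatch(term)]


# Hypothetical shingled terms from an index.
terms = ["chuck close", "close up", "art up close", "disclosure"]

print(prefix_matches(terms, "clo"))  # ['close up']
print(infix_matches(terms, "clo"))   # ['chuck close', 'close up', 'art up close']
```

The infix rule has to test every term against the regex, which hints at why the slide calls it slow on a big corpus, while a prefix lookup can jump straight to the right place in a sorted term list.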
  36. 36. core4: Autocomplete (b) <ul><li>Infix. Big challenges, decent hacks. </li></ul><ul><li>Smaller shingles. </li></ul><ul><li>Fewer words (only title & subtitle). </li></ul><ul><li>Still... kinda slow in our beta site. Probably have to move to prefix. :( </li></ul>
  37. 37. core5: spellcheck <ul><li>Similar to the setup for autocomplete </li></ul><ul><li>Just remember to call a url with spellcheck.build=true to get things started. </li></ul><ul><li>For better results, use spellcheck.q and escape spaces. This makes it a phrase instead of spellchecking individual words and correcting them to deadends. </li></ul><ul><li>select?q=chuc+closee&spellcheck.q=chuc+closee </li></ul>
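A short Python sketch of assembling the spellcheck request shown above. `spellcheck.q` and `spellcheck.build` come from the slide; adding `spellcheck=true` and using the `/select` path are assumptions about how the request handler is configured, and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def spellcheck_url(base, user_query, build=False):
    """Build a search URL that also asks for spelling suggestions."""
    params = {
        "q": user_query,
        "spellcheck": "true",
        # Pass the raw user phrase so it is checked as typed.
        "spellcheck.q": user_query,
    }
    if build:
        # Only needed once, to build the spellcheck index.
        params["spellcheck.build"] = "true"
    return base + "?" + urlencode(params)


print(spellcheck_url("/select", "chuc closee"))
# /select?q=chuc+closee&spellcheck=true&spellcheck.q=chuc+closee
```

`urlencode` takes care of encoding the space as `+`, matching the example URL on the slide.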
  38. 38. Search is hard... <ul><li>Our content team (and I know the MET too with their new site) constantly struggle to understand why certain results come up over others. They always ask us to make tweaks which inevitably hurt other results. It’s a constant battle for perfection and I have to do a lot of educating. </li></ul><ul><li>Retail results come up over artworks because they actually write good descriptions! We even set our boost on retail to 0.5. </li></ul><ul><li>Why does “after van Gogh” show up before the real “van Gogh”? </li></ul><ul><li>Why does last year’s event show up before this year’s? </li></ul><ul><li>While there are answers to all these, it’s inevitably a slippery slope. My final answer is to usually use the live search taxonomy. It is in place to tell the search engine what users are looking for specific to your institution. People just need to understand that it is a content task just as much as creating a page. </li></ul><ul><li>-- Charlie Moad, IMA </li></ul>
  39. 39. If we're bored <ul><li>ASCII / UTF8 </li></ul><ul><li>http://beta.walkerart.org/search/?q=jerome+bel </li></ul><ul><li>http://beta.walkerart.org/search/?q=J%C3%A9r%C3%B4me+Bel </li></ul><ul><li><!-- remove diacritics BEFORE stemming to match cases without diacritics --> </li></ul><ul><li><filter class=&quot;solr.ASCIIFoldingFilterFactory&quot;/> </li></ul><ul><li>boost in general, elevate.xml </li></ul><ul><li>bq=(instances:{20110927 TO *})^1000 OR (display_type:Walker Shop)^20 OR (display_type:Events)^1 </li></ul><ul><li>http://wiki.apache.org/solr/QueryElevationComponent - &quot;sponsored search&quot; </li></ul><ul><li>Index non-data resources (pdf, docs, etc.): Apache Tika </li></ul>
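The ASCII / UTF8 point above (making "jerome bel" and "Jérôme Bel" hit the same documents) can be approximated in Python. This is only a sketch of the folding idea using Unicode decomposition; Solr's `ASCIIFoldingFilterFactory` covers more characters than this does, and the function name is made up for the example.

```python
import unicodedata


def fold_to_ascii(text):
    """Lowercase the text and drop combining accents after decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


print(fold_to_ascii("Jérôme Bel"))  # jerome bel
print(fold_to_ascii("jerome bel"))  # jerome bel
```

Applying the same folding at index time and at query time is what lets both spellings land on the same indexed term.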
