Add Powerful Full Text Search to Your Web App with Solr

Speaker: Yonik Seeley

Transcript

  • 1. Powerful Full-Text Search with Solr. Yonik Seeley, yonik@apache.org. Web 2.0 Expo, Berlin, 8 November 2007. Download at http://www.apache.org/~yonik
  • 2. What is Lucene • High performance, scalable, full-text search library • Focus: Indexing + Searching Documents – “Document” is just a list of name+value pairs • No crawlers or document parsing • Flexible Text Analysis (tokenizers + token filters) • 100% Java, no dependencies, no config files
  • 3. What is Solr • A full text search server based on Lucene • XML/HTTP, JSON Interfaces • Faceted Search (category counting) • Flexible data schema to define types and fields • Hit Highlighting • Configurable Advanced Caching • Index Replication • Extensible Open Architecture, Plugins • Web Administration Interface • Written in Java5, deployable as a WAR
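  • 4. Basic App [architecture diagram] An indexer posts documents (example: super_name: Mr. Fantastic, name: Reed Richards, category: superhero, powers: elasticity) to http://solr/update. A webapp serving HTML sends a query (powers:agility) to http://solr/select and receives the matching docs in the query response. Inside the servlet container, Solr exposes admin, update, and select endpoints: update is handled by the XML Update Handler and the CSV Update Handler, select by the Standard request handler or a custom request handler, and responses are produced by the XML or JSON response writer. Lucene sits underneath.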
  • 5. Indexing Data HTTP POST to http://localhost:8983/solr/update <add><doc> <field name="id">05991</field> <field name="name">Peter Parker</field> <field name="supername">Spider-Man</field> <field name="category">superhero</field> <field name="powers">agility</field> <field name="powers">spider-sense</field> </doc></add>
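
The add message on slide 5 can be generated rather than written by hand. The following is a minimal Python sketch using only the standard library; the helper name `build_add_xml` is hypothetical, and the actual HTTP POST to the update URL is left out so the snippet runs without a Solr server.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Build a Solr <add> message from a dict (hypothetical helper).

    A list value becomes one <field> element per item, which is how the
    slide represents the multi-valued "powers" field.
    """
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(item)
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(add, encoding="unicode")


spider_man = {
    "id": "05991",
    "name": "Peter Parker",
    "supername": "Spider-Man",
    "category": "superhero",
    "powers": ["agility", "spider-sense"],
}
add_xml = build_add_xml(spider_man)
print(add_xml)
```

Building the message with an XML library instead of string concatenation means characters such as `&` or `<` in field values are escaped automatically.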
  • 6. Indexing CSV data
      Iron Man, Tony Stark, superhero, powered armor | flight
      Sandman, William Baker|Flint Marko, supervillain, sand transform
      Wolverine,James Howlett|Logan, superhero, healing|adamantium
      Magneto, Erik Lehnsherr, supervillain, magnetism|electricity
      http://localhost:8983/solr/update/csv?fieldnames=supername,name,category,powers&separator=,&f.name.split=true&f.name.separator=|&f.powers.split=true&f.powers.separator=|
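
To make the per-field split parameters on slide 6 concrete, here is a small Python sketch. It builds the same update URL (with the separators percent-encoded) and mimics locally what splitting `name` and `powers` on `|` produces. `csv_update_url` and `parse_row` are hypothetical helpers, and `parse_row` is only an illustration: it trims whitespace and ignores CSV quoting, which is a simplification and not a description of Solr's CSV loader.

```python
from urllib.parse import urlencode

FIELDNAMES = ["supername", "name", "category", "powers"]
SPLIT_FIELDS = {"name": "|", "powers": "|"}  # field -> sub-separator, as on the slide


def csv_update_url(base="http://localhost:8983/solr/update/csv"):
    """Assemble the CSV update URL with percent-encoded parameters."""
    params = [("fieldnames", ",".join(FIELDNAMES)), ("separator", ",")]
    for field, sep in SPLIT_FIELDS.items():
        params.append((f"f.{field}.split", "true"))
        params.append((f"f.{field}.separator", sep))
    return base + "?" + urlencode(params)


def parse_row(line):
    """Local illustration of the split: one value per field, lists for split fields."""
    values = [v.strip() for v in line.split(",")]
    doc = dict(zip(FIELDNAMES, values))
    for field, sep in SPLIT_FIELDS.items():
        doc[field] = [part.strip() for part in doc[field].split(sep)]
    return doc


url = csv_update_url()
wolverine = parse_row("Wolverine,James Howlett|Logan, superhero, healing|adamantium")
print(url)
print(wolverine)
```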
  • 7. Data upload methods URL=http://localhost:8983/solr/update/csv • HTTP POST body (curl, HttpClient, etc) curl $URL -H 'Content-type:text/plain; charset=utf-8' --data-binary @info.csv • Multi-part file upload (browsers) • Request parameter ?stream.body='Cyclops, Scott Summers,…' • Streaming from URL (must enable) ?stream.url=file://data/info.csv
  • 8. Indexing with SolrJ // Solr's Java Client API… remote or embedded/local! SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr"); SolrInputDocument doc = new SolrInputDocument(); doc.addField("supername","Daredevil"); doc.addField("name","Matt Murdock"); doc.addField("category","superhero"); server.add(doc); server.commit();
  • 9. Deleting Documents • Delete by Id, most efficient <delete> <id>05591</id> <id>32552</id> </delete> • Delete by Query <delete> <query>category:supervillain</query> </delete>
  • 10. Commit • <commit/> makes changes visible – Triggers static cache warming in solrconfig.xml – Triggers autowarming from existing caches • <optimize/> same as commit, merges all index segments for faster searching [figure: Lucene index segments, showing the files _0.fnm, _0.fdt, _0.fdx, _0.frq, _0.tis, _0.tii, _0.prx, _0.nrm, _0_1.del, _1.fnm, _1.fdt, _1.fdx, […]]
  • 11. Searching http://localhost:8983/solr/select?q=powers:agility&start=0&rows=2&fl=supername,category <response> <result numFound="427" start="0"> <doc> <str name="supername">Spider-Man</str> <str name="category">superhero</str> </doc> <doc> <str name="supername">Mystique</str> <str name="category">supervillain</str> </doc> </result> </response>
  • 12. Response Format • Add &wt=json for JSON formatted response {"result": {"numFound":427, "start":0, "docs": [ {"supername":"Spider-Man", "category":"superhero"}, {"supername":"Mystique", "category":"supervillain"} ] }} • Also Python, Ruby, PHP, SerializedPHP, XSLT
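
A client for the search in slides 11 and 12 needs two steps: assemble the select URL and read the JSON body. The sketch below shows both in Python with the standard library only. `select_url` and `supernames` are hypothetical helpers, and `SAMPLE_BODY` is the slide's example response pasted in as a string, so no server is contacted. The top-level key `result` follows the slide; a live server may name it differently, so check an actual response before relying on it.

```python
import json
from urllib.parse import urlencode


def select_url(query, start=0, rows=10, fields=None, wt="json",
               base="http://localhost:8983/solr/select"):
    """Assemble a select URL; values are percent-encoded by urlencode."""
    params = {"q": query, "start": start, "rows": rows, "wt": wt}
    if fields:
        params["fl"] = ",".join(fields)
    return base + "?" + urlencode(params)


# The example response from the slide, used here instead of a live request.
SAMPLE_BODY = (
    '{"result": {"numFound": 427, "start": 0, "docs": ['
    '{"supername": "Spider-Man", "category": "superhero"}, '
    '{"supername": "Mystique", "category": "supervillain"}]}}'
)


def supernames(body):
    """Return the total hit count and the supername of each returned doc."""
    result = json.loads(body)["result"]
    return result["numFound"], [doc["supername"] for doc in result["docs"]]


url = select_url("powers:agility", rows=2, fields=["supername", "category"])
found, names = supernames(SAMPLE_BODY)
print(url)
print(found, names)
```

Note that `numFound` counts all matches, while `docs` holds only the page selected by `start` and `rows`.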
  • 13. Scoring • Query results are sorted by score descending • VSM – Vector Space Model • tf – term frequency: number of matching terms in field • lengthNorm – number of tokens in field • idf – inverse document frequency • coord – coordination factor, number of matching terms • document boost • query clause boost http://lucene.apache.org/java/docs/scoring.html
  • 14. Explain http://solr/select?q=super fast&indent=on&debugQuery=on <lst name="debug"> <lst name="explain"> <str name="id=Flash,internal_docid=6"> 0.16389132 = (MATCH) product of: 0.32778263 = (MATCH) sum of: 0.32778263 = (MATCH) weight(text:fast in 6), product of: 0.5012072 = queryWeight(text:fast), product of: 2.466337 = idf(docFreq=5) 0.20321926 = queryNorm 0.65398633 = (MATCH) fieldWeight(text:fast in 6), product of: 1.4142135 = tf(termFreq(text:fast)=2) 2.466337 = idf(docFreq=5) 0.1875 = fieldNorm(field=fast, doc=6) 0.5 = coord(1/2) </str> <str name="id=Superman,internal_docid=7"> 0.1365761 = (MATCH) product of:
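
The numbers in the explain output on slide 14 can be reproduced by hand, which is a useful check on how the scoring factors from slide 13 combine. The Python sketch below uses the tf and idf formulas of Lucene's classic default similarity as I understand them (tf is the square root of the term frequency, idf is 1 + ln(numDocs / (docFreq + 1))). The index size of 26 documents is an assumption: it is not shown on the slide, but it is the value that yields the printed idf. `queryNorm` and `fieldNorm` are copied from the explain output rather than derived.

```python
import math


def tf(term_freq):
    """Classic Lucene term-frequency factor: sqrt of the raw count."""
    return math.sqrt(term_freq)


def idf(doc_freq, num_docs):
    """Classic Lucene inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


NUM_DOCS = 26            # assumption: chosen because it reproduces idf=2.466337
QUERY_NORM = 0.20321926  # copied from the explain output
FIELD_NORM = 0.1875      # copied from the explain output

idf_fast = idf(doc_freq=5, num_docs=NUM_DOCS)
query_weight = idf_fast * QUERY_NORM          # queryWeight(text:fast)
field_weight = tf(2) * idf_fast * FIELD_NORM  # fieldWeight(text:fast in 6)
coord = 1 / 2                                 # one of the two query terms matched
score = query_weight * field_weight * coord

print(f"idf={idf_fast:.6f} queryWeight={query_weight:.7f} "
      f"fieldWeight={field_weight:.8f} score={score:.8f}")
```

The results agree with the slide's figures to within rounding of the single-precision values Lucene prints.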
  • 15. Lucene Query Syntax 1. justice league • Equiv: justice OR league • QueryParser default operator is "OR"/optional 2. +justice +league -name:aquaman • Equiv: justice AND league NOT name:aquaman 3. "justice league" -name:aquaman 4. title:spiderman^10 description:spiderman 5. description:"spiderman movie"~100
  • 16. Lucene Query Examples (2) 1. releaseDate:[2000 TO 2007] 2. Wildcard searches: sup?r, su*r, super* 3. spider~ • Fuzzy search: Levenshtein distance • Optional minimum similarity: spider~0.7 4. *:* 5. (Superman AND "Lex Luthor") OR (+Batman +Joker)
  • 17. DisMax Query Syntax • Good for handling raw user queries – Balanced quotes for phrase query – ‘+’ for required, ‘-’ for prohibited – Separates query terms from query structure http://solr/select?qt=dismax &q=super man // the user query &qf=title^3 subject^2 body // field to query &pf=title^2,body // fields to do phrase queries &ps=100 // slop for those phrase q’s &tie=.1 // multi-field match reward &mm=2 // # of terms that should match &bf=popularity // boost function
  • 18. DisMax Query Form • The expanded Lucene Query: +( DisjunctionMaxQuery( title:super^3 | subject:super^2 | body:super) DisjunctionMaxQuery( title:man^3 | subject:man^2 | body:man) ) DisjunctionMaxQuery(title:"super man"~100^2 body:"super man"~100) FunctionQuery(popularity) • Tip: set up your own request handler with default parameters to avoid clients having to specify them
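
The role of the `tie` parameter from slide 17 is easier to see with numbers. A DisjunctionMaxQuery takes the best-scoring field for a term and adds the scores of the other matching fields multiplied by the tie breaker. The Python sketch below shows that combination rule; the function name and the per-field scores are made up for illustration and do not come from a real index.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores for one term: max plus tie times the rest."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Hypothetical scores for the term "super" in title, subject, and body.
scores = [0.9, 0.3, 0.1]

pure_max = dismax_score(scores, tie=0.0)  # only the best field counts
with_tie = dismax_score(scores, tie=0.1)  # other fields add a small reward
plain_sum = dismax_score(scores, tie=1.0) # behaves like summing all fields

print(pure_max, with_tie, plain_sum)
```

With `tie=0` a document matching in three fields scores no higher than one matching only in its best field; a small value such as the slide's `.1` rewards multi-field matches without letting them dominate.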
  • 19. Function Query • Allows adding function of field value to score – Boost recently added or popular documents • Current parser only supports function notation • Example: log(sum(popularity,1)) • sum, product, div, log, sqrt, abs, pow • scale(x, target_min, target_max) – calculates min & max of x across all docs • map(x, min, max, target) – useful for dealing with defaults
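
For readers who want to see what the functions on slide 19 compute, here is a plain-Python approximation. It assumes that `log` is base 10, that `scale` maps values linearly onto the target range using the minimum and maximum across all documents, and that `map` replaces values inside the inclusive range [min, max] with the target. The Python names are hypothetical, and this is a sketch of the semantics, not Solr's implementation.

```python
import math


def log_boost(popularity):
    """Equivalent of log(sum(popularity,1)), assuming a base-10 log."""
    return math.log10(popularity + 1)


def scale(values, target_min, target_max):
    """Linearly rescale all values so they span [target_min, target_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case chosen for this sketch: everything maps to target_min.
        return [float(target_min) for _ in values]
    span = target_max - target_min
    return [target_min + (v - lo) / (hi - lo) * span for v in values]


def map_range(x, range_min, range_max, target):
    """Replace x with target when it falls inside [range_min, range_max]."""
    return target if range_min <= x <= range_max else x


popularities = [0, 5, 10]
print([log_boost(p) for p in popularities])
print(scale(popularities, 1, 2))
# Treat a missing popularity, stored as 0, as if it were 1.
print([map_range(p, 0, 0, 1) for p in popularities])
```

Adding 1 inside the logarithm keeps the boost defined for documents whose popularity is 0, which is the point of the slide's `log(sum(popularity,1))` example.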
  • 20. Boosted Query • Score is multiplied instead of added – New local params <!...> syntax added &q=<!boost b=sqrt(popularity)>super man • Parameter dereferencing in local params &q=<!boost b=$boost v=$userq> &boost=sqrt(popularity) &userq=super man
  • 21. Analysis & Search Relevancy [diagram: two analysis chains side by side] Document indexing analysis: "LexCorp BFG-9000" → WhitespaceTokenizer → LexCorp, BFG-9000 → WordDelimiterFilter catenateWords=1 → Lex, Corp, BFG, 9000, LexCorp → LowercaseFilter → lex, corp, bfg, 9000, lexcorp. Query analysis: "Lex corp bfg9000" → WhitespaceTokenizer → Lex, corp, bfg9000 → WordDelimiterFilter catenateWords=0 → Lex, corp, bfg, 9000 → LowercaseFilter → lex, corp, bfg, 9000. A Match!
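
The two analysis chains on slide 21 can be imitated in a few lines of Python to show why the query matches the document. This is a deliberately simplified stand-in for the real filters: the regular expression splits on punctuation, lower-to-upper case changes, and letter/digit boundaries, and `catenate_words` appends the concatenation of adjacent alphabetic parts. Treating a match as "every query token appears among the index tokens" is also a simplification; all function names are hypothetical.

```python
import re

# One sub-word per match: an uppercase run, a capitalized/lowercase word, or digits.
PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def word_delimiter(token, catenate_words):
    """Split a token into sub-words; optionally add the joined alphabetic run."""
    parts = PART.findall(token)
    out = list(parts)
    if catenate_words:
        run = []
        for part in parts + [""]:  # the empty sentinel flushes the final run
            if part.isalpha():
                run.append(part)
            else:
                if len(run) > 1:
                    out.append("".join(run))
                run = []
    return out


def analyze(text, catenate_words):
    """Whitespace tokenizer, then word delimiter, then lowercase filter."""
    tokens = []
    for raw in text.split():
        tokens.extend(word_delimiter(raw, catenate_words))
    return [t.lower() for t in tokens]


index_tokens = analyze("LexCorp BFG-9000", catenate_words=True)
query_tokens = analyze("Lex corp bfg9000", catenate_words=False)
is_match = set(query_tokens) <= set(index_tokens)

print(index_tokens)
print(query_tokens)
print(is_match)
```

Catenating at index time only is what lets both "lexcorp" and "lex corp" find the document without adding extra terms to every query.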
  • 22. Configuring Relevancy <fieldType name="text" class="solr.TextField"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/> <filter class="solr.StopFilterFactory" words="stopwords.txt"/> <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> </analyzer> </fieldType>
  • 23. Field Definitions • Field Attributes: name, type, indexed, stored, multiValued, omitNorms, termVectors <field name="id" type="string" indexed="true" stored="true"/> <field name="sku" type="textTight" indexed="true" stored="true"/> <field name="name" type="text" indexed="true" stored="true"/> <field name="inStock" type="boolean" indexed="true" stored="false"/> <field name="price" type="sfloat" indexed="true" stored="false"/> <field name="category" type="text_ws" indexed="true" stored="true" multiValued="true"/> • Dynamic Fields <dynamicField name="*_i" type="sint" indexed="true" stored="true"/> <dynamicField name="*_s" type="string" indexed="true" stored="true"/> <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
  • 24. copyField • Copies one field to another at index time • Use case #1: Analyze same field different ways – copy into a field with a different analyzer – boost exact-case, exact-punctuation matches – language translations, thesaurus, soundex <field name="title" type="text"/> <field name="title_exact" type="text_exact" stored="false"/> <copyField source="title" dest="title_exact"/> • Use case #2: Index multiple fields into single searchable field
  • 25. Facet Query http://solr/select?q=foo&wt=json&indent=on&facet=true&facet.field=cat&facet.query=price:[0 TO 100]&facet.query=manu:IBM {"response":{"numFound":26,"start":0,"docs":[…]}, "facet_counts":{ "facet_queries":{ "price:[0 TO 100]":6, "manu:IBM":2}, "facet_fields":{ "cat":[ "electronics",14, "memory",3, "card",2, "connector",2] }}}
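
In the response on slide 25, the counts under `facet_fields` arrive as one flat list that alternates value and count. A client usually wants pairs. The Python sketch below does that conversion on the slide's own data; `facet_pairs` is a hypothetical helper name, and the `docs` list is left empty because the slide elides it.

```python
def facet_pairs(flat):
    """Turn ["electronics", 14, "memory", 3, ...] into [("electronics", 14), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


# The facet_counts section from the slide, already parsed from JSON.
facet_counts = {
    "facet_queries": {"price:[0 TO 100]": 6, "manu:IBM": 2},
    "facet_fields": {
        "cat": ["electronics", 14, "memory", 3, "card", 2, "connector", 2],
    },
}

cat_counts = facet_pairs(facet_counts["facet_fields"]["cat"])
for value, count in cat_counts:
    print(f"{value} ({count})")
```

Keeping the result as a list of pairs, rather than a dict, preserves the order in which the server returned the values.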
  • 26. Filters • Filters are restrictions in addition to the query • Use in faceting to narrow the results • Filters are cached separately for speed 1. User queries for memory, query sent to solr is &q=memory&fq=inStock:true&facet=true&… 2. User selects 1GB memory size &q=memory&fq=inStock:true&fq=size:1GB&… 3. User selects DDR2 memory type &q=memory&fq=inStock:true&fq=size:1GB &fq=type:DDR2&…
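
The drill-down sequence on slide 26 amounts to keeping the user's query fixed and appending one `fq` parameter per selection. A minimal Python sketch of that pattern follows; `drill_down_query` is a hypothetical helper, and only the parameters shown on the slide are included.

```python
from urllib.parse import urlencode


def drill_down_query(user_query, filters):
    """Build the query string: one q, then one fq per active filter."""
    params = [("q", user_query)]
    params += [("fq", f) for f in filters]
    params.append(("facet", "true"))
    return urlencode(params)


filters = ["inStock:true"]
step1 = drill_down_query("memory", filters)

filters.append("size:1GB")   # user selects the 1GB memory size
step2 = drill_down_query("memory", filters)

filters.append("type:DDR2")  # user selects the DDR2 memory type
step3 = drill_down_query("memory", filters)

print(step1)
print(step2)
print(step3)
```

Sending each restriction as its own `fq`, instead of folding everything into `q`, is what allows each filter to be cached and reused separately, as the slide notes.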
  • 27. Highlighting http://solr/select?q=lcd&wt=json&indent=on&hl=true&hl.fl=features {"response":{"numFound":5,"start":0,"docs":[ {"id":"3007WFP", "price":899.95}, …]}, "highlighting":{ "3007WFP":{ "features":["30\" TFT active matrix <em>LCD</em>, 2560 x 1600"]}, "VA902B":{ "features":["19\" TFT active matrix <em>LCD</em>, 8ms response time, 1280 x 1024 native resolution"]}}}
  • 28. MoreLikeThis • Selects documents that are "similar" to the documents matching the main query. &q=id:6H500F0 &mlt=true&mlt.fl=name,cat,features "moreLikeThis":{ "6H500F0":{"numFound":5,"start":0, "docs": [ {"name":"Apple 60 GB iPod with Video Playback Black", "price":399.0, "inStock":true, "popularity":10, […] }, […] ] […]
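  • 29. High Availability [architecture diagram] Appservers doing dynamic HTML generation send HTTP search requests through a load balancer to a set of Solr searchers. An updater fed from the DB sends updates to the Solr master, and index replication carries the index from the master to the searchers. An admin terminal issues admin queries.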
  • 30. Resources • WWW – http://lucene.apache.org/solr – http://lucene.apache.org/solr/tutorial.html – http://wiki.apache.org/solr/ • Mailing Lists – solr-user-subscribe@lucene.apache.org – solr-dev-subscribe@lucene.apache.org