Add Powerful Full Text Search to Your Web App with Solr
Speaker: Yonik Seeley

Presentation Transcript

  • Powerful Full-Text Search with Solr
    Yonik Seeley, yonik@apache.org
    Web 2.0 Expo, Berlin, 8 November 2007
    Download at http://www.apache.org/~yonik
  • What is Lucene
    • High-performance, scalable, full-text search library
    • Focus: indexing + searching documents
      – A “document” is just a list of name+value pairs
    • No crawlers or document parsing
    • Flexible text analysis (tokenizers + token filters)
    • 100% Java, no dependencies, no config files
  • What is Solr
    • A full-text search server based on Lucene
    • XML/HTTP, JSON interfaces
    • Faceted search (category counting)
    • Flexible data schema to define types and fields
    • Hit highlighting
    • Configurable advanced caching
    • Index replication
    • Extensible open architecture, plugins
    • Web administration interface
    • Written in Java 5, deployable as a WAR
  • Basic App
    [Diagram: an indexer posts documents (example fields: super_name: Mr. Fantastic, name: Reed Richards, category: superhero, powers: elasticity) to http://solr/update. A webapp sends a query (powers:agility) to http://solr/select, receives the matching docs as the query response, and renders HTML. Inside the servlet container, Solr exposes admin, update, and select endpoints; update is served by the XML Update Handler and CSV Update Handler, select by the standard request handler or a custom request handler, with XML and JSON response writers. Lucene sits underneath.]
  • Indexing Data
    HTTP POST to http://localhost:8983/solr/update
    <add><doc>
      <field name="id">05991</field>
      <field name="name">Peter Parker</field>
      <field name="supername">Spider-Man</field>
      <field name="category">superhero</field>
      <field name="powers">agility</field>
      <field name="powers">spider-sense</field>
    </doc></add>
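As a rough sketch of how a client might assemble the `<add>` message shown on this slide, the following Python builds the XML with the standard library only. The helper name `build_add_xml` is made up for illustration, and the HTTP POST itself is deliberately left out.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Build an <add><doc>...</doc></add> message from a dict.

    List values produce one <field> element per value, which is how the
    slide represents the multi-valued "powers" field.
    build_add_xml is a hypothetical helper, not part of any Solr client.
    """
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(v)
    return ET.tostring(add, encoding="unicode")


xml_body = build_add_xml({
    "id": "05991",
    "name": "Peter Parker",
    "supername": "Spider-Man",
    "category": "superhero",
    "powers": ["agility", "spider-sense"],
})
print(xml_body)
```

The resulting string is what would go in the body of the POST to the update URL; using an XML library rather than string concatenation means special characters in field values get escaped.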
  • Indexing CSV data
    Iron Man, Tony Stark, superhero, powered armor | flight
    Sandman, William Baker|Flint Marko, supervillain, sand transform
    Wolverine,James Howlett|Logan, superhero, healing|adamantium
    Magneto, Erik Lehnsherr, supervillain, magnetism|electricity
    http://localhost:8983/solr/update/csv?
      fieldnames=supername,name,category,powers
      &separator=,
      &f.name.split=true&f.name.separator=|
      &f.powers.split=true&f.powers.separator=|
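To make the effect of the `f.<field>.split` and `f.<field>.separator` parameters concrete, here is a small Python illustration of the document one CSV row would turn into. It only mimics the idea and is not Solr's CSV loader; trimming whitespace around values is an assumption made here to keep the output tidy, and `parse_row` is a hypothetical name.

```python
# Mirrors fieldnames=supername,name,category,powers from the slide.
FIELDNAMES = ["supername", "name", "category", "powers"]

# Mirrors f.name.split / f.powers.split with "|" as the sub-separator.
SPLIT_FIELDS = {"name": "|", "powers": "|"}


def parse_row(line, separator=","):
    """Turn one CSV line into a dict; split fields become lists of values."""
    values = [v.strip() for v in line.split(separator)]  # trimming is an assumption
    doc = {}
    for field, value in zip(FIELDNAMES, values):
        if field in SPLIT_FIELDS:
            doc[field] = [part.strip() for part in value.split(SPLIT_FIELDS[field])]
        else:
            doc[field] = value
    return doc


doc = parse_row("Wolverine,James Howlett|Logan, superhero, healing|adamantium")
print(doc)
```

The split fields come out as lists, which corresponds to sending several `<field>` elements with the same name in the XML update format.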
  • Data upload methods
    URL=http://localhost:8983/solr/update/csv
    • HTTP POST body (curl, HttpClient, etc.)
      curl $URL -H 'Content-type:text/plain; charset=utf-8' --data-binary @info.csv
    • Multi-part file upload (browsers)
    • Request parameter
      ?stream.body='Cyclops, Scott Summers,…'
    • Streaming from URL (must enable)
      ?stream.url=file://data/info.csv
  • Indexing with SolrJ
    // Solr's Java Client API… remote or embedded/local!
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("supername","Daredevil");
    doc.addField("name","Matt Murdock");
    doc.addField("category","superhero");
    server.add(doc);
    server.commit();
  • Deleting Documents
    • Delete by Id, most efficient
      <delete>
        <id>05591</id>
        <id>32552</id>
      </delete>
    • Delete by Query
      <delete>
        <query>category:supervillain</query>
      </delete>
  • Commit
    • <commit/> makes changes visible
      – Triggers static cache warming configured in solrconfig.xml
      – Triggers autowarming from existing caches
    • <optimize/> does the same as commit and also merges all index segments for faster searching
    [Figure: Lucene index segments, listing files _0.fnm, _0.fdt, _0.fdx, _0.frq, _0.tis, _0.tii, _0.prx, _0.nrm, _0_1.del, _1.fnm, _1.fdt, _1.fdx, […]]
  • Searching
    http://localhost:8983/solr/select?q=powers:agility
      &start=0&rows=2&fl=supername,category
    <response>
      <result numFound="427" start="0">
        <doc>
          <str name="supername">Spider-Man</str>
          <str name="category">superhero</str>
        </doc>
        <doc>
          <str name="supername">Mystique</str>
          <str name="category">supervillain</str>
        </doc>
      </result>
    </response>
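A minimal sketch of reading such an XML response on the client side, assuming the simplified response shape shown on this slide (a real response carries more elements, such as a header). The sample text is copied from the slide and `parse_docs` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Simplified response, as shown on the slide.
SAMPLE_RESPONSE = """
<response>
  <result numFound="427" start="0">
    <doc>
      <str name="supername">Spider-Man</str>
      <str name="category">superhero</str>
    </doc>
    <doc>
      <str name="supername">Mystique</str>
      <str name="category">supervillain</str>
    </doc>
  </result>
</response>
"""


def parse_docs(xml_text):
    """Return (numFound, list of {field name: text}) from a select response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = [{field.get("name"): field.text for field in doc}
            for doc in result.findall("doc")]
    return int(result.get("numFound")), docs


num_found, docs = parse_docs(SAMPLE_RESPONSE)
print(num_found, docs)
```

Note that `numFound` is the total number of matches, while the number of returned docs is capped by the `rows` parameter.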
  • Response Format
    • Add &wt=json for JSON formatted response
      {"response": {"numFound":427, "start":0,
         "docs": [
           {"supername":"Spider-Man", "category":"superhero"},
           {"supername":"Mystique", "category":"supervillain"}
         ]
      }}
    • Also Python, Ruby, PHP, SerializedPHP, XSLT
  • Scoring
    • Query results are sorted by score descending
    • VSM – Vector Space Model
    • tf – term frequency: number of matching terms in field
    • lengthNorm – number of tokens in field
    • idf – inverse document frequency
    • coord – coordination factor, number of matching terms
    • document boost
    • query clause boost
    http://lucene.apache.org/java/docs/scoring.html
  • Explain
    http://solr/select?q=super fast&indent=on&debugQuery=on
    <lst name="debug">
     <lst name="explain">
      <str name="id=Flash,internal_docid=6">
    0.16389132 = (MATCH) product of:
      0.32778263 = (MATCH) sum of:
        0.32778263 = (MATCH) weight(text:fast in 6), product of:
          0.5012072 = queryWeight(text:fast), product of:
            2.466337 = idf(docFreq=5)
            0.20321926 = queryNorm
          0.65398633 = (MATCH) fieldWeight(text:fast in 6), product of:
            1.4142135 = tf(termFreq(text:fast)=2)
            2.466337 = idf(docFreq=5)
            0.1875 = fieldNorm(field=text, doc=6)
      0.5 = coord(1/2)
      </str>
      <str name="id=Superman,internal_docid=7">
    0.1365761 = (MATCH) product of:
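The explain tree for the first document can be checked by hand. The sketch below simply re-multiplies the numbers from the slide; the only thing computed from scratch is tf, taken here as the square root of the term frequency, which agrees with the 1.4142135 shown for a frequency of 2.

```python
import math

# Values copied from the explain output for id=Flash.
term_freq = 2
idf = 2.466337
query_norm = 0.20321926
field_norm = 0.1875
coord = 1 / 2                 # 1 of the 2 query terms matched

tf = math.sqrt(term_freq)     # 1.4142135... as in the explain output

query_weight = idf * query_norm            # queryWeight(text:fast)
field_weight = tf * idf * field_norm       # fieldWeight(text:fast in 6)
term_score = query_weight * field_weight   # weight(text:fast in 6)
score = coord * term_score                 # final score for the document

print(query_weight, field_weight, term_score, score)
```

Each intermediate value lines up with one node of the tree, to within rounding: the query-side weight, the document-side weight, their product, and the coord-scaled total.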
  • Lucene Query Syntax
    1. justice league
       • Equiv: justice OR league
       • QueryParser default operator is “OR”/optional
    2. +justice +league -name:aquaman
       • Equiv: justice AND league NOT name:aquaman
    3. "justice league" -name:aquaman
    4. title:spiderman^10 description:spiderman
    5. description:"spiderman movie"~100
  • Lucene Query Examples 2
    1. releaseDate:[2000 TO 2007]
    2. Wildcard searches: sup?r, su*r, super*
    3. spider~
       • Fuzzy search: Levenshtein distance
       • Optional minimum similarity: spider~0.7
    4. *:*
    5. (Superman AND "Lex Luthor") OR (+Batman +Joker)
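Fuzzy search is based on Levenshtein (edit) distance, the minimum number of single-character insertions, deletions, and substitutions needed to turn one term into another. The following is a textbook dynamic-programming version for illustration; it is not Lucene's implementation, and how the distance is converted into the similarity threshold used by `spider~0.7` is not covered here.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


for candidate in ["spyder", "spiders", "slider", "superman"]:
    print(candidate, levenshtein("spider", candidate))
```

Terms with a small distance to the query term are the ones a fuzzy query is meant to catch.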
  • DisMax Query Syntax
    • Good for handling raw user queries
      – Balanced quotes for phrase query
      – ‘+’ for required, ‘-’ for prohibited
      – Separates query terms from query structure
    http://solr/select?qt=dismax
      &q=super man                // the user query
      &qf=title^3 subject^2 body  // fields to query
      &pf=title^2,body            // fields to do phrase queries
      &ps=100                     // slop for those phrase q's
      &tie=.1                     // multi-field match reward
      &mm=2                       // # of terms that should match
      &bf=popularity              // boost function
  • DisMax Query Form
    • The expanded Lucene Query:
      +( DisjunctionMaxQuery( title:super^3 | subject:super^2 | body:super)
         DisjunctionMaxQuery( title:man^3 | subject:man^2 | body:man)
       )
      DisjunctionMaxQuery(title:"super man"~100^2 body:"super man"~100)
      FunctionQuery(popularity)
    • Tip: set up your own request handler with default parameters so that clients do not have to specify them
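The role of `tie` in a DisjunctionMaxQuery can be shown with a few lines of arithmetic: the score for one term is the best per-field score plus `tie` times the sum of the other field scores. The per-field scores below are invented numbers purely for illustration, and `dismax_score` is a hypothetical helper.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores for one term: max plus tie * (sum of the rest)."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Hypothetical scores of the term "super" in title, subject, and body.
scores = [0.9, 0.3, 0.1]

print(dismax_score(scores, tie=0.0))  # pure max: only the best field counts
print(dismax_score(scores, tie=0.1))  # small reward for matching in more fields
print(dismax_score(scores, tie=1.0))  # plain sum of all field scores
```

With `tie=0` a document is judged only by its best-matching field; raising `tie` toward 1 moves the behavior toward summing all fields.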
  • Function Query
    • Allows adding function of field value to score
      – Boost recently added or popular documents
    • Current parser only supports function notation
    • Example: log(sum(popularity,1))
    • sum, product, div, log, sqrt, abs, pow
    • scale(x, target_min, target_max)
      – calculates min & max of x across all docs
    • map(x, min, max, target)
      – useful for dealing with defaults
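As an illustration of what `scale` and `map` compute, here is a plain Python rendering of the two functions as described on the slide. It operates on an in-memory list rather than an index, the names `scale` and `map_range` are chosen for this sketch, and returning `target_min` when all values are equal is an assumption.

```python
def scale(values, target_min, target_max):
    """Linearly rescale values so that min(values) -> target_min and
    max(values) -> target_max, using the min and max across all values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption for this sketch: a constant input maps to target_min.
        return [float(target_min) for _ in values]
    span = hi - lo
    return [target_min + (v - lo) * (target_max - target_min) / span
            for v in values]


def map_range(x, lo, hi, target):
    """Replace x with target when lo <= x <= hi; otherwise leave x unchanged."""
    return target if lo <= x <= hi else x


popularity = [0, 5, 10]
print(scale(popularity, 1, 2))
# Treat a default popularity of 0 as 1 before using it in a boost.
print([map_range(p, 0, 0, 1) for p in popularity])
```

The `map` case shows why it is handy for defaults: documents with a missing or zero value can be given a neutral value instead of dragging the function result down.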
  • Boosted Query
    • Score is multiplied instead of added
      – New local params <!...> syntax added
        &q=<!boost b=sqrt(popularity)>super man
    • Parameter dereferencing in local params
      &q=<!boost b=$boost v=$userq>
      &boost=sqrt(popularity)
      &userq=super man
  • Analysis & Search Relevancy
    Document indexing analysis: LexCorp BFG-9000
      → WhitespaceTokenizer → LexCorp | BFG-9000
      → WordDelimiterFilter catenateWords=1 → Lex | Corp | BFG | 9000 | LexCorp
      → LowercaseFilter → lex | corp | bfg | 9000 | lexcorp
    Query analysis: Lex corp bfg9000
      → WhitespaceTokenizer → Lex | corp | bfg9000
      → WordDelimiterFilter catenateWords=0 → Lex | corp | bfg | 9000
      → LowercaseFilter → lex | corp | bfg | 9000
    A Match!
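The two analysis chains on this slide can be imitated in a few lines of Python to see why the query matches. This is a much-simplified stand-in for the real tokenizer and filters (the actual WordDelimiterFilter has more options and rules); all function names here are hypothetical.

```python
import re


def split_word_parts(token):
    """Split on punctuation, lower-to-upper case changes, and letter/digit boundaries."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", token)


def word_delimiter(token, catenate_words):
    """Return the sub-word parts, plus joined runs of word parts if requested."""
    parts = split_word_parts(token)
    out = list(parts)
    if catenate_words:
        run = []
        for part in parts + [""]:        # the empty sentinel flushes the last run
            if part.isalpha():
                run.append(part)
            else:
                if len(run) > 1:
                    out.append("".join(run))
                run = []
    return out


def analyze(text, catenate_words):
    """Whitespace tokenizer -> word delimiter -> lowercase."""
    tokens = []
    for raw in text.split():
        tokens.extend(word_delimiter(raw, catenate_words))
    return [t.lower() for t in tokens]


index_tokens = analyze("LexCorp BFG-9000", catenate_words=True)
query_tokens = analyze("Lex corp bfg9000", catenate_words=False)
is_match = set(query_tokens) <= set(index_tokens)
print(index_tokens, query_tokens, is_match)
```

Because both sides are reduced to the same lowercase sub-words, every query token is found among the indexed tokens even though the original strings differ in case, spacing, and punctuation.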
  • Configuring Relevancy
    <fieldType name="text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      </analyzer>
    </fieldType>
  • Field Definitions
    • Field Attributes: name, type, indexed, stored, multiValued, omitNorms, termVectors
      <field name="id" type="string" indexed="true" stored="true"/>
      <field name="sku" type="textTight" indexed="true" stored="true"/>
      <field name="name" type="text" indexed="true" stored="true"/>
      <field name="inStock" type="boolean" indexed="true" stored="false"/>
      <field name="price" type="sfloat" indexed="true" stored="false"/>
      <field name="category" type="text_ws" indexed="true" stored="true" multiValued="true"/>
    • Dynamic Fields
      <dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
      <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
      <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
  • copyField
    • Copies one field to another at index time
    • Use case #1: Analyze same field different ways
      – copy into a field with a different analyzer
      – boost exact-case, exact-punctuation matches
      – language translations, thesaurus, soundex
      <field name="title" type="text"/>
      <field name="title_exact" type="text_exact" stored="false"/>
      <copyField source="title" dest="title_exact"/>
    • Use case #2: Index multiple fields into single searchable field
  • Facet Query
    http://solr/select?q=foo&wt=json&indent=on
      &facet=true&facet.field=cat
      &facet.query=price:[0 TO 100]
      &facet.query=manu:IBM
    {"response":{"numFound":26,"start":0,"docs":[…]},
     "facet_counts":{
      "facet_queries":{
        "price:[0 TO 100]":6,
        "manu:IBM":2},
      "facet_fields":{
        "cat":[ "electronics",14, "memory",3, "card",2, "connector",2]
     }}}
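In the JSON response on this slide, each `facet_fields` entry is a flat list that alternates value and count. A client typically pairs them up; the sketch below does that for the sample data from the slide, with the elided `docs` array replaced by an empty list. `facet_field_counts` is a hypothetical helper.

```python
import json

# Sample from the slide; the elided docs array is left empty here.
SAMPLE = """
{"response": {"numFound": 26, "start": 0, "docs": []},
 "facet_counts": {
   "facet_queries": {"price:[0 TO 100]": 6, "manu:IBM": 2},
   "facet_fields": {
     "cat": ["electronics", 14, "memory", 3, "card", 2, "connector", 2]
   }}}
"""


def facet_field_counts(response, field):
    """Pair up the flat [value, count, value, count, ...] list for one field."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


data = json.loads(SAMPLE)
cats = facet_field_counts(data, "cat")
for value, count in cats:
    print(f"{value} ({count})")
```

Keeping the result as a list of pairs preserves the order in which the counts were returned, which a plain dictionary lookup would hide.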
  • Filters
    • Filters are restrictions in addition to the query
    • Use in faceting to narrow the results
    • Filters are cached separately for speed
    1. User queries for memory, query sent to Solr is
       &q=memory&fq=inStock:true&facet=true&…
    2. User selects 1GB memory size
       &q=memory&fq=inStock:true&fq=size:1GB&…
    3. User selects DDR2 memory type
       &q=memory&fq=inStock:true&fq=size:1GB&fq=type:DDR2&…
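The drill-down sequence on this slide amounts to appending one more `fq` parameter per user selection while `q` stays the same. A minimal sketch of building those URLs in Python follows; the host `http://solr` is the placeholder used on the slides and `search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def search_url(query, filters, base="http://solr/select"):
    """Build a select URL with one fq parameter per active filter."""
    params = [("q", query)]
    params += [("fq", f) for f in filters]
    params.append(("facet", "true"))
    return base + "?" + urlencode(params)


filters = ["inStock:true"]
step1 = search_url("memory", filters)      # initial user query

filters.append("size:1GB")
step2 = search_url("memory", filters)      # user selects 1GB memory size

filters.append("type:DDR2")
step3 = search_url("memory", filters)      # user selects DDR2 memory type

for url in (step1, step2, step3):
    print(url)
```

Sending each restriction as its own `fq` rather than folding everything into `q` lets each filter be cached and reused independently, as the slide notes.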
  • Highlighting
    http://solr/select?q=lcd&wt=json&indent=on
      &hl=true&hl.fl=features
    {"response":{"numFound":5,"start":0,"docs":[
       {"id":"3007WFP", "price":899.95}, …]},
     "highlighting":{
       "3007WFP":{ "features":["30\" TFT active matrix <em>LCD</em>, 2560 x 1600"]},
       "VA902B":{ "features":["19\" TFT active matrix <em>LCD</em>, 8ms response time, 1280 x 1024 native resolution"]}}}
  • MoreLikeThis
    • Selects documents that are “similar” to the documents matching the main query.
    &q=id:6H500F0
    &mlt=true&mlt.fl=name,cat,features
    "moreLikeThis":{
      "6H500F0":{"numFound":5,"start":0,
        "docs": [
          {"name":"Apple 60 GB iPod with Video Playback Black",
           "price":399.0, "inStock":true, "popularity":10, […] },
          […]
        ]
    […]
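  • High Availability
    [Diagram: appservers (dynamic HTML generation) send HTTP search requests through a load balancer to the Solr searchers. A DB updater sends updates to the Solr master, which also receives admin queries from an admin terminal. The searchers receive the index from the master via index replication.]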
  • Resources
    • WWW
      – http://lucene.apache.org/solr
      – http://lucene.apache.org/solr/tutorial.html
      – http://wiki.apache.org/solr/
    • Mailing Lists
      – solr-user-subscribe@lucene.apache.org
      – solr-dev-subscribe@lucene.apache.org