Apache Solr


This presentation introduces what Apache Solr can do and how to apply it to your project.

  1. Enterprise search with Solr
     Minh Tran
  2. Why does search matter?
     Then:
     - Most of the data encountered was created for the web
     - Heavy use of a site's search function was considered a failure in navigation
     Now:
     - Navigation is not always relevant
     - Users have less patience to browse
     - Users are used to "navigation by search box"
     Confidential
  3. What is Solr?
     - An open source enterprise search platform based on the Apache Lucene project
     - REST-like HTTP/XML and JSON APIs
     - Powerful full-text search, hit highlighting, faceted search
     - Database integration and rich document (e.g., Word, PDF) handling
     - Dynamic clustering, distributed search and index replication
     - A loose schema to define types and fields
     - Written in Java 5, deployable as a WAR
  4. Public Websites using Solr
     - A mature product powering search for public sites like Digg, CNet, Zappos, and Netflix
     - See http://wiki.apache.org/solr/PublicServers for more information
  5. Architecture
     (diagram) A Solr core sits behind an admin interface, an HTTP request servlet and an update servlet. Request handlers (standard, disjunction-max, custom) and the XML update interface feed the core, which wraps Lucene and manages update handling, caching, config, schema, analysis, concurrency and replication; results go out through the XML response writer.
  6. Starting Solr
     We need to set these properties for Solr:
     - solr.solr.home: the Solr home folder, which contains conf/solrconfig.xml
     - solr.data.dir: the folder that contains the index folder
     Alternatively, configure a JNDI lookup of java:comp/env/solr/home to point to the Solr directory.
     Example (Jetty):
     java -Dsolr.solr.home=./solr -Dsolr.data.dir=./solr/data -jar start.jar
     On other web servers, set these values as Java system properties.
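The JNDI alternative mentioned above can be sketched as an env-entry in the web application's web.xml (a hedged example; the path is a placeholder for your own Solr home):

```xml
<!-- Makes java:comp/env/solr/home resolve to the Solr home directory -->
<env-entry>
  <env-entry-name>solr/home</env-entry-name>
  <env-entry-value>/path/to/solr</env-entry-value>
  <env-entry-type>java.lang.String</env-entry-type>
</env-entry>
```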
  7. Web Admin Interface
     (screenshot not preserved)
  8. (screenshot not preserved)
  9. How Solr Sees the World
     - An index is built of one or more documents
     - A document consists of one or more fields
     - A field consists of a name, content, and metadata telling Solr how to handle the content
     - You tell Solr what kind of data a field contains by specifying its field type
  10. Field Analysis
      - Field analyzers are used both during ingestion, when a document is indexed, and at query time
      - An analyzer examines the text of fields and generates a token stream; it may be a single class or composed of a series of tokenizer and filter classes
      - Tokenizers break field data into lexical units, or tokens
      - Filters then transform the tokens, for example:
        - setting all letters to lowercase
        - eliminating punctuation and accents, mapping words to their stems, and so on
      - After lowercasing, "ram", "Ram" and "RAM" would all match a query for "ram"
  11. Schema.xml
      - The schema.xml file is located in ../solr/conf
      - The schema file starts with a <schema> tag
      - Solr supports one schema per deployment
      - The schema can be organized into three sections:
        - types
        - fields
        - other declarations
  12. Example for TextField type
      (slide image not preserved)
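The slide's screenshot is missing; below is a sketch of a TextField type wired with the filter chain described on the next slide, using Solr 1.4-era factory names (the field type name and resource file names are illustrative):

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split field data on whitespace -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- remove common words such as "a", "the" -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- handle dashes, case transitions, etc. -->
    <filter class="solr.WordDelimiterFilterFactory"/>
    <!-- lowercase all terms -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Porter stemming: runs, running, ran -> run -->
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <!-- drop duplicate tokens at the same position -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```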
  13. Filter explanation
      - StopFilterFactory: after tokenizing on whitespace, removes any common words
      - WordDelimiterFilterFactory: handles special cases with dashes, case transitions, etc.
      - LowerCaseFilterFactory: lowercases all terms
      - EnglishPorterFilterFactory: stems using the Porter stemming algorithm, e.g. "runs", "running", "ran" all reduce to the elemental root "run"
      - RemoveDuplicatesTokenFilterFactory: removes any duplicate tokens
  14. Field Attributes
      Indexed:
      - Indexed fields are searchable and sortable
      - You can also run Solr's analysis process on indexed fields, which can alter the content to improve or change results
      Stored:
      - The contents of a stored field are saved in the index
      - This is useful for retrieving and highlighting contents for display, but is not necessary for the actual search
      - For example, many applications store pointers to the location of contents rather than the actual contents of a file
  15. Field Definitions
      Field attributes: name, type, indexed, stored, multiValued, omitNorms
      Dynamic fields, in the spirit of Lucene:
      <dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
      <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
      <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
  16. Other declarations
      - <uniqueKey>url</uniqueKey>: the url field is the unique identifier used to determine whether a document is being added or updated
      - defaultSearchField: the field Solr uses in queries when no field is prefixed to a query term. For example, q=title:Solr searches the title field explicitly; if you entered q=Solr instead, the default search field would apply
  17. Indexing data
      Use curl to interact with Solr: http://curl.haxx.se/download.html
      Solr accepts several data formats:
      - Solr's native XML
      - CSV (character separated values)
      - rich documents through Solr Cell
      - JSON
      - direct database and XML import through Solr's DataImportHandler
  18. Add / Update documents
      HTTP POST to add or update:
      <add>
        <doc boost="2">
          <field name="article">05991</field>
          <field name="title">Apache Solr</field>
          <field name="subject">An intro...</field>
          <field name="category">search</field>
          <field name="category">lucene</field>
          <field name="body">Solr is a full...</field>
        </doc>
      </add>
  19. Delete documents
      Delete by id:
      <delete><id>05591</id></delete>
      Delete by query (multiple documents):
      <delete><query>manufacturer:microsoft</query></delete>
  20. Commit / Optimize
      - <commit/> tells Solr that all changes made since the last commit should be made available for searching
      - <optimize/> does the same as commit, and additionally merges all index segments, restructuring Lucene's files to improve search performance
      - Optimization is generally good to do when indexing has completed
      - If there are frequent updates, schedule optimization for low-usage times
      - An index does not need to be optimized to work properly, and optimization can be a time-consuming process
  21. Index XML documents
      Use the post.jar command line tool to POST raw XML to Solr.
      Options (defaults: -Ddata=files, -Dcommit=yes; -Durl defaults to the address shown):
      -Ddata=[files|args|stdin]
      -Durl=http://localhost:8983/solr/update
      -Dcommit=yes
      Examples:
      java -jar post.jar *.xml
      java -Ddata=args -jar post.jar "<delete><id>42</id></delete>"
      java -Ddata=stdin -jar post.jar
      java -Dcommit=no -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"
  22. Index a file using HTTP POST
      The curl command does this with --data-binary and a Content-type header reflecting the payload format.
      Example: using HTTP POST to send Solr XML over the network to the Solr server:
      curl http://localhost:9090/solr/update -H "Content-type:text/xml;charset=utf-8" --data-binary @ipod_other.xml
  23. Index CSV using remote streaming
      Streaming a local CSV file can be more efficient than sending it over the network via HTTP. Remote streaming must be enabled for this method to work.
      Set enableRemoteStreaming="true" in solrconfig.xml:
      <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048"/>
      Examples:
      java -Ddata=args -Durl=http://localhost:9090/solr/update -jar post.jar "<commit/>"
      curl http://localhost:9090/solr/update/csv -F "stream.file=d:/Study/Solr/apache-solr-1.4.1/example/exampledocs/books.csv" -F "commit=true" -F "optimize=true" -F "stream.contentType=text/plain;charset=utf-8"
      curl "http://localhost:9090/solr/update/csv?overwrite=false&stream.file=d:/Study/Solr/apache-solr-1.4.1/example/exampledocs/books.csv&commit=true&optimize=true"
  24. Index rich documents with Solr Cell
      Solr uses Apache Tika, a framework that wraps many different format parsers such as PDFBox, POI, and others.
      Examples:
      curl "http://localhost:9090/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@tutorial.html"
      curl "http://localhost:9090/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F myfile=@tutorial.html (index HTML)
      Capture <div> tags separately, then map that field to a dynamic field named foo_t:
      curl "http://localhost:9090/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div" -F tutorial=@tutorial.pdf (index PDF)
  25. Updating a Solr Index with JSON
      The JSON request handler needs to be configured in solrconfig.xml:
      <requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler"/>
      Example:
      curl "http://localhost:8983/solr/update/json?commit=true" --data-binary @books.json -H "Content-type:application/json"
  26. Searching
      - Spellcheck
      - Editorial results replacement
      - Scaling index size with distributed search
  27. Default Query Syntax
      Lucene query syntax [; sort specification]:
      - mission impossible; releaseDate desc
      - +mission +impossible -actor:cruise
      - "mission impossible" -actor:cruise
      - title:spiderman^10 description:spiderman
      - description:"spiderman movie"~10
      - +HDTV +weight:[0 TO 100]
      - wildcard queries: te?t, te*t, test*
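Queries like the above must be URL-encoded when sent over HTTP, since characters such as `+` and `:` are significant in URLs. A small illustration (the host and port are the example defaults used throughout the deck, not a requirement):

```python
from urllib.parse import urlencode

# Lucene syntax: require both terms, exclude documents whose actor field matches "cruise".
params = {"q": "+mission +impossible -actor:cruise"}
query_string = urlencode(params)  # '+' becomes %2B, ':' becomes %3A, spaces become '+'

url = "http://localhost:8983/solr/select?" + query_string
print(url)
```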
  28. Default Parameters
      Query arguments for HTTP GET/POST to /select
      (parameter table not preserved)
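The slide's parameter table did not survive conversion; the commonly used /select parameters can be sketched as follows (a reconstruction of typical Solr 1.4-era parameters, not the original table):

```python
from urllib.parse import urlencode

# Commonly used /select query arguments:
params = {
    "q": "video",         # the query string, in Lucene syntax
    "start": 0,           # offset of the first result to return (paging)
    "rows": 2,            # number of results to return (paging)
    "fl": "name,price",   # comma-separated list of fields to return
    "wt": "xml",          # response writer: xml, json, ...
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```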
  29. Search Results
      http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
      <response>
        <responseHeader>
          <status>0</status>
          <QTime>1</QTime>
        </responseHeader>
        <result numFound="16173" start="0">
          <doc>
            <str name="name">Apple 60 GB iPod with Video</str>
            <float name="price">399.0</float>
          </doc>
          <doc>
            <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
            <float name="price">479.95</float>
          </doc>
        </result>
      </response>
  30. Query response writers
      - Query responses are written using the registered writer whose name matches the 'wt' request parameter
      - The "default" writer is used if 'wt' is not specified in the request
      - Example:
        http://localhost:8983/solr/select?q=title:monsters&wt=json&indent=true
  31. Caching
      - An IndexSearcher's view of an index is fixed, so aggressive caching is possible and multi-query requests see a consistent view
      - filterCache: unordered sets of document ids matching a query
      - resultCache: ordered subsets of document ids matching a query
      - documentCache: the stored fields of documents
      - userCaches: application specific, for custom query handlers
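These caches are sized in solrconfig.xml; a sketch of the relevant entries (the size and autowarm values are illustrative, not recommendations):

```xml
<!-- unordered doc-id sets for filter queries -->
<filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
<!-- ordered doc-id windows for query results -->
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
<!-- stored fields of recently accessed documents -->
<documentCache class="solr.LRUCache" size="512" initialSize="512"/>
```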
  32. Configuring Relevancy
      <fieldtype name="text" class="solr.TextField">
        <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
          <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
          <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        </analyzer>
      </fieldtype>
  33. Faceted Browsing Example
      (screenshot not preserved)
  34. Faceted Browsing
      (diagram) A query such as "computer", sorted by price ascending, runs as Search(Query, Filter[], Sort, offset, n) and produces a DocList (an ordered section of results) from a DocSet (the unordered set of all results). Facet counts come from intersectionSize() between that DocSet and cached filter DocSets, e.g. computer_type:PC with proc_manu:Intel = 594, proc_manu:AMD = 382, price:[0 TO 500] = 247, price:[500 TO 1000] = 689, manu:Dell = 104, manu:HP = 92, manu:Lenovo = 75, memory:[1GB TO *]; the counts are attached to the query response.
  35. Index optimization
      (diagram not preserved)
  36. High Availability
      (diagram) App servers generate dynamic HTML and send HTTP search requests through a load balancer to a pool of Solr searchers. An updater pushes database updates to a Solr master, which replicates its index to the searchers; admin queries and the admin terminal go against the master.
  37. Distributed and replicated Solr architecture
      (diagram not preserved)
  38. Index by using SolrJ
      (code example not preserved)
  39. Query with SolrJ
      (code example not preserved)
  40. Distributed and replicated Solr architecture (cont.)
      - At this time, applications must still handle sending documents to the individual shards for indexing
      - The index size a machine can hold depends on its configuration (RAM, CPU, disk, and so on), the query and indexing volume, document size, and the search patterns
      - Typically a single machine can hold from several million up to around 100 million documents
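At query time, a distributed search is an ordinary /select request with a shards parameter listing the shard locations; a sketch (the host names are placeholders):

```python
from urllib.parse import urlencode

# Hypothetical shard locations: host:port/core, without a scheme.
shards = ["solr1:8983/solr", "solr2:8983/solr"]

params = {
    "shards": ",".join(shards),  # the receiving node fans the query out to each shard
    "q": "video",
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```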
  41. Advanced Functionality
      - Structured data store: import data with the DataImportHandler (JDBC, HTTP, file, URL)
      - Support for other programming languages (.NET, PHP, Ruby, Perl, Python, ...)
      - Support for NoSQL databases like MongoDB and Cassandra?
  42. Other open source search servers
      - Sphinx
      - ElasticSearch
  43. Resources
      - http://wiki.apache.org/solr/UpdateCSV
      - http://wiki.apache.org/solr/ExtractingRequestHandler
      - http://lucene.apache.org/tika/
      - http://wiki.apache.org/solr/
      - Solr 1.4 Enterprise Search Server (book)
  44. Resources (cont.)
      - http://www.ibm.com/developerworks/java/library/j-solr2/
      - http://www.ibm.com/developerworks/java/library/j-solr1/
      - http://en.wikipedia.org/wiki/Solr
      - Apache Conf Europe 2006 - Yonik Seeley
      - LucidWorks Solr Reference Guide