
Apache Solr



This presentation introduces what Apache Solr can do and how to apply it to your project.



  1. 1. Enterprise search with Solr<br />Minh Tran<br />
  2. 2. Why does search matter?<br />Then:<br />Most of the data encountered was created for the web<br />Heavy use of a site's search function was considered a failure in navigation<br />Now:<br />Navigation is not always relevant<br />Users have less patience to browse<br />Users are used to “navigation by search box”<br />Confidential<br />
  3. 3. What is Solr?<br />Open source enterprise search platform based on the Apache Lucene project.<br />REST-like HTTP/XML and JSON APIs<br />Powerful full-text search, hit highlighting, faceted search<br />Database integration and rich document (e.g., Word, PDF) handling<br />Dynamic clustering, distributed search and index replication<br />Loose schema to define types and fields<br />Written in Java 5, deployable as a WAR<br />
  4. 4. Public Websites using Solr<br />Mature product powering search for public sites like Digg, CNet, Zappos, and Netflix<br />See here for more information:<br />
  5. 5. Architecture<br />[Architecture diagram. Components shown: Admin Interface; HTTP Request Servlet; Update Servlet; Standard Request Handler; Disjunction Max Request Handler; Custom Request Handler; XML Update Interface; XML Response Writer; Solr Core; Update Handler; Caching; Config; Schema; Analysis; Concurrency; Lucene; Replication]<br />
  6. 6. Starting Solr<br />Solr needs these settings:<br />solr.solr.home: the Solr home folder, which contains conf/solrconfig.xml<br />A second setting names the folder that contains the index folder<br />Alternatively, configure a JNDI lookup of java:comp/env/solr/home to point to the Solr directory.<br />For example:<br />java -Dsolr.solr.home=./solr -jar start.jar (Jetty)<br />For other web servers, set these values as Java system properties<br />
  7. 7. Web Admin Interface<br />
  9. 9. How Solr Sees the World<br />An index is built of one or more Documents<br />A Document consists of one or more Fields<br />A Field consists of a name, content, and metadata telling Solr how to handle the content.<br />You can tell Solr about the kind of data a field contains by specifying its field type<br />
  10. 10. Field Analysis<br />Field analyzers are used both during ingestion, when a document is indexed, and at query time<br />An analyzer examines the text of fields and generates a token stream. Analyzers may be a single class or they may be composed of a series of tokenizer and filter classes.<br />Tokenizers break field data into lexical units, or tokens<br />Filters then transform the token stream. Examples:<br />Setting all letters to lowercase<br />Eliminating punctuation and accents, mapping words to their stems, and so on<br />With lowercasing, “ram”, “Ram” and “RAM” would all match a query for “ram”<br />
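The analysis chain described above can be illustrated with a small Python sketch. This is not Solr's implementation: the function names (`tokenize`, `lowercase_filter`, `strip_accents_filter`, `analyze`, `matches`) are hypothetical and only mimic the idea of one tokenizer followed by a series of filters, applied to both the indexed text and the query.

```python
import re
import unicodedata


def tokenize(text):
    """Toy tokenizer: split on runs of non-word characters, dropping punctuation."""
    return [tok for tok in re.split(r"\W+", text) if tok]


def lowercase_filter(tokens):
    """Toy filter: set all letters to lowercase."""
    return [tok.lower() for tok in tokens]


def strip_accents_filter(tokens):
    """Toy filter: decompose characters and drop the combining accent marks."""
    return [
        "".join(c for c in unicodedata.normalize("NFD", tok)
                if not unicodedata.combining(c))
        for tok in tokens
    ]


def analyze(text):
    """Run the tokenizer, then each filter in order, like an analyzer chain."""
    tokens = tokenize(text)
    for token_filter in (lowercase_filter, strip_accents_filter):
        tokens = token_filter(tokens)
    return tokens


def matches(query, field_text):
    """A query matches when every analyzed query term is among the analyzed field terms."""
    index_terms = set(analyze(field_text))
    return all(term in index_terms for term in analyze(query))
```

Because the same chain runs at index time and at query time, `matches("ram", "16 GB of RAM")` is true regardless of how the word was capitalized in either place.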
  11. 11. Schema.xml<br />The schema.xml file is located in ../solr/conf<br />The schema file starts with the <schema> tag<br />Solr supports one schema per deployment<br />The schema can be organized into three sections:<br />Types<br />Fields<br />Other declarations<br />
  12. 12. Example for TextField type<br />
  13. 13. Filter explanation<br />StopFilterFactory: Removes common words (stop words) from the tokens produced by the whitespace tokenizer.<br />WordDelimiterFilterFactory: Handles special cases with dashes, case transitions, etc.<br />LowerCaseFilterFactory: Lowercases all terms.<br />EnglishPorterFilterFactory: Stems using the Porter Stemming algorithm.<br />E.g.: “runs” and “running” are reduced to their root “run”<br />RemoveDuplicatesTokenFilterFactory: Removes any duplicate tokens.<br />
  14. 14. Field Attributes<br />Indexed:<br />Indexed Fields are searchable and sortable.<br />You can also run Solr's analysis process on indexed Fields, which can alter the content to improve or change results.<br />Stored:<br />The contents of a stored Field are saved in the index.<br />This is useful for retrieving and highlighting the contents for display but is not necessary for the actual search.<br />For example, many applications store pointers to the location of contents rather than the actual contents of a file.<br />
  15. 15. Field Definitions<br />Field Attributes: name, type, indexed, stored, multiValued, omitNorms<br />Dynamic Fields, in the spirit of Lucene!<br /><dynamicField name="*_i" type="sint" indexed="true" stored="true"/><br /><dynamicField name="*_s" type="string" indexed="true" stored="true"/><br /><dynamicField name="*_t" type="text" indexed="true" stored="true"/><br />
  16. 16. Other declarations<br /><uniqueKey>url</uniqueKey>: the url field is the unique identifier; Solr uses it to determine whether a document is being added or updated<br />defaultSearchField: the Field Solr uses in queries when no field is prefixed to a query term<br />For example, q=title:Solr searches the title field; if you entered q=Solr instead, the default search field would apply<br />
  17. 17. Indexing data<br />Use curl to interact with Solr.<br />Supported data formats:<br />Solr's native XML<br />CSV (Character Separated Value)<br />Rich documents through Solr Cell<br />JSON format<br />Direct database and XML import through Solr's DataImportHandler<br />
  18. 18. Add / Update documents<br />HTTP POST to add / update<br /><add><br /> <doc boost="2"><br /> <field name="article">05991</field><br /> <field name="title">Apache Solr</field><br /> <field name="subject">An intro...</field><br /> <field name="category">search</field><br /> <field name="category">lucene</field><br /> <field name="body">Solr is a full...</field><br /> </doc><br /></add><br />
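An add message like the one on this slide can be generated rather than typed by hand. The sketch below uses only Python's standard `xml.etree.ElementTree`; the helper name `build_add_message` is hypothetical, and fields are passed as a list of pairs (an assumption made here) so that a multi-valued field such as `category` can appear more than once.

```python
import xml.etree.ElementTree as ET


def build_add_message(fields, boost=None):
    """Build an <add><doc>...</doc></add> update message as a string.

    fields: list of (name, value) pairs; repeat a name for multi-valued fields.
    boost:  optional document boost, written as the doc's boost attribute.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if boost is not None:
        doc.set("boost", str(boost))
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > when serializing
    return ET.tostring(add, encoding="unicode")


message = build_add_message(
    [
        ("article", "05991"),
        ("title", "Apache Solr"),
        ("category", "search"),
        ("category", "lucene"),
    ],
    boost=2,
)
```

The resulting string is what would be sent in the body of the HTTP POST to the update URL, with an XML content type.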
  19. 19. Delete documents<br />Delete by Id<br /><delete><id>05591</id></delete><br />Delete by Query (multiple documents)<br /><delete><query>manufacturer:microsoft</query></delete><br />
  20. 20. Commit / Optimize<br /><commit/> tells Solr that all changes made since the last commit should be made available for searching.<br /><optimize/> does the same as commit and also merges all index segments, restructuring Lucene's files to improve search performance.<br />Optimization is generally good to do when indexing has completed<br />If there are frequent updates, you should schedule optimization for low-usage times<br />An index does not need to be optimized to work properly. Optimization can be a time-consuming process.<br />
  21. 21. Index XML documents<br />Use the post.jar command-line tool to POST raw XML to Solr<br />Options (the url and commit values shown are the defaults):<br />-Ddata=[files|args|stdin] (default: files)<br />-Durl=http://localhost:8983/solr/update<br />-Dcommit=yes<br />Example:<br />java -jar post.jar *.xml<br />java -Ddata=args -jar post.jar "<delete><id>42</id></delete>"<br />java -Ddata=stdin -jar post.jar<br />java -Dcommit=no -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"<br />
  22. 22. Index a file using HTTP POST<br />The curl command does this with --data-binary and an appropriate Content-type header reflecting that the data is XML.<br />Example: using HTTP POST to send the XML data over the network to the Solr server:<br />curl http://localhost:9090/solr/update -H "Content-type:text/xml;charset=utf-8" --data-binary @ipod_other.xml<br />
  23. 23. Index CSV using remote streaming<br />Having Solr read a CSV file that is local to the server can be more efficient than sending it over the network via HTTP. Remote streaming must be enabled for this method to work.<br />In solrconfig.xml, change enableRemoteStreaming from "false" to "true":<br /><requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048"/><br />Example:<br />java -Ddata=args -Durl=http://localhost:9090/solr/update -jar post.jar "<commit/>"<br />curl http://localhost:9090/solr/update/csv -F "stream.file=d:/Study/Solr/apache-solr-1.4.1/example/exampledocs/books.csv" -F "commit=true" -F "optimize=true" -F "stream.contentType=text/plain;charset=utf-8"<br />curl "http://localhost:9090/solr/update/csv?overwrite=false&stream.file=d:/Study/Solr/apache-solr-1.4.1/example/exampledocs/books.csv&commit=true&optimize=true"<br />
  24. 24. Index rich documents with Solr Cell<br />Solr Cell uses Apache Tika, a framework that wraps many different format parsers such as PDFBox, POI, and others<br />Example:<br />curl "http://localhost:9090/solr/update/extract?" -F myfile=@tutorial.html (index html)<br />Capture <div> tags separately, and then map that field to a dynamic field named foo_t:<br />curl "http://localhost:9090/solr/update/extract?" -F tutorial=@tutorial.pdf (index pdf)<br />
  25. 25. Updating a Solr Index with JSON<br />The JSON request handler needs to be configured in solrconfig.xml<br /><requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler"/><br />Example:<br />curl "http://localhost:8983/solr/update/json?commit=true" --data-binary @books.json -H "Content-type:application/json"<br />
  26. 26. Searching<br />Spellcheck<br />Editorial results replacement<br />Scaling index size with distributed search<br />
  27. 27. Default Query Syntax<br />Lucene Query Syntax [; sort specification]<br />mission impossible; releaseDate desc<br />+mission +impossible -actor:cruise<br />"mission impossible" -actor:cruise<br />title:spiderman^10 description:spiderman<br />description:"spiderman movie"~10<br />+HDTV +weight:[0 TO 100]<br />Wildcard queries: te?t, te*t, test*<br />
  28. 28. Default Parameters<br />Query Arguments for HTTP GET/POST to /select<br />
  29. 29. Search Results<br />http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price<br /><response><responseHeader><status>0</status><br /> <QTime>1</QTime></responseHeader><br /> <result numFound="16173" start="0"><br /> <doc> <br /> <str name="name">Apple 60 GB iPod with Video</str><br /> <float name="price">399.0</float> <br /> </doc> <br /> <doc> <br /> <str name="name">ASUS Extreme N7800GTX/2DHTV</str><br /> <float name="price">479.95</float><br /> </doc><br /> </result><br /></response><br />
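A client can assemble the select URL and read the XML response with the Python standard library alone. The following is a minimal sketch: `build_select_url` and `parse_results` are hypothetical helper names, the parameters mirror the `q`, `start`, `rows` and `fl` arguments in the URL above, and field values are left as strings rather than converted according to the `<str>` / `<float>` element types.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_select_url(base_url, query, start=0, rows=10, fields=None):
    """Build a /select URL; base_url is e.g. http://localhost:8983/solr."""
    params = {"q": query, "start": start, "rows": rows}
    if fields:
        params["fl"] = ",".join(fields)  # comma-separated field list
    return base_url + "/select?" + urlencode(params)


def parse_results(xml_text):
    """Return (numFound, docs) from an XML response shaped like the one above.

    Each doc becomes a dict mapping the field's name attribute to its text.
    """
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = [
        {child.get("name"): child.text for child in doc}
        for doc in result.findall("doc")
    ]
    return int(result.get("numFound")), docs
```

Note that `urlencode` percent-encodes the comma in the field list (`fl=name%2Cprice`), which is equivalent to the literal comma in the URL on the slide.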
  30. 30. Query response writers<br />Query responses are written by the registered response writer whose name matches the 'wt' request parameter.<br />The writer named "default" is used if 'wt' is not specified in the request<br />E.g.:<br />http://localhost:8983/solr/select?q=title:monsters&wt=json&indent=true<br />
  31. 31. Caching<br />IndexSearcher’s view of an index is fixed<br />Aggressive caching possible<br />Consistency for multi-query requests<br />filterCache – unordered set of document ids matching a query<br />resultCache – ordered subset of document ids matching a query<br />documentCache – the stored fields of documents<br />userCaches – application specific, custom query handlers<br />
  32. 32. Configuring Relevancy<br /><fieldtype name="text" class="solr.TextField"><br /> <analyzer><br /> <tokenizer class="solr.WhitespaceTokenizerFactory"/><br /> <filter class="solr.LowerCaseFilterFactory"/><br /> <filter class="solr.SynonymFilterFactory"<br /> synonyms="synonyms.txt"/><br /> <filter class="solr.StopFilterFactory"<br /> words="stopwords.txt"/><br /> <filter class="solr.EnglishPorterFilterFactory" <br /> protected="protwords.txt"/><br /> </analyzer><br /></fieldtype><br />
  33. 33. Faceted Browsing Example<br />
  34. 34. Faceted Browsing<br />[Diagram: the query "computer", sorted by "price asc", with the filters computer_type:PC and memory:[1GB TO *], goes through Search(Query, Filter[], Sort, offset, n). The search yields a DocList (section of ordered results) and a DocSet (unordered set of all results). intersectionSize() of the DocSet with each facet gives the counts: proc_manu:Intel = 594, proc_manu:AMD = 382, price:[0 TO 500] = 247, price:[500 TO 1000] = 689, manu:Dell = 104, manu:HP = 92, manu:Lenovo = 75. The DocList and the counts form the Query Response.]<br />
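The counting step in this diagram can be sketched with plain Python sets. This is an illustration of the idea only, not Solr's code: document ids are small integers, the `search` and `facet_counts` functions are hypothetical stand-ins for Search(Query, Filter[], Sort, offset, n) and the intersection-size step, and the toy data is made up for the example.

```python
def search(matching_ids, filters, sort_key, offset, n):
    """Return (doc_list, doc_set) for a query.

    doc_set:  unordered set of all results (query matches narrowed by every filter).
    doc_list: the requested section of the ordered results.
    """
    doc_set = set(matching_ids)
    for filter_ids in filters:
        doc_set &= filter_ids  # each filter is itself a set of document ids
    doc_list = sorted(doc_set, key=sort_key)[offset:offset + n]
    return doc_list, doc_set


def facet_counts(doc_set, facet_doc_sets):
    """Count, per facet value, how many results fall into it (intersection size)."""
    return {facet: len(doc_set & ids) for facet, ids in facet_doc_sets.items()}


# Toy data (made up): document id -> price, plus id sets per filter and facet.
prices = {1: 300, 2: 450, 3: 700, 4: 900, 5: 250}
computer_matches = {1, 2, 3, 4, 5}
pc_filter = {1, 2, 3, 4}
facets = {"manu:Dell": {1, 3, 5}, "manu:HP": {2}, "manu:Lenovo": {4, 5}}

doc_list, doc_set = search(computer_matches, [pc_filter], prices.get, offset=0, n=2)
counts = facet_counts(doc_set, facets)
```

Because each filter and facet is a reusable set of ids, the same sets can serve many different queries, which is what makes caching them worthwhile.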
  35. 35. Index optimization<br />
  36. 36. High Availability<br />[Diagram. Components shown: HTTP search requests; Appservers (Dynamic HTML Generation); Load Balancer; Solr Searchers; Index Replication; Solr Master; Updater; DB; updates; admin terminal; admin queries]<br />
  37. 37. Distributed and replicated Solr architecture<br />
  38. 38. Index by using SolrJ<br />
  39. 39. Query with SolrJ<br />
  40. 40. Distributed and replicated Solr architecture (cont.)<br />At this time, applications must still handle the process of sending the documents to individual shards for indexing<br />The size of an index that a machine can hold depends on the machine's configuration (RAM, CPU, disk, and so on), the query and indexing volume, document size, and the search patterns<br />Typically the number of documents a single machine can hold is in the range of several million up to around 100 million documents.<br />
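Since the application has to decide which shard receives each document, one common approach is to hash the document's unique key. The sketch below is one possible scheme, not something prescribed by the slides: `shard_for` and `group_by_shard` are hypothetical helpers, the shard URLs are placeholders, and the key field name `id` is an assumption.

```python
import hashlib


def shard_for(unique_key, shard_urls):
    """Pick a shard for a document by hashing its unique key.

    A stable hash (not Python's per-process hash()) keeps the mapping identical
    across runs, so an update to a document reaches the shard that already holds it.
    """
    digest = hashlib.sha256(unique_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shard_urls)
    return shard_urls[index]


def group_by_shard(docs, shard_urls, key_field="id"):
    """Group documents into one batch per shard, ready to be posted separately."""
    batches = {url: [] for url in shard_urls}
    for doc in docs:
        batches[shard_for(doc[key_field], shard_urls)].append(doc)
    return batches


# Placeholder shard URLs, for illustration only.
SHARDS = ["http://shard1:8983/solr", "http://shard2:8983/solr", "http://shard3:8983/solr"]
batches = group_by_shard([{"id": "05991"}, {"id": "42"}, {"id": "books-7"}], SHARDS)
```

A simple modulo scheme like this remaps most keys when the number of shards changes, so adding a shard generally means reindexing.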
  41. 41. Advanced Functionality<br />Import from structured data stores with the Data Import Handler (JDBC, HTTP, File, URL)<br />Support for other programming languages (.Net, PHP, Ruby, Perl, Python,…)<br />Support for NoSQL databases like MongoDB, Cassandra?<br />
  42. 42. Other open source search servers<br />Sphinx<br />Elasticsearch<br />