Enterprise search with Solr
Minh Tran
Why does search matter?
Then:
- Most of the data encountered was created for the web
- Heavy use of a site's search function was considered a failure in navigation
Now:
- Navigation is not always relevant
- Users have less patience to browse
- Users are used to "navigation by search box"
Confidential
What is Solr?
- Open source enterprise search platform based on the Apache Lucene project
- REST-like HTTP/XML and JSON APIs
- Powerful full-text search, hit highlighting, faceted search
- Database integration and rich document (e.g., Word, PDF) handling
- Dynamic clustering, distributed search, and index replication
- Loose schema to define types and fields
- Written in Java 5, deployable as a WAR
Public Websites using Solr
A mature product powering search for public sites like Digg, CNet, Zappos, and Netflix.
See http://wiki.apache.org/solr/PublicServers for more information.
Architecture
[Diagram: Solr Core, exposing an Admin Interface, an HTTP Request Servlet (with Standard, DisjunctionMax, and Custom Request Handlers and an XML Response Writer), and an Update Servlet (XML Update Interface, Update Handler); internal concerns include Caching, Config, Schema, Analysis, Concurrency, and Replication, all built on Lucene.]
Starting Solr
Set these system properties for Solr:
- solr.solr.home: the Solr home folder, which contains conf/solrconfig.xml
- solr.data.dir: the folder that contains the index folder
Alternatively, configure a JNDI lookup of java:comp/env/solr/home to point to the solr directory.
For example (Jetty):
java -Dsolr.solr.home=./solr -Dsolr.data.dir=./solr/data -jar start.jar
For other web servers, set these values as Java system properties.
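The JNDI alternative can be sketched as follows, assuming the Jetty 6 bundled with this Solr release; the path is illustrative:

```xml
<!-- WEB-INF/jetty-env.xml sketch: binds java:comp/env/solr/home for the Solr webapp -->
<Configure class="org.mortbay.jetty.webapp.WebAppContext">
  <New id="solrHome" class="org.mortbay.jetty.plus.naming.EnvEntry">
    <Arg>solr/home</Arg>
    <Arg type="java.lang.String">/opt/solr</Arg>
    <Arg type="boolean">true</Arg>
  </New>
</Configure>
```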
Web Admin Interface
How Solr Sees the World
- An index is built of one or more Documents
- A Document consists of one or more Fields
- A Field consists of a name, content, and metadata telling Solr how to handle the content
- You tell Solr what kind of data a field contains by specifying its field type
Field Analysis
- Field analyzers are used both during ingestion, when a document is indexed, and at query time
- An analyzer examines the text of fields and generates a token stream; analyzers may be a single class, or composed of a series of tokenizer and filter classes
- Tokenizers break field data into lexical units, or tokens
- Filters then transform the tokens: setting all letters to lowercase, eliminating punctuation and accents, mapping words to their stems, and so on
- With lowercasing, "ram", "Ram", and "RAM" would all match a query for "ram"
Schema.xml
- The schema.xml file is located in ../solr/conf
- The schema file starts with a <schema> tag
- Solr supports one schema per deployment
- The schema can be organized into three sections: Types, Fields, Other declarations
Example for TextField type
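The example on this slide was shown as an image. A text fieldType combining the filters explained on the next slide would look something like this sketch (attribute values are the conventional Solr 1.4 ones, not taken from the slide):

```xml
<!-- schema.xml: a TextField type whose analyzer chains the filters described below -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```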
Filter explanation
- StopFilterFactory: after the tokenizer splits on whitespace, removes any common (stop) words
- WordDelimiterFilterFactory: handles special cases with dashes, case transitions, etc.
- LowerCaseFilterFactory: lowercases all terms
- EnglishPorterFilterFactory: stems using the Porter stemming algorithm; e.g., "runs", "running", "ran" all reduce to the elemental root "run"
- RemoveDuplicatesTokenFilterFactory: removes any duplicate tokens
Field Attributes
Indexed:
- Indexed fields are searchable and sortable
- You can also run Solr's analysis process on indexed fields, which can alter the content to improve or change results
Stored:
- The contents of a stored field are saved in the index
- This is useful for retrieving and highlighting the contents for display, but is not necessary for the actual search
- For example, many applications store pointers to the location of contents rather than the actual contents of a file
Field Definitions
Field attributes: name, type, indexed, stored, multiValued, omitNorms
Dynamic fields, in the spirit of Lucene!
<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
Other declarations
- <uniqueKey>url</uniqueKey>: the url field is the unique identifier, used to determine whether a document is being added or updated
- defaultSearchField: the field Solr uses in queries when no field is prefixed to a query term. E.g., q=title:Solr searches the title field explicitly; if you entered q=Solr instead, the default search field would apply
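In schema.xml these declarations look like the following sketch (the choice of title as the default is illustrative):

```xml
<!-- schema.xml: unique key and default search field declarations -->
<uniqueKey>url</uniqueKey>
<defaultSearchField>title</defaultSearchField>
```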
Indexing data
Use curl to interact with Solr: http://curl.haxx.se/download.html
Solr accepts several data formats:
- Solr's native XML
- CSV (Character Separated Values)
- Rich documents through Solr Cell
- JSON
- Direct database and XML import through Solr's DataImportHandler
Add / Update documents
HTTP POST to add / update:
<add>
  <doc boost="2">
    <field name="article">05991</field>
    <field name="title">Apache Solr</field>
    <field name="subject">An intro...</field>
    <field name="category">search</field>
    <field name="category">lucene</field>
    <field name="body">Solr is a full...</field>
  </doc>
</add>
Delete documents
Delete by id:
<delete><id>05591</id></delete>
Delete by query (multiple documents):
<delete><query>manufacturer:microsoft</query></delete>
Commit / Optimize
- <commit/> tells Solr that all changes made since the last commit should be made available for searching
- <optimize/> does the same as commit, and also merges all index segments, restructuring Lucene's files to improve search performance
- Optimization is generally good to do when indexing has completed
- If there are frequent updates, schedule optimization for low-usage times
- An index does not need to be optimized to work properly, and optimization can be a time-consuming process
Index XML documents
Use the post.jar command line tool for POSTing raw XML to Solr.
Options (the slide marked default values in red; for -Ddata the default is files):
- -Ddata=[files|args|stdin]
- -Durl=http://localhost:8983/solr/update
- -Dcommit=yes
Examples:
java -jar post.jar *.xml
java -Ddata=args -jar post.jar "<delete><id>42</id></delete>"
java -Ddata=stdin -jar post.jar
java -Dcommit=no -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"
Index XML files using HTTP POST
The curl command does this with --data-binary and an appropriate Content-type header reflecting that the payload is XML.
Example: using HTTP POST to send the data over the network to the Solr server:
curl http://localhost:9090/solr/update -H "Content-type:text/xml;charset=utf-8" --data-binary @ipod_other.xml
Index CSV using remote streaming
Having Solr read a local CSV file directly can be more efficient than sending it over the network via HTTP. Remote streaming must be enabled for this method to work: set enableRemoteStreaming="true" in solrconfig.xml:
<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048"/>
Examples:
java -Ddata=args -Durl=http://localhost:9090/solr/update -jar post.jar "<commit/>"
curl http://localhost:9090/solr/update/csv -F "stream.file=d:/Study/Solr/apache-solr-1.4.1/example/exampledocs/books.csv" -F "commit=true" -F "optimize=true" -F "stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:9090/solr/update/csv?overwrite=false&stream.file=d:/Study/Solr/apache-solr-1.4.1/example/exampledocs/books.csv&commit=true&optimize=true"
Index rich documents with Solr Cell
Solr uses Apache Tika, a framework wrapping many different format parsers such as PDFBox, POI, and others.
Examples:
curl "http://localhost:9090/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@tutorial.html"
curl "http://localhost:9090/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F myfile=@tutorial.html (index HTML)
Capture <div> tags separately, then map that field to a dynamic field named foo_t:
curl "http://localhost:9090/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div" -F tutorial=@tutorial.pdf (index PDF)
Updating a Solr Index with JSON
The JSON request handler needs to be configured in solrconfig.xml:
<requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler"/>
Example:
curl "http://localhost:8983/solr/update/json?commit=true" --data-binary @books.json -H "Content-type:application/json"
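The books.json payload follows Solr's JSON update format; a minimal sketch (the field names are illustrative, not the actual contents of the example file):

```json
{
  "add": {
    "doc": { "id": "978-0641723445", "title": "Apache Solr", "genre_s": "search" }
  }
}
```

Since commit=true is passed on the URL, no separate "commit" command is needed in the body.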
Searching
- Spellcheck
- Editorial results replacement
- Scaling index size with distributed search
Default Query Syntax
Lucene query syntax [; sort specification]:
mission impossible; releaseDate desc
+mission +impossible -actor:cruise
"mission impossible" -actor:cruise
title:spiderman^10 description:spiderman
description:"spiderman movie"~10
+HDTV +weight:[0 TO 100]
Wildcard queries: te?t, te*t, test*
Default Parameters
Query arguments for HTTP GET/POST to /select
Search Results
http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
<response>
  <responseHeader><status>0</status><QTime>1</QTime></responseHeader>
  <result numFound="16173" start="0">
    <doc>
      <str name="name">Apple 60 GB iPod with Video</str>
      <float name="price">399.0</float>
    </doc>
    <doc>
      <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
      <float name="price">479.95</float>
    </doc>
  </result>
</response>
Query response writers
- Query responses are written by the registered writer whose name matches the 'wt' request parameter
- If 'wt' is not specified in the request, the default writer (XML) is used
- E.g.: http://localhost:8983/solr/select?q=title:monsters&wt=json&indent=true
Caching
- An IndexSearcher's view of an index is fixed, so aggressive caching is possible, with consistency across multi-query requests
- filterCache: unordered set of document ids matching a query
- resultCache: ordered subset of document ids matching a query
- documentCache: the stored fields of documents
- userCaches: application specific, for custom query handlers
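These caches are configured in solrconfig.xml; a sketch with illustrative sizes (note that in solrconfig.xml the result cache is declared as queryResultCache):

```xml
<!-- solrconfig.xml: cache configuration; sizes and autowarm counts are illustrative -->
<filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
<documentCache class="solr.LRUCache" size="512" initialSize="512"/>
```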
Configuring Relevancy
<fieldtype name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  </analyzer>
</fieldtype>
Faceted Browsing Example
Faceted Browsing
[Diagram: Search(Query, Filter[], Sort, offset, n) for the query "computer" sorted by price asc. The search produces a DocList (a section of the ordered results) and a DocSet (the unordered set of all results); together they form the Query Response. Facet counts are intersection sizes of that DocSet with filter DocSets, e.g. under computer_type:PC: proc_manu:Intel = 594, proc_manu:AMD = 382; price:[0 TO 500] = 247, price:[500 TO 1000] = 689; manu:Dell = 104, manu:HP = 92, manu:Lenovo = 75; memory:[1GB TO *].]
Index optimization
High Availability
[Diagram: a DB sends updates to an Updater on the Solr Master; index replication copies the index to multiple Solr Searchers; a Load Balancer spreads HTTP search requests from the app servers (dynamic HTML generation) across the Searchers; an admin terminal issues admin queries against the Master.]
Distributed and replicated Solr architecture
Index by using SolrJ
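This slide's code was shown as an image. A minimal indexing sketch against the Solr 1.4-era SolrJ API might look like the following; the URL and field names are illustrative, and it needs the solrj library and a running Solr instance, so it is not runnable standalone:

```java
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexExample {
    public static void main(String[] args) throws Exception {
        // Point at the Solr instance (URL illustrative)
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "05991");
        doc.addField("title", "Apache Solr");
        server.add(doc);   // send the document
        server.commit();   // make it searchable
    }
}
```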
Query with SolrJ
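Likewise, the query code was an image; a querying sketch with the same SolrJ API (query string and field name illustrative, and again dependent on the solrj library and a live server):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class QueryExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("title:solr");
        query.setStart(0);
        query.setRows(10);
        QueryResponse rsp = server.query(query);
        for (SolrDocument d : rsp.getResults()) {
            // print the title of each matching document
            System.out.println(d.getFieldValue("title"));
        }
    }
}
```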
Distributed and replicated Solr architecture (cont.)
- At this time, applications must still handle the process of sending documents to individual shards for indexing
- The size of index a machine can hold depends on the machine's configuration (RAM, CPU, disk, and so on), the query and indexing volume, document size, and the search patterns
- Typically a single machine can hold from several million up to around 100 million documents
Advanced Functionality
- Structured data store: import data with the DataImportHandler (JDBC, HTTP, File, URL)
- Support for other programming languages (.Net, PHP, Ruby, Perl, Python, ...)
- Support for NoSQL databases like MongoDB, Cassandra?
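A minimal DataImportHandler sketch, assuming a JDBC source; the driver, connection URL, and columns are illustrative. The handler itself is registered as a requestHandler in solrconfig.xml and triggered with /dataimport?command=full-import:

```xml
<!-- data-config.xml: pull rows from a database table into Solr documents -->
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/catalog"
              user="solr" password="secret"/>
  <document>
    <entity name="item" query="SELECT id, name FROM item">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
    </entity>
  </document>
</dataConfig>
```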
Other open source search servers
- Sphinx
- Elasticsearch
Resources
http://wiki.apache.org/solr/UpdateCSV
http://wiki.apache.org/solr/ExtractingRequestHandler
http://lucene.apache.org/tika/
http://wiki.apache.org/solr/

Solr 1.4 Enterprise Search Server
Apache Conf Europe 2006 - Yonik Seeley