Enterprise search with Solr Minh Tran
Why does search matter?
- Then: most of the data users encountered was created for the web, and heavy use of a site's search function was considered a failure of navigation.
- Now: navigation is not always relevant, users have less patience to browse, and they are used to "navigation by search box".
Confidential
What is Solr?
- Open source enterprise search platform based on the Apache Lucene project
- REST-like HTTP/XML and JSON APIs
- Powerful full-text search, hit highlighting, and faceted search
- Database integration and rich document (e.g., Word, PDF) handling
- Dynamic clustering, distributed search, and index replication
- Loose schema to define types and fields
- Written in Java 5, deployable as a WAR
Public websites using Solr
- Mature product powering search for public sites such as Digg, CNET, Zappos, and Netflix
- More information: http://wiki.apache.org/solr/PublicServers
Architecture
[Diagram of Solr components: Admin Interface, HTTP Request Servlet, Update Servlet, Standard Request Handler, Disjunction Max Request Handler, Custom Request Handler, XML Update Interface, XML Response Writer, Solr Core (Config, Schema, Analysis, Caching, Update Handler, Concurrency), Lucene, Replication]
Starting Solr
- Solr needs two settings:
  - solr.solr.home: the Solr home folder, which contains conf/solrconfig.xml
  - solr.data.dir: the folder that contains the index folder
- Alternatively, configure a JNDI lookup of java:comp/env/solr/home to point to the Solr directory.
- For example, with the bundled Jetty:
    java -Dsolr.solr.home=./solr -Dsolr.data.dir=./solr/data -jar start.jar
- On other web servers, set these values as Java system properties.
Web Admin Interface
How Solr Sees the World
- An index is built of one or more Documents.
- A Document consists of one or more Fields.
- A Field consists of a name, content, and metadata telling Solr how to handle the content.
- You tell Solr what kind of data a field contains by specifying its field type.
Field Analysis
- Field analyzers are used both during ingestion, when a document is indexed, and at query time.
- An analyzer examines the text of fields and generates a token stream.
- An analyzer may be a single class, or it may be composed of a tokenizer followed by a series of filter classes.
- Tokenizers break field data into lexical units, or tokens.
- Filters then transform the tokens, for example by setting all letters to lowercase, eliminating punctuation and accents, or mapping words to their stems.
- Example: with lowercasing applied at both index and query time, "ram", "Ram", and "RAM" all match a query for "ram".
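The analysis chain described on this slide can be illustrated with a toy pipeline. This is a minimal sketch in plain Python, not Solr's implementation: the function names (`tokenize`, `lowercase`, `strip_punctuation_and_accents`, `analyze`) are hypothetical stand-ins for a tokenizer followed by two filters.

```python
import string
import unicodedata


def tokenize(text):
    """Split field text on whitespace (a stand-in for a Solr tokenizer)."""
    return text.split()


def lowercase(tokens):
    """Filter: set all letters to lowercase."""
    return [token.lower() for token in tokens]


def strip_punctuation_and_accents(tokens):
    """Filter: drop accents and leading/trailing punctuation."""
    cleaned = []
    for token in tokens:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", token)
        no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
        no_punct = no_accents.strip(string.punctuation)
        if no_punct:
            cleaned.append(no_punct)
    return cleaned


def analyze(text):
    """Run the tokenizer, then each filter in order, as an analyzer would."""
    tokens = tokenize(text)
    for token_filter in (lowercase, strip_punctuation_and_accents):
        tokens = token_filter(tokens)
    return tokens
```

Because the same `analyze` function would be applied to both the indexed text and the query text, "ram", "Ram", and "RAM" all end up as the same token.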
Schema.xml
- The schema.xml file is located in ../solr/conf.
- The schema file starts with the <schema> tag.
- Solr supports one schema per deployment.
- The schema can be organized into three sections: types, fields, and other declarations.
Example for TextField type
Filter explanation
- StopFilterFactory: after the text has been tokenized on whitespace, removes any common words (stop words).
- WordDelimiterFilterFactory: handles special cases with dashes, case transitions, etc.
- LowerCaseFilterFactory: lowercases all terms.
- EnglishPorterFilterFactory: stems using the Porter stemming algorithm, e.g. "runs" and "running" are reduced to the root "run". Porter stemming strips suffixes, so an irregular form such as "ran" is left unchanged.
- RemoveDuplicatesTokenFilterFactory: removes any duplicate tokens.
Field Attributes
- Indexed: indexed Fields are searchable and sortable. Solr's analysis process can also run on indexed Fields, which can alter the content to improve or change results.
- Stored: the contents of a stored Field are saved in the index. This is useful for retrieving and highlighting the contents for display, but it is not necessary for the actual search. For example, many applications store pointers to the location of the contents rather than the actual contents of a file.
Field Definitions
- Field attributes: name, type, indexed, stored, multiValued, omitNorms
- Dynamic fields, in the spirit of Lucene:
    <dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
    <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
    <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
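To make the dynamic field idea concrete, here is a minimal sketch of how a field name could be resolved to a type using the three patterns from this slide. It assumes that explicitly declared fields win over dynamic patterns and that patterns are tried in declaration order; Solr's real resolution rules may differ, and `resolve_field_type` is a hypothetical helper, not a Solr API.

```python
from fnmatch import fnmatchcase

# Patterns and types copied from the dynamicField declarations on the slide.
DYNAMIC_FIELDS = [("*_i", "sint"), ("*_s", "string"), ("*_t", "text")]


def resolve_field_type(field_name, explicit_fields=None):
    """Return the type for field_name, or None if nothing matches.

    explicit_fields is an assumed mapping of declared field names to types.
    """
    explicit_fields = explicit_fields or {}
    if field_name in explicit_fields:
        return explicit_fields[field_name]
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None
```

With these declarations a document can carry a field such as `price_i` or `summary_t` without the schema naming it in advance.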
Other declarations
- <uniqueKey>url</uniqueKey>: the url field is the unique identifier; Solr uses it to determine whether a document is being added or updated.
- defaultSearchField: the Field Solr uses in queries when no field is prefixed to a query term. For example, q=title:Solr searches the title field; if you entered q=Solr instead, the default search field would apply.
Indexing data
- The examples use curl to interact with Solr: http://curl.haxx.se/download.html
- Supported data formats:
  - Solr's native XML
  - CSV (character-separated values)
  - Rich documents through Solr Cell
  - JSON
  - Direct database and XML import through Solr's DataImportHandler
Add / Update documents
- HTTP POST to add or update:
    <add>
      <doc boost="2">
        <field name="article">05991</field>
        <field name="title">Apache Solr</field>
        <field name="subject">An intro...</field>
        <field name="category">search</field>
        <field name="category">lucene</field>
        <field name="body">Solr is a full...</field>
      </doc>
    </add>
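Building the add message by string concatenation risks broken XML when a value contains characters such as `&` or `<`. The following is a minimal sketch that builds the same message shape with Python's standard library; `build_add_message` is a hypothetical helper, and the field names are the ones from the slide.

```python
import xml.etree.ElementTree as ET


def build_add_message(fields, boost=None):
    """Build an <add><doc>...</doc></add> message.

    fields is a list of (name, value) pairs; repeating a name produces a
    multi-valued field, as with "category" on the slide.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if boost is not None:
        doc.set("boost", str(boost))
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


message = build_add_message(
    [
        ("article", "05991"),
        ("title", "Apache Solr"),
        ("category", "search"),
        ("category", "lucene"),
    ],
    boost=2,
)
```

The resulting string can be sent as the body of an HTTP POST to the update URL.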
Delete documents
- Delete by id:
    <delete><id>05591</id></delete>
- Delete by query (multiple documents):
    <delete><query>manufacturer:microsoft</query></delete>
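The two delete messages can be generated the same way as add messages. This is a small sketch with hypothetical helper names (`delete_by_id`, `delete_by_query`); it only builds the XML text and does not contact a server.

```python
import xml.etree.ElementTree as ET


def delete_by_id(doc_id):
    """Build a message that deletes the document with the given unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Build a message that deletes every document matching the query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")
```

Delete by query can remove many documents at once, so it is worth running the same query through a search first to see what it matches.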
Commit / Optimize
- <commit/> tells Solr that all changes made since the last commit should be made available for searching.
- <optimize/> does the same as commit and also merges all index segments, restructuring Lucene's files to improve search performance.
- Optimization is generally good to do when indexing has completed.
- If there are frequent updates, schedule optimization for low-usage times.
- An index does not need to be optimized to work properly, and optimization can be time-consuming.
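Update messages such as `<commit/>` are sent as the body of an HTTP POST. Below is a minimal sketch using only Python's standard library. The default URL is the one used elsewhere in this deck; `build_update_request` and `send` are hypothetical helper names, and `send` is defined but not called here because it needs a running Solr server.

```python
import urllib.request


def build_update_request(message, url="http://localhost:8983/solr/update"):
    """Wrap an XML update message in a POST request (not yet sent)."""
    return urllib.request.Request(
        url,
        data=message.encode("utf-8"),  # a request with a body is sent as POST
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )


def send(request):
    """Send the request and return (HTTP status, response body)."""
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")


commit_request = build_update_request("<commit/>")
optimize_request = build_update_request("<optimize/>")
```

A typical indexing run would post many add messages, then send a single commit at the end, and optimize only during low-usage times.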
Index XML documents
- post.jar is a command-line tool for POSTing raw XML to a Solr server.
- Other options (the defaults are data=files, the url shown, and commit=yes):
    -Ddata=[files|args|stdin]
    -Durl=http://localhost:8983/solr/update
    -Dcommit=yes
- Examples:
    java -jar post.jar *.xml
    java -Ddata=args -jar post.jar "<delete><id>42</id></delete>"
    java -Ddata=stdin -jar post.jar
    java -Dcommit=no -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"
Index a file using HTTP POST
- curl posts the file with --data-binary and a Content-type header that matches the data. In the example below the file is XML, so the header is text/xml; a CSV file would instead be sent to /update/csv with a text/plain content type.
- Example: using HTTP POST to send the data over the network to the Solr server:
    curl http://localhost:9090/solr/update -H "Content-type:text/xml;charset=utf-8" --data-binary @ipod_other.xml
Index CSV using remote streaming
- Letting Solr read a CSV file from its local file system can be more efficient than sending the file over the network via HTTP.
- Remote streaming must be enabled for this method to work. Set enableRemoteStreaming="true" in solrconfig.xml:
    <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048"/>
- Examples:
    java -Ddata=args -Durl=http://localhost:9090/solr/update -jar post.jar "<commit/>"
    curl http://localhost:9090/solr/update/csv -F "stream.file=d:/Study/Solr/apache-solr-1.4.1/example/exampledocs/books.csv" -F "commit=true" -F "optimize=true" -F "stream.contentType=text/plain;charset=utf-8"
    curl "http://localhost:9090/solr/update/csv?overwrite=false&stream.file=d:/Study/Solr/apache-solr-1.4.1/example/exampledocs/books.csv&commit=true&optimize=true"
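The remote streaming URL is easy to get wrong by hand because the file path and content type need URL encoding. Here is a minimal sketch that assembles it with the parameters shown on this slide (stream.file, stream.contentType, commit, overwrite). The helper name `csv_stream_url` and the path `/data/books.csv` are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def csv_stream_url(base_url, csv_path, commit=True, overwrite=None):
    """Build a /update/csv URL that asks Solr to read csv_path itself."""
    params = {
        "stream.file": csv_path,
        "stream.contentType": "text/plain;charset=utf-8",
        "commit": "true" if commit else "false",
    }
    if overwrite is not None:
        params["overwrite"] = "true" if overwrite else "false"
    return base_url.rstrip("/") + "/update/csv?" + urlencode(params)


# Hypothetical path; it must be readable by the Solr server process.
url = csv_stream_url("http://localhost:9090/solr", "/data/books.csv", overwrite=False)
query = parse_qs(urlsplit(url).query)
```

The path is interpreted on the machine running Solr, not on the client that issues the request.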
Index rich documents with Solr Cell
- Solr Cell uses Apache Tika, a framework that wraps many different format parsers such as PDFBox, POI, and others.
- Examples (indexing HTML):
    curl "http://localhost:9090/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@tutorial.html"
    curl "http://localhost:9090/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F myfile=@tutorial.html
- Capture <div> tags separately, and then map that field to a dynamic field named foo_t (indexing a PDF):
    curl "http://localhost:9090/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div" -F tutorial=@tutorial.pdf
Updating a Solr Index with JSON
- The JSON request handler needs to be configured in solrconfig.xml:
    <requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler"/>
- Example:
    curl "http://localhost:8983/solr/update/json?commit=true" --data-binary @books.json -H "Content-type:application/json"
Searching
- Spellcheck
- Editorial results replacement
- Scaling index size with distributed search
Default Query Syntax
- Lucene query syntax [; sort specification]
- Examples:
    mission impossible; releaseDate desc
    +mission +impossible -actor:cruise
    "mission impossible" -actor:cruise
    title:spiderman^10 description:spiderman
    description:"spiderman movie"~10
    +HDTV +weight:[0 TO 100]
- Wildcard queries: te?t, te*t, test*
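Query strings in this syntax contain characters such as `+`, `:`, spaces, and quotes that must be URL-encoded before they go into a request. The sketch below builds a /select URL with Python's standard library. It passes the sort as a separate `sort` parameter rather than after a semicolon in `q`; `select_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def select_url(query, sort=None, start=0, rows=10, fields=None,
               base_url="http://localhost:8983/solr"):
    """Build a /select URL with the query safely URL-encoded."""
    params = [("q", query), ("start", start), ("rows", rows)]
    if sort:
        params.append(("sort", sort))
    if fields:
        params.append(("fl", ",".join(fields)))
    return base_url.rstrip("/") + "/select?" + urlencode(params)


url = select_url(
    "+mission +impossible -actor:cruise",
    sort="releaseDate desc",
    rows=2,
    fields=["name", "price"],
)
decoded = parse_qs(urlsplit(url).query)
```

Without encoding, a literal `+` in the URL would be read by the server as a space, silently turning required terms into optional ones.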
Default Parameters
- Query arguments for HTTP GET/POST to /select
Search Results
- Request: http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
- Response:
    <response>
      <responseHeader>
        <status>0</status>
        <QTime>1</QTime>
      </responseHeader>
      <result numFound="16173" start="0">
        <doc>
          <str name="name">Apple 60 GB iPod with Video</str>
          <float name="price">399.0</float>
        </doc>
        <doc>
          <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
          <float name="price">479.95</float>
        </doc>
      </result>
    </response>
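A client usually turns this XML response into native data structures. The sketch below parses the response shown on this slide with Python's standard library. It handles only the simple `str`, `float`, `int`, and `long` elements; multi-valued `arr` fields are left out for brevity, and `parse_results` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The response from the slide, reproduced as a string.
SAMPLE_RESPONSE = """
<response>
  <responseHeader><status>0</status><QTime>1</QTime></responseHeader>
  <result numFound="16173" start="0">
    <doc>
      <str name="name">Apple 60 GB iPod with Video</str>
      <float name="price">399.0</float>
    </doc>
    <doc>
      <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
      <float name="price">479.95</float>
    </doc>
  </result>
</response>
"""


def parse_results(xml_text):
    """Return (numFound, list of documents as dicts)."""
    root = ET.fromstring(xml_text.strip())
    result = root.find("result")
    docs = []
    for doc in result.findall("doc"):
        record = {}
        for child in doc:
            value = child.text
            if child.tag == "float":
                value = float(value)
            elif child.tag in ("int", "long"):
                value = int(value)
            record[child.get("name")] = value
        docs.append(record)
    return int(result.get("numFound")), docs


num_found, docs = parse_results(SAMPLE_RESPONSE)
```

Note that `numFound` is the total number of matches, while the number of `doc` elements is limited by the `rows` parameter.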
Query response writers
- Query responses are written by the response writer whose registered name matches the 'wt' request parameter.
- The writer registered as "default" is used if 'wt' is not specified in the request.
- Example: http://localhost:8983/solr/select?q=title:monsters&wt=json&indent=true
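With wt=json the client can use an ordinary JSON parser instead of an XML one. The sketch below reads a hand-written sample that follows the general shape of a Solr JSON response (a responseHeader plus a response object holding numFound, start, and docs). The titles in the sample are made up for illustration, and `titles_from_json` is a hypothetical helper name.

```python
import json

# Hand-written sample for illustration; not output captured from a server.
SAMPLE_JSON = """
{
  "responseHeader": {"status": 0, "QTime": 1},
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"title": "Monsters, Inc."},
      {"title": "Monsters vs. Aliens"}
    ]
  }
}
"""


def titles_from_json(json_text):
    """Return the title of each document in a JSON search response."""
    payload = json.loads(json_text)
    if payload["responseHeader"]["status"] != 0:
        raise ValueError("Solr reported a non-zero status")
    return [doc.get("title") for doc in payload["response"]["docs"]]


titles = titles_from_json(SAMPLE_JSON)
```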
Caching
- An IndexSearcher's view of an index is fixed, which makes aggressive caching possible and gives consistency for multi-query requests.
- filterCache: unordered set of document ids matching a query
- resultCache (queryResultCache in solrconfig.xml): ordered subset of document ids matching a query
- documentCache: the stored fields of documents
- userCaches: application-specific caches for custom query handlers
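Because a searcher's view of the index never changes, a cached answer stays valid for as long as that searcher is in use. The following is a toy sketch of a bounded least-recently-used cache of the kind a filter cache could be built on. It illustrates the idea only and is not Solr's cache implementation; the class name `LRUCache` and the document ids are assumptions for the example.

```python
from collections import OrderedDict


class LRUCache:
    """A small least-recently-used cache with a fixed capacity."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used

    def clear(self):
        """Drop everything, e.g. when a new searcher replaces the old one."""
        self._entries.clear()


# Filter query -> unordered set of matching document ids (made-up ids).
filter_cache = LRUCache(max_size=2)
filter_cache.put("manu:Dell", frozenset({1, 2, 9}))
filter_cache.put("manu:HP", frozenset({3, 10}))
filter_cache.get("manu:Dell")                      # touch Dell so HP is oldest
filter_cache.put("manu:Lenovo", frozenset({11}))   # evicts manu:HP
```

A result cache would differ only in what it stores: an ordered list of ids for a query and sort, rather than an unordered set.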
Configuring Relevancy
    <fieldtype name="text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      </analyzer>
    </fieldtype>
Faceted Browsing Example
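Faceted Browsing
[Diagram: the query "computer" with sort "price asc" and the filters computer_type:PC and memory:[1GB TO *] goes through Search(Query, Filter[], Sort, offset, n). The search yields a DocList (a section of the ordered results) for the query response and a DocSet (the unordered set of all results). Facet counts come from intersectionSize() of the DocSet with each facet's set: proc_manu:Intel = 594, proc_manu:AMD = 382, price:[0 TO 500] = 247, price:[500 TO 1000] = 689, manu:Dell = 104, manu:HP = 92, manu:Lenovo = 75.]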
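The facet counts in this diagram come from intersecting the set of all matching documents with the set of documents for each facet constraint. A minimal sketch of that computation with Python sets follows. The document ids are made up and far smaller than the counts in the diagram, and `facet_counts` is a hypothetical helper name.

```python
def facet_counts(result_docset, facet_docsets):
    """Count, for each facet constraint, how many results it contains."""
    return {
        constraint: len(result_docset & docs)
        for constraint, docs in facet_docsets.items()
    }


# Made-up document ids: the unordered set of all results for the query.
results = {1, 2, 3, 4, 5, 6}

# Made-up sets of documents matching each facet constraint in the index.
facets = {
    "manu:Dell": {1, 2, 9},
    "manu:HP": {3, 10},
    "manu:Lenovo": {11},
}

counts = facet_counts(results, facets)
```

The per-constraint sets are exactly the kind of value a filter cache holds, which is why faceting benefits from caching.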
Index optimization
High Availability
[Diagram labels: appservers (dynamic HTML generation), HTTP search requests, load balancer, Solr searchers, index replication, Solr master, updater, DB, updates, admin queries, admin terminal]
Distributed and replicated Solr architecture
Index by using SolrJ
Query with SolrJ
Distributed and replicated Solr architecture (cont.)
- At this time, applications must still handle sending documents to individual shards for indexing.
- The size of index that a machine can hold depends on the machine's configuration (RAM, CPU, disk, and so on), the query and indexing volume, document size, and the search patterns.
- Typically, a single machine can hold from several million up to around 100 million documents.
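Since the application has to decide which shard receives each document, one common approach is to hash the unique key. The sketch below shows that idea along with building the comma-separated value for the `shards` request parameter used by distributed search. The shard addresses are hypothetical, hashing by key is one possible policy rather than something this deck prescribes, and `shard_for` and `shards_param` are hypothetical helper names.

```python
import hashlib

# Hypothetical shard addresses, for illustration only.
SHARDS = ["localhost:8983/solr", "localhost:7574/solr"]


def shard_for(unique_key, shards):
    """Pick a shard deterministically from the document's unique key."""
    digest = hashlib.md5(unique_key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def shards_param(shards):
    """Build the value of the shards parameter for a distributed query."""
    return ",".join(shards)


target = shard_for("http://example.com/page1", SHARDS)
param = shards_param(SHARDS)
```

Using a stable hash rather than Python's built-in `hash` keeps the assignment the same across processes, so an update to an existing document goes to the shard that already holds it. Changing the number of shards changes the assignment, which would require reindexing under this simple scheme.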
Advanced Functionality
- Structured data store
- Data with the Data Import Handler (JDBC, HTTP, File, URL)
- Support for other programming languages (.NET, PHP, Ruby, Perl, Python, …)
- Support for NoSQL databases like MongoDB and Cassandra?
Other open source servers
- Sphinx
- Elasticsearch
Resources
http://wiki.apache.org/solr/ExtractingRequestHandler
http://lucene.apache.org/tika/
http://wiki.apache.org/solr/

Apache Solr

  • 1. Enterprise search with Solr Minh Tran
  • 2. Why does search matter? Then: Most of the data encountered created for the web Heavy use of a site ‘s search function considered a failure in navigation Now: Navigation not always relevant Less patience to browse Users are used to “navigation by search box” Confidential 2
  • 3. What is SOLR Open source enterprise search platform based on Apache Lucene project. REST-like HTTP/XML and JSONAPIs Powerful full-text search, hit highlighting, faceted search Database integration, and rich document (e.g., Word, PDF) handling Dynamic clustering, distributed search and index replication Loose Schema to define types and fields Written in Java5, deployable as a WAR Confidential 3
  • 4. Public Websites using Solr Mature product powering search for public sites like Digg, CNet, Zappos, and Netflix See here for more information: http://wiki.apache.org/solr/PublicServers Confidential 4
  • 5. Architecture 5 Admin Interface HTTP Request Servlet Update Servlet Standard Request Handler Disjunction Max Request Handler Custom Request Handler XML Update Interface XML Response Writer Solr Core Update Handler Caching Config Schema Analysis Concurrency Lucene Replication Confidential
  • 6. Starting Solr We need to set these settings for SOLR: solr.solr.home: SOLR home folder contains conf/solrconfig.xml solr.data.dir: folder contains index folder Or configure a JNDI lookup of java:comp/env/solr/home to point to the solr directory. For e.g: java -Dsolr.solr.home=./solr -Dsolr.data.dir=./solr/data -jar start.jar (Jetty) Other web server, set these values by setting Java properties Confidential 6
  • 7. Web Admin Interface Confidential 7
  • 9. How Solr Sees the World An index is built of one or more Documents A Document consists of one or more Fields Documents are composed of fields A Field consists of a name, content, and metadata telling Solr how to handle the content. You can tell Solr about the kind of data a field contains by specifying its field type Confidential 9
  • 10. Field Analysis Field analyzers are used both during ingestion, when a document is indexed, and at query time An analyzer examines the text of fields and generates a token stream. Analyzers may be a single class or they may be composed of a series of tokenizer and filter classes. Tokenizersbreak field data into lexical units, or tokens Example: Setting all letters to lowercase Eliminating punctuation and accents, mapping words to their stems, and so on “ram”, “Ram” and “RAM” would all match a query for “ram” Confidential 10
  • 11. Schema.xml schema.xml file located in ../solr/conf schema file starts with <schema> tag Solr supports one schema per deployment The schema can be organized into three sections: Types Fields Other declarations 11
  • 12. Example for TextField type Confidential 12
  • 13. Filter explanation StopFilterFactory: Tokenize on whitespace, then removed any common words WordDelimiterFilterFactory: Handle special cases with dashes, case transitions, etc. LowerCaseFilterFactory: lowercase all terms. EnglishPorterFilterFactory: Stem using the Porter Stemming algorithm. E.g: “runs, running, ran”  its elemental root "run" RemoveDuplicatesTokenFilterFactory: Remove any duplicates: Confidential 13
  • 14. Field Attributes. Indexed: indexed Fields are searchable and sortable. You can also run Solr's analysis process on indexed Fields, which can alter the content to improve or change results. Stored: the contents of a stored Field are saved in the index. This is useful for retrieving and highlighting the contents for display but is not necessary for the actual search. For example, many applications store pointers to the location of contents rather than the actual contents of a file.
  • 15. Field Definitions. Field attributes: name, type, indexed, stored, multiValued, omitNorms. Dynamic fields, in the spirit of Lucene:
      <dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
      <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
      <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
  • 16. Other declarations. <uniqueKey>url</uniqueKey>: the url field is the unique identifier, used to determine whether a document is being added or updated. defaultSearchField: the Field Solr uses in queries when no field is prefixed to a query term. For example, q=title:Solr searches the title field; if you entered q=Solr instead, the default search field would apply.
  • 17. Indexing data. Use curl to interact with Solr: http://curl.haxx.se/download.html. Data can be indexed in different formats: Solr's native XML; CSV (Character Separated Value); rich documents through Solr Cell; JSON; direct database and XML import through Solr's DataImportHandler.
  • 18. Add / Update documents. HTTP POST to add / update:
      <add>
        <doc boost="2">
          <field name="article">05991</field>
          <field name="title">Apache Solr</field>
          <field name="subject">An intro...</field>
          <field name="category">search</field>
          <field name="category">lucene</field>
          <field name="body">Solr is a full...</field>
        </doc>
      </add>
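As a rough illustration of how a client might assemble the `<add>` message from this slide, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The helper name `build_add_message` is hypothetical, and the sketch only builds the XML text; it does not contact a Solr server.

```python
import xml.etree.ElementTree as ET


def build_add_message(fields, boost=None):
    """Build an <add><doc>...</doc></add> message as a string.

    `fields` is a list of (name, value) pairs so that a multi-valued
    field such as "category" can simply appear more than once.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if boost is not None:
        doc.set("boost", str(boost))
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


# Field names and values are taken from the slide.
message = build_add_message(
    [
        ("article", "05991"),
        ("title", "Apache Solr"),
        ("subject", "An intro..."),
        ("category", "search"),
        ("category", "lucene"),
        ("body", "Solr is a full..."),
    ],
    boost=2,
)

if __name__ == "__main__":
    print(message)
```

Building the message with an XML library rather than string concatenation means characters such as `&` or `<` in field values are escaped automatically.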
  • 19. Delete documents. Delete by Id: <delete><id>05591</id></delete> Delete by query (multiple documents): <delete><query>manufacturer:microsoft</query></delete>
  • 20. Commit / Optimize. <commit/> tells Solr that all changes made since the last commit should be made available for searching. <optimize/> does the same as commit and also merges all index segments, restructuring Lucene's files to improve performance for searching. Optimization is generally good to do when indexing has completed. If there are frequent updates, you should schedule optimization for low-usage times. An index does not need to be optimized to work properly, and optimization can be a time-consuming process.
  • 21. Index XML documents. Use the command line tool (post.jar) for POSTing raw XML to a Solr server. Other options (the defaults are data=files, the URL shown, and commit=yes):
      -Ddata=[files|args|stdin]
      -Durl=http://localhost:8983/solr/update
      -Dcommit=yes
    Examples:
      java -jar post.jar *.xml
      java -Ddata=args -jar post.jar "<delete><id>42</id></delete>"
      java -Ddata=stdin -jar post.jar
      java -Dcommit=no -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"
  • 22. Index CSV file using HTTP POST. The curl command does this with --data-binary and an appropriate Content-type header reflecting that it's XML. Example, using HTTP POST to send the data over the network to the Solr server: curl http://localhost:9090/solr/update -H "Content-type:text/xml;charset=utf-8" --data-binary @ipod_other.xml
  • 24. Index rich documents with Solr Cell. Solr uses Apache Tika, a framework for wrapping many different format parsers such as PDFBox, POI, and others. Examples:
      curl "http://localhost:9090/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@tutorial.html"
    Index HTML:
      curl "http://localhost:9090/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F myfile=@tutorial.html
    Capture <div> tags separately, and then map that field to a dynamic field named foo_t (index PDF):
      curl "http://localhost:9090/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div" -F tutorial=@tutorial.pdf
  • 25. Updating a Solr Index with JSON. The JSON request handler needs to be configured in solrconfig.xml: <requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler"/> Example: curl "http://localhost:8983/solr/update/json?commit=true" --data-binary @books.json -H "Content-type:application/json"
  • 26. Searching: spellcheck, editorial results replacement, scaling index size with distributed search
  • 27. Default Query Syntax. Lucene Query Syntax [; sort specification]. Examples:
      mission impossible; releaseDate desc
      +mission +impossible -actor:cruise
      "mission impossible" -actor:cruise
      title:spiderman^10 description:spiderman
      description:"spiderman movie"~10
      +HDTV +weight:[0 TO 100]
    Wildcard queries: te?t, te*t, test*
  • 28. Default Parameters. Query arguments for HTTP GET/POST to /select
  • 29. Search Results. http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
      <response>
        <responseHeader><status>0</status><QTime>1</QTime></responseHeader>
        <result numFound="16173" start="0">
          <doc>
            <str name="name">Apple 60 GB iPod with Video</str>
            <float name="price">399.0</float>
          </doc>
          <doc>
            <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
            <float name="price">479.95</float>
          </doc>
        </result>
      </response>
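A client usually turns the XML response on this slide into native data structures. The sketch below, using only Python's standard library, parses the sample response copied from the slide; the names `SAMPLE_RESPONSE`, `CONVERTERS`, and `parse_response` are hypothetical, and only the `str`, `float`, and `int` element types are handled.

```python
import xml.etree.ElementTree as ET

# Sample response text copied from the slide.
SAMPLE_RESPONSE = """<response>
  <responseHeader><status>0</status><QTime>1</QTime></responseHeader>
  <result numFound="16173" start="0">
    <doc>
      <str name="name">Apple 60 GB iPod with Video</str>
      <float name="price">399.0</float>
    </doc>
    <doc>
      <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
      <float name="price">479.95</float>
    </doc>
  </result>
</response>"""

# Map element tags to Python types; unknown tags fall back to str.
CONVERTERS = {"str": str, "float": float, "int": int}


def parse_response(xml_text):
    """Return (numFound, list of dicts) for a response shaped like the sample."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = []
    for doc in result.findall("doc"):
        docs.append(
            {f.get("name"): CONVERTERS.get(f.tag, str)(f.text or "") for f in doc}
        )
    return int(result.get("numFound")), docs


if __name__ == "__main__":
    num_found, docs = parse_response(SAMPLE_RESPONSE)
    print(num_found, docs)
```

Note that numFound (16173) is the total number of matches, while the number of `<doc>` elements is limited by the rows parameter (2 in the request on the slide).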
  • 30. Query response writers. Query responses are written by the response writer whose registered name matches the 'wt' request parameter. The writer registered as "default" is used if 'wt' is not specified in the request. E.g.: http://localhost:8983/solr/select?q=title:monsters&wt=json&indent=true
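When a request URL like the one above is built in code, the query string is best assembled with a URL-encoding helper so that characters such as `:` and spaces are escaped. A minimal sketch with Python's standard `urllib.parse` follows; the helper name `select_url` is hypothetical, and the base URL is the localhost address used on the slides.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def select_url(base, **params):
    """Build a /select URL from keyword parameters such as q, wt, rows."""
    return base + "/select?" + urlencode(params)


# Same parameters as the example on the slide.
url = select_url(
    "http://localhost:8983/solr",
    q="title:monsters",
    wt="json",
    indent="true",
)

if __name__ == "__main__":
    print(url)
    # Decoding the query string recovers the original parameter values.
    print(parse_qs(urlsplit(url).query))
```

The colon in `title:monsters` is percent-encoded in the generated URL; the server decodes it back before parsing the query.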
  • 31. Caching. The IndexSearcher's view of an index is fixed, which makes aggressive caching possible and gives consistency for multi-query requests. filterCache: unordered set of document ids matching a query. resultCache: ordered subset of document ids matching a query. documentCache: the stored fields of documents. userCaches: application specific, for custom query handlers.
  • 32. Configuring Relevancy
      <fieldtype name="text" class="solr.TextField">
        <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
          <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
          <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        </analyzer>
      </fieldtype>
  • 34. Faceted Browsing. [Diagram: Search(Query, Filter[], Sort, offset, n) with the inputs computer_type:PC, memory:[1GB TO *], computer, and price asc produces the query response from a DocList (section of ordered results) and a DocSet (unordered set of all results). Facet counts come from intersectionSize() of the DocSet with each facet: proc_manu:Intel = 594, proc_manu:AMD = 382, price:[0 TO 500] = 247, price:[500 TO 1000] = 689, manu:Dell = 104, manu:HP = 92, manu:Lenovo = 75.]
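The facet counts in this slide's diagram are described as intersection sizes between the unordered set of all results and the set of documents matching each facet. The toy sketch below shows that idea with plain Python sets; the document ids and the `facet_counts` helper are made up for illustration and do not reflect Solr's internal data structures.

```python
def facet_counts(doc_set, facet_sets):
    """Count, for each facet, how many matching documents fall into it."""
    return {label: len(doc_set & ids) for label, ids in facet_sets.items()}


# Hypothetical document ids: which documents carry each facet value.
facet_sets = {
    "manu:Dell": {1, 2, 5},
    "manu:HP": {3, 4},
    "manu:Lenovo": {6},
}

# Hypothetical unordered set of all documents matching the query and filters.
matching_docs = {1, 2, 3, 6, 7}

counts = facet_counts(matching_docs, facet_sets)

if __name__ == "__main__":
    for label, count in counts.items():
        print(label, "=", count)
```

Because each facet set depends only on the index and not on the current query, such sets are natural candidates for caching and reuse across requests.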
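  • 36. High Availability. [Diagram: app servers (dynamic HTML generation) send HTTP search requests through a load balancer to the Solr searchers; an updater reads from the DB and sends updates to the Solr master; index replication copies the index from the master to the searchers; an admin terminal sends admin queries and updates.]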
  • 37. Distributed and replicated Solr architecture
  • 38. Index by using SolrJ
  • 39. Query with SolrJ
  • 40. Distributed and replicated Solr architecture (cont.). At this time, applications must still handle the process of sending the documents to individual shards for indexing. The size of an index that a machine can hold depends on the machine's configuration (RAM, CPU, disk, and so on), the query and indexing volume, document size, and the search patterns. Typically, the number of documents a single machine can hold ranges from several million up to around 100 million.
  • 41. Advanced Functionality. Import data from structured data stores with the Data Import Handler (JDBC, HTTP, File, URL). Support for other programming languages (.Net, PHP, Ruby, Perl, Python, …). Support for NoSQL databases like MongoDB, Cassandra?
  • 42. Other open source search servers: Sphinx, Elasticsearch
  • 47. Solr 1.4 Enterprise Search Server
  • 51. Apache Conf Europe 2006 - Yonik Seeley