Solr
Hao Chen 2012.04
What is Solr?
• Solr is the popular open source enterprise search platform from the Apache Lucene project.
• Solr powers the search and navigation features of many of the world's largest internet sites.
Lucene
• Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
Lucene vs. Solr
• Lucene is a search library built in Java; Solr is a web application built on top of Lucene.
• In short, Solr = Lucene + added features. A common question is when to choose Solr and when to choose Lucene.
• For more control, use Lucene. For faster development and an easier learning curve, choose Solr.




             http://www.findbestopensource.com/article-detail/lucene-vs-solr
Why do we need Solr?
• Full-text search
  – MySQL “like %keyword%”

  Too slow! And weak!
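To see why `LIKE '%keyword%'` is slow, compare it with the inverted index Lucene builds: the LIKE query must scan every row, while an inverted index maps each term directly to the documents containing it. A toy sketch with made-up documents (real Lucene adds analysis, scoring, and on-disk structures):

```python
# A toy inverted index: the core idea behind Lucene/Solr's speed.
# MySQL "LIKE %keyword%" scans every row; an inverted index maps
# each term straight to the matching document ids.
from collections import defaultdict

docs = {
    1: "ipad2 white 16gb",
    2: "iphone 4s black",
    3: "ipad2 black 32gb",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():       # trivial whitespace "analyzer"
        index[term].add(doc_id)

# Lookup is a dictionary hit, not a full table scan:
print(sorted(index["ipad2"]))       # docs containing the term
```

The lookup cost depends on the number of matching documents, not on the total collection size.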
Major Features of Solr
• Advanced Full-Text Search Capabilities
• Optimized for High Volume Web Traffic
• Standards Based Open Interfaces — XML, JSON and HTTP
• Comprehensive HTML Administration Interfaces
• Server statistics exposed over JMX for monitoring
• Scalability — Efficient Replication to other Solr Search Servers
• Flexible and Adaptable with XML configuration
• Extensible Plugin Architecture



                                             http://lucene.apache.org/solr/
Typical Application Architecture
[Diagram] http request → Web Server; the Web Server reads from a Cache (memcached, Redis, etc.) and a Database (MySQL); the Database feeds Solr / Lucene through the DIH, and the Web Server queries Solr / Lucene.

All the components can be distributed, to make the architecture scalable.
Lucene/Solr Architecture
[Diagram: Lucene/Solr architecture]
• Request Handlers: /admin, /select, /spell
• Response Writers: XML, binary, JSON
• Update Handlers: XML, CSV, binary
• Search Components: Query, Spelling, Faceting, More Like This, Highlighting, Statistics, Debug, Clustering, Distributed Search
• Update Processors: Signature, Logging, Indexing
• Extracting Request Handler (PDF/Word — Apache Tika); Data Import Handler (SQL/RSS); Index Replication
• Shared services: Schema, Config, Query Parsing, Analysis, Highlighting, Faceting, Filtering, Search, Caching
• Core Search (IndexReader/Searcher) and Indexing (IndexWriter) build on Apache Lucene and its Text Analysis
Demo – A live website powered by Solr


     I’ll be showing you more later!
Demo – The backend of the website
Demo - Standard directory layout
Demo - Multiple cores
Demo – Run Solr!
• java -jar start.jar
• Production environment:
  – java -Xms200m -Xmx1400m -jar start.jar >> /home/web_logs/solr/solr$date.log 2>&1 &
  – tailf /home/web_logs/solr/solr20120423.log
     2012-04-07 14:10:50.516::INFO: Started SocketConnector @ 0.0.0.0:8983
Demo – Web Admin Interface
  http://localhost:8983/solr/admin
Demo – Web Admin Interface
  http://localhost:8983/solr/admin

  • SCHEMA: This downloads the schema configuration file (XML) directly to the browser.
  • CONFIG: It is similar to the SCHEMA choice, but this is the main configuration file for Solr.
  • ANALYSIS: It is used for diagnosing potential query/indexing problems having to do with the text analysis. This is a somewhat advanced screen and will be discussed later.
  • SCHEMA BROWSER: This is a neat view of the schema reflecting various heuristics of the actual data in the index. We'll return here later.
  • STATISTICS: Here you will find stats such as timing and cache hit ratios. In Chapter 9, we will visit this screen to evaluate Solr's performance.
Demo – Web Admin Interface
  http://localhost:8983/solr/admin


                 • INFO: This lists static versioning information
                 about internal components to Solr. Frankly, it's
                 not very useful.

                 • DISTRIBUTION: It contains
                 Distributed/Replicated status information, only
                 applicable for such configurations.

                 • PING: Ignore this, although it can be used
                 for a health-check in distributed mode.

                 • LOGGING: This allows you to adjust the
                 logging levels for different parts of Solr at
                 runtime. For Jetty as we're running it, this
                 output goes to the console and nowhere else.
Query
Indexing
Query
• INFO: [core1] webapp=/solr path=/admin/ping params={} status=0 QTime=2
• Apr 23, 2012 5:42:46 PM org.apache.solr.core.SolrCore execute

  INFO: [core1] webapp=/solr path=/select params={wt=json&rows=100&json.nl=map&start=0&q=searchKeyword:ipad2} hits=48 status=0 QTime=0
Query
• INFO: [] webapp=/solr path=/select params={wt=json&rows=20&json.nl=map&start=0&sort=volume+desc&q=CId:50011744+AND+price:[100+TO+*]} hits=1547 status=0 QTime=41

• q=CId:50011744+AND+price:[100+TO+*]
• sort=volume+desc
• start=0
• rows=20

• hits=1547 status=0 QTime=41
Query
• q – the query string (required).
• fl – which fields to return; separate multiple fields with commas or spaces.
• start – offset of the first returned record within the full result set, starting at 0; typically used for paging.
• rows – maximum number of records to return; combined with start to implement paging.
• sort – sort order, in the form sort=<field name>+<desc|asc>[,<field name>+<desc|asc>]…. Example: (inStock desc, price asc) sorts first by "inStock" descending, then by "price" ascending. The default is relevance, descending.
• wt – (writer type) the output format: xml, json, php, phps, etc. Some were added after Solr 1.3; let us know if you need them, as they are not enabled by default.
• fq – (filter query) restricts the results of q to those that also match fq. Example: q=mm&fq=date_time:[20081001 TO 20091031] finds the keyword mm where date_time is between 20081001 and 20091031.




                  More: http://wiki.apache.org/solr/CommonQueryParameters
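As a sketch, the parameters above can be assembled into a /select request URL like this (host, field names, and values are illustrative, echoing the earlier log examples):

```python
# Building a Solr /select URL from the common query parameters.
# Field names (CId, price, volume, date_time) are illustrative.
from urllib.parse import urlencode

params = {
    "q": "CId:50011744 AND price:[100 TO *]",          # main query
    "fq": "date_time:[20081001 TO 20091031]",          # filter query
    "sort": "volume desc",                             # sort order
    "start": 0,                                        # paging offset
    "rows": 20,                                        # page size
    "fl": "id,name,price",                             # fields to return
    "wt": "json",                                      # response format
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

Sending this URL over HTTP GET returns the matching documents in the format selected by wt.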
Demo – PHP Solr Client
Query - Demo
Indexing Data
Indexing Data - Communicating with Solr

  – Direct HTTP or a convenient client API
  – Data streamed remotely or from Solr's
    filesystem
Indexing Data - Data formats/sources

  – Solr-XML:

  – Solr-binary: only supported by the SolrJ client API.

  – CSV: a character-separated value format (often comma-separated).

  – Rich documents like PDF, XLS, DOC, PPT.

  – Solr's DataImportHandler (DIH) contrib add-on is a powerful capability that can communicate with both databases and XML sources (for example, web services). It supports configurable relational and schema mapping and custom transformation additions if needed. The DIH uniquely supports delta updates if the source data has modification dates.
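For reference, a minimal Solr-XML add command looks like the following (field names are illustrative and must match your schema.xml):

```xml
<add>
  <doc>
    <field name="id">1001</field>
    <field name="name">ipad2 white 16gb</field>
    <field name="price">499.0</field>
  </doc>
</add>
```

POST this to /update, follow it with a `<commit/>`, and the document becomes searchable.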
Lucene/Solr Indexing
[Diagram: Lucene/Solr indexing paths]
• Solr-XML documents are HTTP POSTed to /update (XML Update Handler)
• CSV goes to /update/csv (CSV Update Handler)
• /update/xml handles XML updates with a custom processor chain
• Rich documents (PDF, Word, …) are HTTP POSTed to /update/extract (Extracting RequestHandler)
• RSS feeds and SQL databases are pulled by the Data Import Handler (simple transforms)
• Each handler runs an Update Processor Chain (remove duplicates, custom transform, logging, index processors), then the text index analyzers, and finally the documents are written to the Lucene index via IndexWriter
Indexing Data - Schema


   schema.xml
Advanced
• Chinese Word Segmentation ( 中文分词 )
• DIH (Data Import Handler)
• Sharding
• Replication
• Performance Tuning
Chinese Word Segmentation ( 中文分词 )
Chinese Word Segmentation ( 中文分词 )
Chinese Word Segmentation ( 中文分词 )
IKAnalyzer3.2.8.jar
Chinese Word Segmentation ( 中文分词 )


  For the underlying principles, see 《解密搜索引擎技术实战》.
DIH (Data Import Handler)
Most applications store data in relational databases or XML files, and searching over such data is a common use case.

The DataImportHandler is a Solr contrib module that provides a configuration-driven way to import this data into Solr, both as "full builds" and as incremental delta imports.




[Diagram] MySQL → (JDBC / DIH) → Solr, via full-import or delta-import.
DIH (Data Import Handler)
•   Imports data from databases through JDBC (Java Database Connectivity)

•   Imports XML data from a URL (HTTP GET) or a file

•   Can combine data from different tables or sources in various ways

•   Extraction/Transformation of the data

•   Import of updated (delta) data from a database, assuming a last-
    updated date

•   A diagnostic/development web page

•   Extensible to support alternative data sources and transformation steps
DIH (Data Import Handler)
• curl http://localhost:8983/solr/dataimport — to verify the configuration.
• curl 'http://localhost:8983/solr/dataimport?command=full-import'
• curl 'http://localhost:8983/solr/dataimport?command=delta-import'
DIH (Data Import Handler) - Full Import Example 完全索引
data-config.xml
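The slide's data-config.xml is not reproduced here; a minimal full-import configuration might look like the following sketch (driver, connection, table, and column names are illustrative):

```xml
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/shop" user="solr" password="secret"/>
  <document>
    <!-- One Solr document per row of the query result -->
    <entity name="item" query="SELECT id, name, price FROM item">
      <field column="id"    name="id"/>
      <field column="name"  name="name"/>
      <field column="price" name="price"/>
    </entity>
  </document>
</dataConfig>
```

Running command=full-import then rebuilds the whole index from the query's result set.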
DIH (Data Import Handler) - Delta Import Example 增量索引
data-config.xml
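The delta configuration from the slide is likewise not reproduced; a delta import typically adds deltaQuery (which ids changed since the last run) and deltaImportQuery (how to fetch each changed row), as in this illustrative sketch (connection and column names are assumptions; a last_modified column must exist):

```xml
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/shop" user="solr" password="secret"/>
  <document>
    <entity name="item"
            query="SELECT id, name, price FROM item"
            deltaQuery="SELECT id FROM item
                        WHERE last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, name, price FROM item
                              WHERE id = '${dataimporter.delta.id}'"/>
  </document>
</dataConfig>
```

command=delta-import then re-indexes only the rows modified since the previous import.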
DIH (Data Import Handler) - Demo

     Linux aaa 2.6.18-243.el5 #1 SMP Mon Feb 7 18:47:27 EST
     2011 x86_64 x86_64 x86_64 GNU/Linux
     Intel(R) Xeon(R) CPU       E5620 @ 2.40GHz
     cpu cores    :1

     MemTotal:    2058400 kB



     2 million rows imported in about 20 minutes.
Sharding
• Sharding is the process of breaking a single logical index horizontally, across records, rather than breaking it up vertically by entities.




                S1    S2   S3   S4
Sharding-Indexing
SHARDS =
  ['http://server1:8983/solr/',
   'http://server2:8983/solr/']

unique_id = document[:id]
if unique_id.hash % SHARDS.size == local_thread_id
  # index this document to the local shard
end
Sharding-Query
The ability to search across shards is built into the query request handlers. You do not need to do any special configuration to activate it.
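Concretely, a distributed query is an ordinary /select request with a shards parameter listing the shard cores; the handler fans the query out and merges the partial results. A sketch with hypothetical hosts:

```python
# Building a distributed (sharded) Solr query URL.
# Host names are illustrative; "shards" lists host:port/core entries.
from urllib.parse import urlencode

shards = ["server1:8983/solr", "server2:8983/solr"]
params = {
    "q": "searchKeyword:ipad2",
    "shards": ",".join(shards),   # fan the query out to both shards
}
url = "http://server1:8983/solr/select?" + urlencode(params)
print(url)
```

Any shard (or a separate aggregator core) can receive the request; it queries all listed shards and merges the results.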
Replication
  [Diagram] one Master replicating its index to several Slaves
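In Solr 1.4-era setups, replication is configured through the ReplicationHandler in solrconfig.xml; a sketch of the master and slave sides (the master host name and conf file list are illustrative):

```xml
<!-- On the master (solrconfig.xml) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- On each slave (solrconfig.xml) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```

Slaves poll the master at the given interval and pull down new index generations after each commit.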
Combining replication and sharding

[Diagram] Sharded masters M1, M2, M3 at the top; each master replicates its shard to slaves S1, S2, S3 in Slave Pool 1 and Slave Pool 2. Queries are sent to the pools of slave shards.
Combining replication and sharding




              http://wiki.apache.org/solr/SolrCloud
              http://zookeeper.apache.org/doc/r3.3.2/zookeeperOver.html
Performance Tuning
• JVM
• HTTP cache
• Solr cache
• Better schema
• Better indexing strategy
Solr Caching
• Caching is a key part of what makes Solr fast and scalable
• There are a number of different caches configured in solrconfig.xml:
  – filterCache
  – queryResultCache
  – documentCache
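These caches are declared in the `<query>` section of solrconfig.xml; the sizes and autowarm counts below are illustrative starting points, not recommendations:

```xml
<filterCache      class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache"     size="512" initialSize="512" autowarmCount="32"/>
<documentCache    class="solr.LRUCache"     size="512" initialSize="512"/>
```

autowarmCount controls how many entries are pre-populated from the old cache when a new searcher opens after a commit.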
More Info
• 《Solr 1.4 Enterprise Search Server》
• http://wiki.apache.org/solr/
• http://solr.pl/en/
• 《解密搜索引擎技术实战》
Thank you!

Editor's Notes

  • #45: SolrCloud is a distributed search solution based on Solr and ZooKeeper, and one of the core components of the then-in-development Solr 4.0. Its main idea is to use ZooKeeper as the cluster's configuration-information center. Notable features: 1) centralized configuration; 2) automatic failover; 3) near-real-time search; 4) automatic load balancing of queries.