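Introduction to Lucene
A Java-based full-text information retrieval toolkit
Lucene is not a complete full-text search engine. It is a full-text
indexing engine toolkit written in Java that can easily be embedded in
applications to give them full-text indexing and search.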
Lucene History
Created by Doug Cutting in 1999
Donated to Apache in 2001
www.nfschina.com
Powered by Lucene
Technorati
Wikipedia
Internet Archive
LinkedIn
monster.com
Eclipse help search
Jira
SouFun
IBM Omnifind Y! Edition
…
http://wiki.apache.org/lucene-java/PoweredBy
Lucene Sub-projects
Nutch
Web crawler with document parsing
Hadoop
Distributed data processor
Implements MapReduce
Solr
Lucene Search Mechanism Diagram
Lucene Architecture: High Level
Main Packages
org.apache.lucene.index
org.apache.lucene.search
org.apache.lucene.analysis
org.apache.lucene.document
org.apache.lucene.queryParser
org.apache.lucene.store
org.apache.lucene.util
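Lucene Query Parser: Fields
Lucene supports fields. You can search within a specific field or
use the default field. The field names and the default field are determined
by the specific indexer implementation.
Usage: field name + ":" + search term.
Example:
Suppose a Lucene index contains two fields, title and text, and text is the default field. To find
the document titled "The Right Way" that contains "don't go this way", you can enter:
title:"The Right Way" AND text:go
or
title:"Do it right" AND right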
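The field syntax above can be sketched as a small query-string builder. This is only an illustration: the helper name field_clause is hypothetical, and it does not escape quotes or other special characters inside a term.

```python
def field_clause(term, field=None):
    """Return `field:term`, quoting multi-word terms as a phrase."""
    # A term containing a space is quoted so it is treated as one phrase.
    text = f'"{term}"' if " " in term else term
    # Without a field name the clause targets the default field.
    return f"{field}:{text}" if field else text


query = " AND ".join([
    field_clause("The Right Way", field="title"),
    field_clause("go", field="text"),
])
print(query)  # title:"The Right Way" AND text:go
```

Omitting the field argument, as in field_clause("right"), leaves the term unqualified so that it is searched in the default field.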
What Is Solr
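Solr is a Java search server built on Lucene. Solr provides faceted
search and hit highlighting, and supports multiple output formats (including
XML/XSLT and JSON). It is easy to install and configure, and it ships with
an HTTP-based administration interface.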
Solr History
Developed by Yonik Seeley at CNET
Donated to Apache in 2006
Solr Features
Servlet
Web Administration Interface
XML/HTTP, JSON Interfaces
Faceting
Schema to define types and fields
Highlighting
Caching
Index Replication (Master / Slaves)
Pluggable
Java 5
Powered by Solr
Netflix
CNET
AOL: sports and music
Shopper.com
Drupal module
GameSpot
Reddit
Instructables
http://news.com
…
http://wiki.apache.org/solr/PublicServers
Solr Admin Interface Features
Show Config, Schema, Distribution info
Query Interface
Statistics
Caches: lookups, hits, hitratio, inserts, evictions, size
RequestHandlers: requests, errors
UpdateHandler: adds, deletes, commits, optimizes
IndexReader: open-time, index-version, numDocs, maxDocs
Analysis Debugger
Shows tokens after each Analyzer stage
Shows token matches for query vs index
Schema
Lucene has no notion of a schema
Sorting - string vs. numeric
Ranges - val:42 included in val:[1 TO 5] ?
Lucene QueryParser has date-range support, but must
guess.
Defines fields, their types, properties
Defines unique key field, default search field,
Similarity implementation
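The sorting and range questions above come down to whether a value is compared as a string or as a number. A minimal sketch in plain Python, with no Lucene or Solr involved, shows how the two interpretations disagree:

```python
values = ["5", "42", "1", "100"]

# Compared as strings, values are ordered character by character.
as_strings = sorted(values)
# Compared as integers, values are ordered by magnitude.
as_numbers = sorted(values, key=int)

print(as_strings)  # ['1', '100', '42', '5']
print(as_numbers)  # ['1', '5', '42', '100']

# Is val:42 included in val:[1 TO 5]?
in_range_as_string = "1" <= "42" <= "5"
in_range_as_number = 1 <= 42 <= 5
print(in_range_as_string)  # True
print(in_range_as_number)  # False
```

A schema that declares the field type removes the need to guess which comparison is intended.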
Solr-Add Operation
HTTP POST to http://localhost:8080/solr/update/
<add>
<doc>
<field name="employeeId">05991</field>
<field name="office">Bridgewater</field>
<field name="skills">Perl</field>
<field name="skills">Java</field>
</doc>
[<doc> ... </doc>[<doc> ... </doc>]]
</add>
Documents or fields can have boosts attached
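As a rough sketch, the add message above could be generated and wrapped in an HTTP POST using only the Python standard library. The function names are hypothetical, the Content-Type header is an assumption, and the request is only prepared, never sent.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build an <add> message from a list of {field: value-or-list} dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A multi-valued field becomes one <field> element per value.
            values = value if isinstance(value, list) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")


def build_update_request(xml_body, url="http://localhost:8080/solr/update/"):
    """Prepare (but do not send) the POST request carrying the XML message."""
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},  # assumed header
        method="POST",
    )


body = build_add_xml([
    {"employeeId": "05991", "office": "Bridgewater", "skills": ["Perl", "Java"]},
])
request = build_update_request(body)
print(body)
```

Passing the prepared request to urllib.request.urlopen would send it, assuming a Solr instance is listening at that URL.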
Solr-Update / Delete Operations
Inserting a document whose uniqueKey is already present
replaces the original document
Deleting
By uniqueKey field
<delete><id>05991</id></delete>
By query
<delete><query>name:Anthony</query></delete>
<commit/>
<optimize/>
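The delete messages above are short enough to build with string formatting, as long as the values are XML-escaped. This is a minimal sketch; the helper names are hypothetical.

```python
from xml.sax.saxutils import escape


def delete_by_id(unique_key):
    """<delete> message for the document with this uniqueKey value."""
    return f"<delete><id>{escape(str(unique_key))}</id></delete>"


def delete_by_query(query):
    """<delete> message for every document matching the query."""
    return f"<delete><query>{escape(query)}</query></delete>"


messages = [
    delete_by_id("05991"),
    delete_by_query("name:Anthony"),
    "<commit/>",  # posted last so the deletions become visible to searches
]
for message in messages:
    print(message)
```

Each message would be posted to the update URL separately, in the same way as an add message.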
Default Parameters
Query Arguments for HTTP GET/POST to /select
param   default    description
q                  The query
start   0          Offset into the list of matches
rows    10         Number of documents to return
fl      *          Stored fields to return
qt      standard   Query type; maps to query handler
df      (schema)   Default field to search
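The parameters in the table above can be assembled into a /select URL with the standard library. In this sketch the helper name select_url and the base URL (taken from the search example in this deck) are assumptions; df is left out because its default comes from the schema.

```python
from urllib.parse import urlencode

# Defaults from the parameter table; "q" has no default value.
DEFAULTS = {"start": 0, "rows": 10, "fl": "*", "qt": "standard"}


def select_url(q, base="http://localhost:8983/solr/select", **overrides):
    """Build a /select URL, letting keyword arguments override the defaults."""
    params = {"q": q, **DEFAULTS, **overrides}
    return f"{base}?{urlencode(params)}"


url = select_url("ipod", rows=5)
print(url)
# http://localhost:8983/solr/select?q=ipod&start=0&rows=5&fl=%2A&qt=standard
```

Parameters left at their defaults could also simply be omitted from the URL, since the server applies them anyway.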
Solr-Search Example
http://localhost:8983/solr/select?q=ipod&rows=0&facet=true&facet.limit=-1&facet.field=cat&facet.mincount=1&facet.field=inStock
<response>
  <responseHeader>
    <status>0</status>
    <QTime>3</QTime>
  </responseHeader>
  <result numFound="4" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="cat">
        <int name="music">1</int>
        <int name="connector">2</int>
        <int name="electronics">3</int>
      </lst>
      <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
      </lst>
    </lst>
  </lst>
</response>
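A client has to turn this XML into something it can use. The sketch below parses an abbreviated copy of the response above with the standard library; the function name facet_counts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the response shown above.
RESPONSE = """<response>
<result numFound="4" start="0"/>
<lst name="facet_counts">
<lst name="facet_fields">
<lst name="cat">
<int name="music">1</int>
<int name="connector">2</int>
<int name="electronics">3</int>
</lst>
<lst name="inStock">
<int name="false">3</int>
<int name="true">1</int>
</lst>
</lst>
</lst>
</response>"""


def facet_counts(xml_text):
    """Return {field: {value: count}} from the facet_fields section."""
    root = ET.fromstring(xml_text)
    fields = root.find("./lst[@name='facet_counts']/lst[@name='facet_fields']")
    return {
        field.get("name"): {i.get("name"): int(i.text) for i in field.findall("int")}
        for field in fields.findall("lst")
    }


counts = facet_counts(RESPONSE)
print(counts["cat"])      # {'music': 1, 'connector': 2, 'electronics': 3}
print(counts["inStock"])  # {'false': 3, 'true': 1}
```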
Solr-Search Faceting
Faceting
Available in StandardRequestHandler and
DisMaxRequestHandler
Faceted Browsing
[Diagram] Search(Query, Filter[], Sort, offset, n) returns a DocList (a section of the ordered results) and a DocSet (the unordered set of all results); both feed the Query Response.
Search inputs shown: computer_type:PC, memory:[1GB TO *], computer, price asc
Facet counts, each the intersection Size() with the DocSet:
proc_manu:Intel = 594
proc_manu:AMD = 382
price:[0 TO 500] = 247
price:[500 TO 1000] = 689
manu:Dell = 104
manu:HP = 92
manu:Lenovo = 75
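The counting step in the diagram, intersecting the set of all matching documents with the set for each facet and taking the size, can be sketched with Python sets. The document ids are made up for illustration, and this is not Solr's actual DocSet implementation.

```python
def intersection_counts(doc_set, facet_sets):
    """For each facet, count the documents it shares with doc_set."""
    return {name: len(doc_set & ids) for name, ids in facet_sets.items()}


# Hypothetical document ids; real sets would come from the index.
doc_set = {1, 2, 3, 4, 5, 6}  # unordered set of all documents matching the search
facet_sets = {
    "proc_manu:Intel": {1, 2, 3, 9},
    "proc_manu:AMD": {4, 5, 10},
    "manu:Dell": {2, 4, 11},
}

counts = intersection_counts(doc_set, facet_sets)
print(counts)  # {'proc_manu:Intel': 3, 'proc_manu:AMD': 2, 'manu:Dell': 2}
```

Documents 9, 10 and 11 match a facet but not the search, so they do not contribute to any count.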
Faceted Browsing
Example
Caching
IndexSearcher’s view of an index is fixed
Aggressive caching possible
Consistency for multi-query requests
filterCache – unordered set of document ids matching a query
resultCache – ordered subset of document ids matching a query
documentCache – the stored fields of documents
userCaches – application specific, custom query handlers
Warming for Speed
Lucene IndexReader warming
field norms, FieldCache, tii – the term index
Static Cache warming
Configurable static requests to warm new Searchers
Smart Cache Warming (autowarming)
Using MRU items in the current cache to pre-populate the
new cache
Warming in parallel with live requests
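The autowarming idea, re-running the most recently used entries of the current cache against the new searcher, can be sketched with a small LRU cache. The class, the function and the dictionary "searchers" below are all hypothetical stand-ins, not Solr's cache classes.

```python
from collections import OrderedDict


class LRUCache:
    """Small LRU cache; the most recently used key sits at the end."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used


def autowarm(old_cache, new_cache, search, count):
    """Re-run the `count` most recently used keys against the new searcher."""
    recent = list(old_cache.items)[-count:] if count > 0 else []
    for key in recent:
        new_cache.put(key, search(key))


# Hypothetical searchers: each maps a query string to matching document ids.
old_index = {"ipod": [1, 2], "nano": [2], "dock": [3]}
new_index = {"ipod": [1, 2, 7], "nano": [2, 8], "dock": [3]}

old_cache = LRUCache(capacity=3)
for q in ("dock", "ipod", "nano"):
    old_cache.put(q, old_index[q])

new_cache = LRUCache(capacity=3)
autowarm(old_cache, new_cache, new_index.get, count=2)
warmed = list(new_cache.items.items())
print(warmed)  # [('ipod', [1, 2, 7]), ('nano', [2, 8])]
```

Only the keys are carried over. The values are recomputed against the new index, so the warmed cache never holds stale results.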
Configuring Relevancy
<fieldtype name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt"/>
    <filter class="solr.StopFilterFactory"
            words="stopwords.txt"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
  </analyzer>
</fieldtype>
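To show what such an analyzer chain does to text, here is a simplified sketch of the first four stages in plain Python. It is an illustration only: the synonym and stop-word tables are hypothetical stand-ins for synonyms.txt and stopwords.txt, synonyms are limited to one-to-one replacements, and the stemming stage is left out.

```python
def analyze(text, synonyms, stopwords):
    """Run text through a whitespace tokenizer and three token filters."""
    tokens = text.split()                               # whitespace tokenizer
    tokens = [t.lower() for t in tokens]                # lower-case filter
    tokens = [synonyms.get(t, t) for t in tokens]       # synonym filter (1:1 only)
    tokens = [t for t in tokens if t not in stopwords]  # stop-word filter
    return tokens


# Hypothetical stand-ins for synonyms.txt and stopwords.txt.
synonyms = {"tv": "television"}
stopwords = {"the", "a", "of"}

tokens = analyze("The TV of Anthony", synonyms, stopwords)
print(tokens)  # ['television', 'anthony']
```

The order of the stages matters: lower-casing runs first so that the synonym and stop-word tables only need lower-case entries.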
copyField
Copies one field to another at index time
Use case: Analyze the same field in different ways
copy into a field with a different analyzer
boost exact-case, exact-punctuation matches
language translations, thesaurus, soundex
<field name="title" type="text"/>
<field name="title_exact" type="text_exact" stored="false"/>
<copyField source="title" dest="title_exact"/>
Use case: Index multiple fields into a single searchable
field
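Both use cases amount to the same index-time step: for each copy rule, the source field's values are appended to the destination field. A minimal sketch follows, in which the function name and the author and all fields are hypothetical.

```python
def apply_copy_fields(doc, copy_rules):
    """Return a new document with each (source, dest) copy rule applied."""
    result = {name: list(values) for name, values in doc.items()}
    for source, dest in copy_rules:
        # The destination accumulates values, so several sources can feed one field.
        result.setdefault(dest, []).extend(doc.get(source, []))
    return result


doc = {"title": ["The Right Way"], "author": ["Anthony"]}
rules = [("title", "title_exact"), ("title", "all"), ("author", "all")]

indexed = apply_copy_fields(doc, rules)
print(indexed["title_exact"])  # ['The Right Way']
print(indexed["all"])          # ['The Right Way', 'Anthony']
```

The copied value is the original input. Each destination field then applies its own analyzer, which is what lets title and title_exact behave differently.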
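High Availability
[Diagram] Components and labeled flows:
Appservers (Dynamic HTML Generation) send HTTP search requests through the Load Balancer to the Solr Searchers
Solr Master feeds the Solr Searchers through Index Replication
Updater reads the DB and sends updates to the Solr Master
Admin terminal sends admin queries and updates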
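Replication
[Diagram] Master and Searcher each hold Lucene index segments in solr/data/index; a new segment appears on the Master.
1. hard links: Master solr/data/index to solr/data/snapshot-2006062950000
2. hard links: Searcher solr/data/index to solr/data/snapshot-2006062950000-WIP
3. rsync: Master snapshot to the Searcher's -WIP directory
4. mv dir: after rsync, the -WIP directory is renamed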
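The hard-link and rename steps can be tried out locally. The sketch below is a single-machine simplification with hypothetical names: it snapshots a directory of index files through hard links into a -WIP directory and then renames it into place. The rsync transfer between machines is not shown, and this is not the actual Solr replication script.

```python
import os
import tempfile


def snapshot(index_dir, snapshot_dir):
    """Create snapshot_dir as a directory of hard links to the files in index_dir."""
    wip = snapshot_dir + "-WIP"
    os.mkdir(wip)
    for name in os.listdir(index_dir):
        # A hard link shares the file's data, so no bytes are copied.
        os.link(os.path.join(index_dir, name), os.path.join(wip, name))
    # Renaming the finished directory publishes the snapshot in one step.
    os.rename(wip, snapshot_dir)
    return snapshot_dir


with tempfile.TemporaryDirectory() as root:
    index = os.path.join(root, "index")
    os.mkdir(index)
    with open(os.path.join(index, "segment_1"), "w") as f:
        f.write("segment data")

    snap = snapshot(index, os.path.join(root, "snapshot-1"))
    names = sorted(os.listdir(snap))
    same_file = os.path.samefile(
        os.path.join(index, "segment_1"), os.path.join(snap, "segment_1"))
    wip_left = os.path.exists(snap + "-WIP")
    print(names, same_file, wip_left)  # ['segment_1'] True False
```

Because the snapshot holds links rather than copies, it is cheap to create. The sketch assumes index files are never modified in place, only added or deleted.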
Resources
WWW
http://wiki.apache.org/solr/
http://www.ibm.com/developerworks/cn/java/j-solr1/
http://www.ibm.com/developerworks/cn/java/j-solr2/
http://www.xml.com/pub/a/2006/08/09/solr-indexing-xml-with-lucene-andrest.html?page=1
http://lucene.apache.org/java/docs/queryparsersyntax.html
http://www.blogjava.net/RongHao/archive/2007/11/06/158621.html
Mailing Lists
solr-user-subscribe@lucene.apache.org
solr-dev-subscribe@lucene.apache.org