20150210 Solr introduction

• What is Solr
• Solr vs. RDBMS
• SolrCloud Architecture
• Docs, Fields & Schema Design
• Searching
• Faceting
• Submitting Docs
• Deleting Docs
• Import From DB (a test)

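• Solr is a full-text search system. Given the content of a large number of
documents, it accepts keywords made of arbitrary words combined with logical
operators (AND, OR, NOT) and quickly searches the document text. Results can
be ranked by a score of how well each document matches, or grouped by
document metadata, to support further statistics, analysis, and aggregation.
Typical data for full-text search includes news, reports, journals, books,
and website content.
• Solr takes an index-based approach: document content is first split into
word units (tokens), and these tokens are then stored in an index file, using
data structures such as a hash table or B+ tree, that records the document ID
and the positions at which each token occurs. At query time, the system
analyzes the input string into tokens in the same way, looks each token up in
the index, combines the results with the logical operators given in the
query, computes a weight for each result from factors such as how often the
token occurs in the document, and finally returns the results in sorted order.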
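The index-then-query flow described above can be sketched as a toy inverted index. This is a simplified illustration, not Lucene's actual implementation: the regex tokenizer, the AND-only query, the term-count score, and all function names are assumptions of this sketch.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase word tokens (a stand-in for a real analyzer)."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """Map each token to {doc_id: [positions of the token in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(tokenize(text)):
            index[token].setdefault(doc_id, []).append(pos)
    return index

def search_and(index, query):
    """Return ids of documents containing every query token, most occurrences first."""
    tokens = tokenize(query)
    if not tokens:
        return []
    postings = [index.get(token, {}) for token in tokens]
    hits = set(postings[0])
    for posting in postings[1:]:
        hits &= set(posting)  # logical AND across query tokens

    def score(doc_id):
        # Weight: total number of occurrences of the query tokens in the document.
        return sum(len(posting[doc_id]) for posting in postings)

    return sorted(hits, key=lambda doc_id: (-score(doc_id), doc_id))

docs = {
    1: "Solr is a search server built on Lucene",
    2: "Lucene is a Java search library",
    3: "Search, search and more search with Solr",
}
index = build_index(docs)
print(search_and(index, "solr search"))
```

Storing positions as well as document ids is what makes phrase queries and highlighting possible later on; this sketch only uses the counts.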

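• Some database systems, such as MS SQL and Oracle, have simple built-in
full-text search. For more complex, customized full-text search they fall
short, and a dedicated full-text search engine is needed. Among open-source
projects, Lucene, with ten years of history, is a good choice. Its drawback
is that it is delivered as a library: you must write Java programs to use it,
its feature set is intricate, the learning curve is long, and it is not easy
to start implementing with it.
• Solr uses the Lucene library as its core for building full-text indexes and
executing searches, and exposes it as a web service over HTTP so that
programs in any language can call it. Solr provides powerful configuration
files: without writing any Java code, it can be configured for general
full-text search use; for special requirements, extensions can be written
following its plugin architecture.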

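• Solr's main features include full-text search, hit highlighting, faceted
search, dynamic clustering, database integration, rich document (e.g., Word,
PDF) handling, and geospatial search. Solr has a highly scalable
architecture and provides distributed search and index replication. Many
large websites use Solr to provide search and navigation.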

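Solr has the following features:
• Data schema defined in a configuration file, including numeric types, dynamic fields, and a unique key
• Extensions to the Lucene query syntax
• Faceted search and filtering to narrow results
• Geospatial search
• Text analysis (tokenizing) and filtering (stemming, stop words) of source data, set up purely through configuration files
• Configurable search result caches
• Search performance optimizations
• System configuration files in XML format
• An administration interface
• System logs for monitoring
• Fast incremental updates and index replication
• Coordinated distributed search across multiple hosts
• Source data for indexing can be JSON, XML, CSV/delimited text, or binary formats
• Can pull data from databases or XML files, from local disk or HTTP sources, for indexing
• Rich document (PDF, Word, HTML, etc.) parsing and indexing with Apache Tika
• Support for multiple indexes
• Multilingual text analysis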

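• Solr packages the full-text search server so that it runs after editing just
two XML text files: the system configuration and the schema definition. From
an MVC point of view, Solr already provides the Model; the application only
needs to focus on the View (screen layout and UI design) and the Controller
(composing the HTTP URL parameters of each request and parsing the XML/JSON
content of the response). Solr greatly reduces the complexity of developing
a full-text search system.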

• Solr is not meant to entirely replace your RDBMS but rather complement it
• One of the things that Solr does best is to answer a question such as: “What
is a list of the most relevant documents or fields, that possibly match query
‘XYZ’? ” In this case, your “XYZs” might match a bunch of documents or
records with fields that Solr may have tokenized, stored, queried and/or
ranked to produce a list of result documents.
• Solr also provides the ability to quickly filter results by facet fields enriching
the search experience so that your users can narrow the list to find the right
item or set of items based on faceted fields.

• By contrast, the RDBMS in its classic implementation is meant to answer
questions such as: exact match queries, e.g., “give me all records in my users
table with a creation date after Oct 1, 2009”; or, reporting-related queries,
like “what is the average file size of images uploaded to my photo site
grouped by user and date”
• There is one other thing the RDBMS does quite well: efficiently executing a
series of inserts and updates for a transaction, rolling back if one of those
operations failed (also known as ACID properties: Atomicity, Consistency,
Isolation, Durability).
• The best way to think about Solr is that it’s a quickly searchable view of your
data. A well-designed application can use the best of both these
approaches, using Solr to help users find the most relevant documents and
then the RDBMS to query for more precise additional information to
better present the results to the end user.

[Figure: SolrCloud architecture diagram. Labels: collection, shard1, three
shard leaders, ZooKeeper, config set (logical layer); nodes 192.xxx.xxx.111,
192.xxx.xxx.222, 192.xxx.xxx.333 (physical nodes).]
Solr Terminology | Description
Collection | A complete logical index in a SolrCloud cluster. It is associated with a config set and is made up of one or more shards.
Config Set | A set of config files necessary for a collection to function properly. At minimum this consists of solrconfig.xml and schema.xml but, depending on the contents of those two files, it may include other files.

Shard | A logical piece (or slice) of a collection. Each shard is made up of one or more replicas. An election is held (by ZooKeeper) to determine which replica is the leader.
Replica | One copy of a shard. One replica of each shard is elected to be the leader.

Shard Leader | The shard replica that has won the leader election. Elections can happen at any time, but normally they are only triggered by events like a Solr instance going down. When documents are indexed, SolrCloud forwards them to the leader of the shard, and the leader distributes them to all the shard replicas.

ZooKeeper | SolrCloud requires ZooKeeper to handle leader elections. It is recommended that ZooKeeper be standalone, installed separately from Solr. A majority of servers is needed to provide service (e.g., 5 ZooKeeper servers are needed to allow for the failure of up to 2 servers at a time). ZooKeeper can run on the same hardware as Solr, and many users do run it on the same hardware.
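The majority rule can be checked with a little arithmetic. The sketch below is illustrative only, and the function names are made up for this example:

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)

for n in (3, 4, 5, 6):
    print(n, "servers: quorum", quorum_size(n), "tolerates", tolerated_failures(n))
```

An even-sized ensemble tolerates no more failures than the next smaller odd size, which is why odd sizes such as 3 or 5 are the usual choice.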

[Figure: screenshot of a query response with callouts: "This is a query
response", "This is a Solr doc", "These are fields defined in schema.xml".]

• The content of schema.xml looks roughly like this:
<schema name="your name" version="x.x">
  <types> </types>          <!-- field types -->
  <fields> </fields>        <!-- fields -->
  <uniqueKey> </uniqueKey>  <!-- a field -->
</schema>

• The attribute "name" is the name of this schema and is only used for display
purposes.
• version="x.y" is Solr's version number for the schema syntax and semantics.
It should not normally be changed by applications.
• The element "uniqueKey" names the field used to determine and enforce
document uniqueness. It is not required, but leave it out only if you have a
good reason.

• An example:
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- an analyzer chain needs a tokenizer before its filters -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

• General Properties:
Property | Description | Values
name | The name of the fieldType. It is strongly recommended that names consist of alphanumeric or underscore characters only and not start with a digit. |
class | The class name that gets used to store and index the data for this type. |
indexed | If true, this field is searchable, sortable, and facetable. | true/false
stored | If true, the actual value of the field can be retrieved by queries. | true/false
required | Instructs Solr to reject any attempt to add a document that does not have a value for this field. This property defaults to false. | true/false
multiValued | If true, indicates that a single document might contain multiple values for this field type. | true/false

• General Properties (continued):
Property | Description | Values
positionIncrementGap | For multivalued fields, specifies a distance between multiple values, which prevents spurious phrase matches. | Integer
autoGeneratePhraseQueries | For text fields. If true, Solr automatically generates phrase queries for adjacent terms. If false, terms must be enclosed in double quotes to be treated as phrases. | true/false
docValues | If true, the value of the field is put in a column-oriented DocValues structure (a performance boost for sorting, faceting, and highlighting). | true/false
sortMissingFirst, sortMissingLast | If sortMissingLast="true", a sort on this field causes documents without the field to come after documents with the field, regardless of the requested sort order (asc or desc). | true/false

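<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
• This is a general text field with reasonable, generic cross-language defaults: it
tokenizes with StandardTokenizer, removes stop words listed in the case-insensitive
"stopwords.txt" (empty by default), and lowercases.
• At query time only, it also applies synonyms.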

• An example:
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<dynamicField name="*_is" type="int" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true" />
<dynamicField name="*_ss" type="string" indexed="true" stored="true" multiValued="true"/>
• Dynamic fields allow Solr to index fields that you did not explicitly define in your
schema. This is useful if you discover you have forgotten to define one or more fields.
(Note that changing the schema in Solr is very costly!)
• A dynamic field is just like a regular field except that its name contains a wildcard.
• For example, suppose your schema includes a dynamic field with a name of *_i.
If you attempt to index a document with a cost_i field, but no explicit cost_i field is
defined in the schema, then the cost_i field will have the field type and analysis
defined for *_i.
• In practice, we put dynamic fields for all data types in schema.xml for the greatest
schema flexibility.
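The lookup rule for dynamic fields can be sketched as follows. This is a simplified model rather than Solr's code: the explicit field table is hypothetical, and taking the first matching pattern in declaration order is an assumption of this sketch (the four patterns used here never overlap, so the order does not matter for them).

```python
from fnmatch import fnmatchcase

# Hypothetical explicit fields, standing in for <field> entries in schema.xml.
EXPLICIT_FIELDS = {"id": "string", "name": "text_general"}

# The dynamicField patterns from the example, as (pattern, field type).
DYNAMIC_FIELDS = [
    ("*_i", "int"),
    ("*_is", "int"),
    ("*_s", "string"),
    ("*_ss", "string"),
]

def resolve_field_type(field_name):
    """Return the field type for a name: explicit fields first, then dynamic patterns."""
    if field_name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[field_name]
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None  # no match: the field is unknown to the schema

print(resolve_field_type("cost_i"))
print(resolve_field_type("price"))
```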

• An example:
<copyField source="cat" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="manu" dest="text"/>
<copyField source="features" dest="text"/>
• A copyField copies one field to another at the time a document is added to the index.
In the above example, the content of four fields (cat, name, manu, features) is copied
to the field text.
• When to use copyField?
1. If you want to provide a default search field, so that a query against that field
effectively searches several fields at the same time.
2. If you want to send the same data to two different fields at the same time, for
example:
<copyField source="author" dest="author_s"/>
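The effect of the four copyField rules can be imitated on a plain dictionary. This is only a sketch of the idea: the sample document and the function name are invented for illustration, and the destination is modelled as a list because several sources feed one field.

```python
# The copyField rules from the example, as (source, destination) pairs.
COPY_FIELDS = [("cat", "text"), ("name", "text"), ("manu", "text"), ("features", "text")]

def apply_copy_fields(doc, copy_fields=COPY_FIELDS):
    """Return a new document with each source value appended to its destination field."""
    result = dict(doc)
    for source, dest in copy_fields:
        if source not in doc:
            continue  # nothing to copy for this rule
        existing = result.get(dest, [])
        if not isinstance(existing, list):
            existing = [existing]
        result[dest] = existing + [doc[source]]  # the original value is copied as-is
    return result

doc = {"id": "1", "name": "iPod", "manu": "Apple", "cat": "electronics"}
indexed = apply_copy_fields(doc)
print(indexed["text"])
```

A query against the single field text then reaches the category, name, and manufacturer values without listing each field.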

[Figure: Solr admin query screen with callouts: input query syntax here;
sort as "field asc|desc"; range of results; fields to be returned; return
format; search result.]

• fq: This parameter specifies a filter query that restricts the superset of
documents that can be returned, without influencing the score.
• It can be very useful for speeding up complex queries, since the queries
specified with fq are cached independently of the main query. Caching
means that a later query using the same filter can reuse the cached result.
• An example:
http://localhost:8983/solr/select?
q=cars
&fq=color:black
&fq=model:Lamborghini
&fq=year:[2014 TO *]
• By default, Solr resolves all of the filters before the main query. Each filter
query is looked up individually in Solr’s filterCache.
• The following parameters are used for spatial search:
Parameter | Description
d | The radial distance, in kilometers
pt | The center point, in the format "lat,lon" (latitude, longitude)
sfield | A spatially indexed field
• geofilt: For example, to find all documents within five kilometers of a given
lat/lon point, you could enter
&q=*:*&fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5

• bbox: very similar to geofilt except it uses the bounding box of
the calculated circle.
• The rectangular shape is faster to compute, so it is sometimes
used as an alternative to geofilt when it is acceptable to return
points outside of the radius. Here's a sample query:
&q=*:*&fq={!bbox sfield=store}&pt=45.15,-93.85&d=5

• geodist: a distance function that takes three optional
parameters: (sfield,latitude,longitude). You can use the geodist
function to sort results by distance or to return the distance as the score.
• For example, to sort your results by ascending distance, enter
&q=*:*&fq={!geofilt}&sfield=store&pt=45.15,-93.85&d=50&sort=geodist() asc
• To return the distance as the document score, enter
&q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc
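The difference between geofilt (circle), bbox (bounding box), and geodist (distance) can be illustrated with the haversine formula. This is an approximation written for illustration, not Solr's implementation: the Earth radius constant, the sample points, and the function names are assumptions of this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def in_circle(point, center, d_km):
    """geofilt-style test: is the point within d_km of the center?"""
    return haversine_km(center[0], center[1], point[0], point[1]) <= d_km

def in_bounding_box(point, center, d_km):
    """bbox-style test: is the point inside the box that encloses the circle?"""
    lat_delta = math.degrees(d_km / EARTH_RADIUS_KM)
    lon_delta = lat_delta / math.cos(math.radians(center[0]))
    return (abs(point[0] - center[0]) <= lat_delta
            and abs(point[1] - center[1]) <= lon_delta)

center = (45.15, -93.85)   # the pt value used in the examples above
near = (45.16, -93.85)     # roughly 1 km north of the center
corner = (45.19, -93.795)  # inside the box but outside the 5 km circle
far = (45.30, -93.85)      # well outside both

# geodist-style ordering: sort the points by ascending distance from the center.
by_distance = sorted([far, corner, near],
                     key=lambda p: haversine_km(center[0], center[1], p[0], p[1]))
print(by_distance)
```

The point named corner shows why bbox can return results outside the radius: it passes the cheap rectangle test but fails the exact distance test.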

• It’s easiest to understand what faceted search is through an example,
appropriately from CNET Reviews, the first website to use Solr even before it
had been contributed to Apache by CNET.

• Faceted search provides an effective way to allow users to refine search
results, continually drilling down until the desired items are found. The
benefits include
1. Superior feedback – users can see at a glance a summary of the search
results and how those results break down by different criteria.
2. No surprises or dead ends – users know how many results match before
they click. Values with zero counts are normally removed to reduce visual
noise and eliminate the possibility of a user accidentally selecting a
constraint that would lead to no results.
3. No selection hierarchy is imposed – users are generally free to add or
remove constraints in any order
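The drill-down idea can be sketched with plain counting. The product list and function names below are invented for illustration; a real Solr facet request computes these counts on the server.

```python
from collections import Counter

def facet_counts(docs, field):
    """Count how many documents carry each value of a field, most frequent first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return counts.most_common()

def apply_filter(docs, field, value):
    """Narrow the result set to documents whose field equals the chosen value."""
    return [doc for doc in docs if doc.get(field) == value]

# Hypothetical search results for a camera query.
results = [
    {"id": 1, "brand": "Canon", "type": "dslr"},
    {"id": 2, "brand": "Canon", "type": "compact"},
    {"id": 3, "brand": "Nikon", "type": "dslr"},
    {"id": 4, "brand": "Canon", "type": "dslr"},
]

print(facet_counts(results, "brand"))     # summary shown next to the results
narrowed = apply_filter(results, "brand", "Canon")
print(facet_counts(narrowed, "type"))     # counts after the user picks a brand
```

Because only values that occur are counted, zero-count choices never appear, which matches the "no dead ends" benefit above.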

• Submitting using solrj (an example):
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

// ZooKeeper hosts of the SolrCloud cluster, comma-separated
String zookeeperQuorum = "192.168.10.1,192.168.10.2,192.168.10.3";
CloudSolrServer server = new CloudSolrServer(zookeeperQuorum);
server.setDefaultCollection("yourCollection");
// Build the document; the field names must exist in the schema
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "111");
doc.addField("intfield", 100);
doc.addField("StringField", "some data");
server.connect();  // connect to ZooKeeper to read the cluster state
server.add(doc);
server.commit();   // make the added document visible to searches
• Note:
1. In this example we specify a collection, but we do not specify which shard
to submit to. By default Solr picks the shard from a hash of the document's
unique key, so the sizes of the shards will be roughly equal.
2. If you like, you may specify which shard to submit to.
3. In this example we do not need to know where the Solr instances are; this
is managed by ZooKeeper.
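How documents end up spread across shards without the client choosing one can be illustrated with hash-based routing. This is a deliberately simplified stand-in: Solr's own router uses a different hash function and assigns hash ranges to shards, whereas crc32 and the modulo below are choices made only for this sketch.

```python
import zlib
from collections import Counter

def shard_for(doc_id, num_shards):
    """Pick a shard index from a stable hash of the document id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

# Route 3000 hypothetical document ids to 3 shards and count the documents per shard.
counts = Counter(shard_for(f"doc-{i}", 3) for i in range(3000))
print(sorted(counts.items()))
```

Since the shard depends only on the id, resubmitting a document with the same id always reaches the same shard, where it replaces the old version.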

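• Updating a doc using solrj (an example):
import java.util.HashMap;
import java.util.Map;
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", id);  // the unique key of the doc to update must be specified
Map<String, String> importIDupdate = new HashMap<String, String>();
importIDupdate.put("set", "A1234567");  // "set" is fixed syntax: replace the field value
doc.addField("importID", importIDupdate);
server.add(doc);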

• Import CSV files using curl:
• Suppose we have a CSV file in example/exampledocs/books.csv
• Example of using HTTP POST to send the CSV data over the
network to the Solr server:
cd example/exampledocs
curl http://localhost:8983/solr/update/csv --data-binary @books.csv -H 'Content-type:text/plain; charset=utf-8'

• Suppose the collection is properly configured. Then you can
pass a rich document (Word, PDF, PPT, …) to Solr for indexing by using
the following API:
ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
up.addFile(new File(filePath), "application/xml; charset=UTF-8");
// literal.id provides the necessary unique id for the document being indexed
up.setParam("literal.id", solrId);
// uprefix=attr_ causes all generated fields that aren't defined in the schema
// to be prefixed with attr_ (which is a dynamic field that is stored)
up.setParam("uprefix", "attr_");
up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
server.request(up);

• Delete all documents in Solr:
http://host:8983/solr/core/update?stream.body=<delete><query>*:*</query></delete>
• Delete documents with id in a range:
http://host:8983/solr/update?stream.body=<delete><query>id:[200000001 TO 200000002]</query></delete>&commit=true
• To delete by several criteria in one request, add another <query> element.
Each <query> is run as its own delete; to require both conditions on the same
document, combine them in a single query with AND.
http://host:8983/solr/update?stream.body=<delete><query>id:298253</query><query>entitytype:BlogEntry</query></delete>&commit=true
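When such URLs are generated by a program, the XML body has to be URL-encoded. The sketch below builds the delete message and the URL with the Python standard library only; the helper names are assumptions of this example, and nothing is actually sent to a server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode, parse_qs, urlsplit

def delete_by_query_body(queries):
    """Build a <delete> message with one <query> element per query string."""
    root = ET.Element("delete")
    for query in queries:
        ET.SubElement(root, "query").text = query
    return ET.tostring(root, encoding="unicode")

def delete_url(update_url, queries, commit=True):
    """Build an update URL carrying the delete message in stream.body."""
    params = {"stream.body": delete_by_query_body(queries)}
    if commit:
        params["commit"] = "true"
    return update_url + "?" + urlencode(params)

body = delete_by_query_body(["*:*"])
print(body)

url = delete_url("http://host:8983/solr/update",
                 ["id:298253", "entitytype:BlogEntry"])
print(url)

# Decoding the query string recovers the XML message and the commit flag.
decoded = parse_qs(urlsplit(url).query)
```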

• First, I set up a collection for the test as follows:
• Go to the official website and download solr-4.6.0.tgz
• Copy solr-4.6.0.tgz to /home/ziv
• In the terminal, go to /home/ziv and unpack solr-4.6.0.tgz:
$ tar zxvf solr-4.6.0.tgz
• Now we have /home/ziv/solr-4.6.0

• Go to /home/ziv/solr-4.6.0/example
• Create a collection for the DB import test:
[Figure: directory listing with callouts. example-DIH-test: created by me,
copied from example-DIH. example-DIH: the official DataImportHandler example
config.]
• Look into example-DIH-test (screenshot callout, apparently on solr.xml):
This is the primary configuration file Solr looks for when starting.
This file specifies the list of "SolrCores" it should load, and high-level
configuration options that should be used for all SolrCores.

• Look into the db folder (screenshot callouts on the conf and data directories):
conf
• This directory is mandatory and must contain your
solrconfig.xml and schema.xml.
• Any other optional configuration files would also be kept here.
data
• This directory is the default location where Solr will keep your index, and
is used by the replication scripts for dealing with snapshots.
• You can override this location in the conf/solrconfig.xml.
• Solr will create this directory if it does not already exist.

• Look into the solr/db folder (screenshot callout on the lib directory):
• This directory is optional.
• If it exists, Solr will load any Jars found in this directory and use them to
resolve any "plugins" specified in your solrconfig.xml or schema.xml (i.e.,
Analyzers, Request Handlers, etc.).
• Alternatively, you can use the <lib> syntax in conf/solrconfig.xml to direct
Solr to your plugins.

• Look into the solr/db/conf folder (screenshot callouts):
schema.xml: schema definition
solrconfig.xml: path setup, UpdateHandler setup, RequestHandler setup, …
db-data-config.xml: specifies what is going to be imported from the DB to Solr

• Write the following content to solrconfig.xml
<lib dir="../../../../dist/" regex="solr-dataimporthandler-.*\.jar" />
<lib dir="../../../../dist/" regex="postgresql-\d.*\.jar" />
• Write the following content to db-data-config.xml
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/SolrTest"
              user="postgres"
              password="postgres"/>
  <document>
    <entity name="id"
            query="select id,features from solrtest">
    </entity>
  </document>
</dataConfig>

• Download the JDBC driver for PostgreSQL and put it in
/home/ziv/solr-4.6.0/dist/
• Now start this core:
$ java -Dsolr.solr.home="./example-DIH-test/solr/" -jar start.jar
