20150210 Ziv Huang
• What is Solr
• Solr VS RDBMs
• SolrCloud Architecture
• Docs, Fields & Schema Design
• Searching
• Faceting
• Submitting Docs
• Deleting Docs
• Import From DB (a test)
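• Solr is a full-text search system. Over the contents of a large number of
documents, it accepts keywords made of arbitrary terms combined with Boolean
operators (AND, OR, NOT), runs fast searches over the text, and returns
results ranked by how well each document matches or grouped by document
metadata, so that they can be counted, analyzed, and consolidated further.
Typical targets of full-text search include news, reports, journals, books,
and website content.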
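• Solr takes an index-based approach: it first splits document content into
tokens, then stores these tokens in an index file, conceptually a lookup
structure such as a hash table or B+ tree, recording the document ID and the
position at which each token occurs. At query time, the system analyzes the
input string into tokens, looks each token up in the index, combines the
results with Boolean operations according to the query conditions, computes
a weight for each result from factors such as how often the terms occur in
the document, and finally outputs the results in ranked order.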
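• Some database systems, such as MS SQL and Oracle, ship with simple built-in
full-text search. They fall short, however, when more complex, customized
full-text search is needed, and a dedicated full-text search engine becomes
necessary. Among open-source projects, Lucene, with a ten-year history, is a
good choice. Its drawback is that it is delivered as a library: you must
write Java code to use it, its feature set is intricate, the learning curve
is long, and it is not easy to get started with.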
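• Solr uses the Lucene library as its core for building full-text indexes and
executing searches, and exposes it as a web service over HTTP so that any
programming language can call it. Solr provides powerful configuration
files, so it can be set up for general full-text search without writing any
Java code; for special requirements, extensions can be written following its
plugin architecture.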
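• Solr's main features include full-text search, hit highlighting, faceted
search, dynamic clustering, database integration, rich document (e.g. Word,
PDF) handling, and geospatial search. Solr has a highly scalable
architecture, offering distributed search and index replication. Many large
websites use Solr for their search and navigation features.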
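The index-then-lookup flow described above (tokenize, record document IDs and positions, look up each query token, combine with AND, rank by occurrence count) can be sketched as a toy inverted index. This is only an illustration under simplifying assumptions: the class name TinyInvertedIndex, the naive tokenizer, and the count-based score are made up for this sketch and are not how Lucene or Solr store or score data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index: token -> (document id -> positions). Illustration only. */
final class TinyInvertedIndex {
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    /** Naive analysis: lowercase, then split on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Index step: record the document id and position of every token. */
    void add(int docId, String text) {
        List<String> tokens = tokenize(text);
        for (int pos = 0; pos < tokens.size(); pos++) {
            postings.computeIfAbsent(tokens.get(pos), k -> new TreeMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Query step: AND all tokens together, then rank by total occurrence count. */
    List<Integer> searchAll(String query) {
        Map<Integer, Integer> scores = null;
        for (String token : tokenize(query)) {
            Map<Integer, List<Integer>> hits = postings.getOrDefault(token, new TreeMap<>());
            Map<Integer, Integer> next = new TreeMap<>();
            for (Map.Entry<Integer, List<Integer>> hit : hits.entrySet()) {
                int count = hit.getValue().size();
                if (scores == null) {
                    next.put(hit.getKey(), count);
                } else if (scores.containsKey(hit.getKey())) {
                    next.put(hit.getKey(), scores.get(hit.getKey()) + count);
                }
            }
            scores = next;
        }
        List<Integer> ranked = new ArrayList<>();
        if (scores != null) {
            final Map<Integer, Integer> finalScores = scores;
            ranked.addAll(finalScores.keySet());
            // Stable sort: highest score first, ties stay in ascending document id order.
            ranked.sort((a, b) -> Integer.compare(finalScores.get(b), finalScores.get(a)));
        }
        return ranked;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Solr is a search server");
        index.add(2, "Search, search and more search with Solr");
        System.out.println(index.searchAll("solr search"));
    }
}
```

Keeping positions in the postings is what would later allow phrase matching; the sketch records them but only uses their count for ranking.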
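Solr has the following features:
• Data schema defined through configuration files, including numeric types, dynamic fields, and a unique key
• Extensions to the Lucene query syntax
• Faceted search and filtering to narrow results
• Geospatial search
• Text analysis (tokenization) and filtering (stemming, stop words) of source data, set up through configuration files
• Configurable search result caches
• Search performance optimizations
• System configuration files in XML format
• An administration interface
• System logs for monitoring
• Fast incremental updates and index replication
• Distributed search coordinated across multiple hosts
• Source data for indexing can be JSON, XML, CSV/delimited text, or binary formats
• Database or XML data can be pulled from local disk or HTTP sources for indexing
• Rich document (PDF, Word, HTML, etc.) parsing and indexing with Apache Tika
• Support for multiple indexes
• Multilingual text analysis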
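• Solr packages a full-text search server so that it runs after editing just
two XML text files: the system configuration and the schema definition. From
an MVC point of view, Solr already provides the Model; the application only
needs to focus on the View (screen layout and UI design) and the Controller
(composing the HTTP URL parameters of each request and parsing the XML/JSON
results). Solr greatly reduces the complexity of building a full-text search
system.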
Solr VS RDBMs
• Solr is not meant to replace your RDBMS entirely, but rather to complement it.
• One of the things that Solr does best is to answer a question such as: “What
are the most relevant documents or fields that possibly match query ‘XYZ’?”
Here ‘XYZ’ might match many documents or records whose fields Solr has
tokenized, stored, queried, and/or ranked to produce a list of result
documents.
• Solr can also quickly filter results by facet fields, enriching the search
experience: users can narrow the list down to the right item or set of items.
• By contrast, the RDBMS in its classic implementation is meant to answer
exact-match queries, e.g., “give me all records in my users table with a
creation date after Oct 1, 2009”, or reporting-related queries, like “what is
the average file size of images uploaded to my photo site, grouped by user
and date”.
• There is one other thing the RDBMS does quite well: efficiently executing a
series of inserts and updates as a transaction, rolling back if one of those
operations fails (the ACID properties: Atomicity, Consistency, Isolation,
Durability).
• The best way to think about Solr is as a quickly searchable view of your
data. A well-designed application can use the best of both approaches: Solr
helps users find the most relevant documents, and the RDBMS is then queried
for more precise additional information to better present the results to the
end user.
SolrCloud Architecture
[Figure: SolrCloud architecture. Logical view: a collection split into shard1, shard2, and shard3, each with a shard leader. Physical view: nodes 192.xxx.xxx.111, 192.xxx.xxx.222, and 192.xxx.xxx.333 hosting the shard replicas, coordinated by Zookeeper.]
Solr terminology | Description
Collection | A complete logical index in a SolrCloud cluster. It is associated with a config set and is made up of one or more shards.
Config Set | A set of config files necessary for a collection to function properly. At minimum this consists of solrconfig.xml and schema.xml, but depending on the contents of those two files it may include other files.
Shard | A logical piece (or slice) of a collection. Each shard is made up of one or more replicas. An election is held (by Zookeeper) to determine which replica is the leader.
Replica | One copy of a shard. One of the replicas will be elected to be the leader.
Shard Leader | The shard replica that has won the leader election. Elections can happen at any time, but normally they are only triggered by events like a Solr instance going down. When documents are indexed, SolrCloud forwards them to the leader of the shard, and the leader distributes them to all the shard replicas.
Zookeeper | SolrCloud requires Zookeeper to handle leader elections. It is recommended that it run standalone, installed separately from Solr, although Zookeeper can run on the same hardware as Solr and many users do so. A majority of servers is needed to provide service (e.g., 5 Zookeeper servers are needed to tolerate the failure of up to 2 servers at a time).
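The majority rule above is simple arithmetic, and a small sketch makes the "5 servers tolerate 2 failures" example easy to check for other ensemble sizes. The class and method names are made up for this illustration; only the strict-majority rule itself comes from the text.

```java
/** Majority arithmetic for a Zookeeper ensemble. Names are illustrative only. */
final class ZkQuorumSketch {
    /** Smallest number of servers that forms a strict majority of the ensemble. */
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** Number of servers that may fail while a majority is still available. */
    static int tolerableFailures(int ensembleSize) {
        return ensembleSize - majority(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 7; n++) {
            System.out.println(n + " servers: majority " + majority(n)
                    + ", tolerates " + tolerableFailures(n) + " failure(s)");
        }
    }
}
```

Under this rule an even-sized ensemble tolerates no more failures than the next smaller odd size (4 servers tolerate 1 failure, the same as 3), which is why odd sizes are the usual choice.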
Docs, Fields & Schema Design
[Figure: a query response. Each entry in the response is a Solr doc, and the fields of each doc are the fields defined in schema.xml.]
• The content of schema.xml looks roughly like this:
<schema name="your name" version="x.x">
  <types> </types>            <!-- field types -->
  <fields> </fields>          <!-- fields -->
  <uniqueKey> </uniqueKey>    <!-- a field -->
</schema>
• The attribute "name" is the name of this schema and is only used for display
purposes.
• version="x.y" is Solr's version number for the schema syntax and semantics.
It should not normally be changed by applications.
• The element "uniqueKey" names the field used to determine and enforce
document uniqueness. It is not required, but you should have a good reason
before leaving it out.
• An example:
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
• General Properties:
Property | Description | Values
name | The name of the fieldType. It is strongly recommended that names consist of alphanumeric or underscore characters only and not start with a digit. |
class | The class name that gets used to store and index the data for this type. |
indexed | If true, this field is searchable, sortable, and facetable. | true/false
stored | If true, the actual value of the field can be retrieved by queries. | true/false
required | Instructs Solr to reject any attempt to add a document that does not have a value for this field. This property defaults to false. | true/false
multiValued | If true, a single document may contain multiple values for this field type. | true/false
• General Properties (continued):
Property | Description | Values
positionIncrementGap | For multivalued fields, specifies a distance between multiple values, which prevents spurious phrase matches. | Integer
autoGeneratePhraseQueries | For text fields. If true, Solr automatically generates phrase queries for adjacent terms. If false, terms must be enclosed in double quotes to be treated as phrases. | true/false
docValues | If true, the value of the field will be put in a column-oriented DocValues structure (this is for a performance boost in sorting, faceting, highlighting). | true/false
sortMissingFirst, sortMissingLast | If sortMissingLast="true", a sort on this field will cause documents without the field to come after documents with the field, regardless of the requested sort order (asc or desc). | true/false
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
• This is a general text field that has reasonable, generic cross-language defaults: it
tokenizes with StandardTokenizer, removes stop words listed in "stopwords.txt" (matched
case-insensitively; the file is empty by default), and lowercases the tokens.
• At query time only, it also applies synonyms.
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<dynamicField name="*_is" type="int" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true" />
<dynamicField name="*_ss" type="string" indexed="true" stored="true" multiValued="true"/>
• Dynamic fields allow Solr to index fields that you did not explicitly define in your
schema. This is useful if you discover you have forgotten to define one or more fields.
(Note that changing the schema in Solr is very costly, since existing documents
typically have to be reindexed.)
• A dynamic field is just like a regular field except that its name contains a wildcard.
• For example, suppose your schema includes a dynamic field with a name of *_i.
If you attempt to index a document with a cost_i field, but no explicit cost_i field is
defined in the schema, then the cost_i field will have the field type and analysis
defined for *_i.
• In practice, we put dynamic fields for all data types in schema.xml for the greatest
schema flexibility.
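The lookup described above (use an explicit field if one exists, otherwise fall back to a matching wildcard pattern such as *_i) can be sketched as follows. This is a simplified illustration, not Solr's implementation: the class DynamicFieldSketch is hypothetical, it only handles a single leading or trailing wildcard, and taking the first declared matching pattern is an assumption of this sketch rather than Solr's own precedence rule.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Simplified field-type lookup with dynamic (wildcard) fields. Illustration only. */
final class DynamicFieldSketch {
    private final Map<String, String> explicitFields = new LinkedHashMap<>();
    private final Map<String, String> dynamicFields = new LinkedHashMap<>();

    DynamicFieldSketch field(String name, String type) {
        explicitFields.put(name, type);
        return this;
    }

    DynamicFieldSketch dynamicField(String pattern, String type) {
        dynamicFields.put(pattern, type);
        return this;
    }

    /** Supports one wildcard, either at the start ("*_i") or at the end ("attr_*"). */
    static boolean matches(String pattern, String fieldName) {
        if (pattern.startsWith("*")) {
            return fieldName.endsWith(pattern.substring(1));
        }
        if (pattern.endsWith("*")) {
            return fieldName.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return pattern.equals(fieldName);
    }

    /** Explicit fields win; otherwise the first declared matching pattern is used (assumption). */
    Optional<String> typeOf(String fieldName) {
        String explicit = explicitFields.get(fieldName);
        if (explicit != null) {
            return Optional.of(explicit);
        }
        for (Map.Entry<String, String> e : dynamicFields.entrySet()) {
            if (matches(e.getKey(), fieldName)) {
                return Optional.of(e.getValue());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        DynamicFieldSketch schema = new DynamicFieldSketch()
                .field("id", "string")
                .dynamicField("*_i", "int")
                .dynamicField("*_s", "string");
        System.out.println("cost_i -> " + schema.typeOf("cost_i"));
        System.out.println("cost   -> " + schema.typeOf("cost"));
    }
}
```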
• An example:
<copyField source="cat" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="manu" dest="text"/>
<copyField source="features" dest="text"/>
• A copyField copies one field to another at the time a document is added to the index.
In the above example, the content of four fields (cat, name, manu, features) will be
copied to the field text.
• When to use copyField?
1. If you want to provide a default search field, so that a query against that one
field effectively searches several fields at the same time.
2. If you want to send the same data to two different fields at the same time.
• An example:
<!-- Create a string version of author for faceting -->
<copyField source="author" dest="author_s"/>
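What copyField does at index time can be pictured with a small in-memory sketch: the values of each source field are appended to the destination field, and the source fields stay as they are. The class CopyFieldSketch, the rule format (pairs of source and destination names), and the sample values are all made up for this illustration; Solr applies the real rules from schema.xml.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory picture of copyField rules being applied to one document. Illustration only. */
final class CopyFieldSketch {
    /**
     * Returns a copy of the document in which, for every {source, dest} rule,
     * the original values of the source field are appended to the dest field.
     */
    static Map<String, List<String>> apply(Map<String, List<String>> doc, String[][] rules) {
        Map<String, List<String>> indexed = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : doc.entrySet()) {
            indexed.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        for (String[] rule : rules) {
            List<String> sourceValues = doc.get(rule[0]);
            if (sourceValues != null) {
                indexed.computeIfAbsent(rule[1], k -> new ArrayList<>()).addAll(sourceValues);
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("name", List.of("sample name"));
        doc.put("manu", List.of("sample manufacturer"));
        String[][] rules = {{"cat", "text"}, {"name", "text"}, {"manu", "text"}, {"features", "text"}};
        System.out.println(apply(doc, rules));
    }
}
```

Because several sources feed the same destination here, the destination ends up holding more than one value, which is why a catch-all field such as text is normally declared multiValued.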
Searching
[Figure: annotated Solr query form. Callouts: input query syntax; sort (field asc|desc); range of results; fields to be returned; return format; search result.]
• Basic query syntax:
• fq: This parameter specifies a filter query that restricts the superset of
documents that can be returned, without influencing the score.
• It can be very useful for speeding up complex queries, since the queries
specified with fq are cached independently of the main query. Caching
means that when the same filter appears in a later query, the cached result
is reused.
• An example:
http://localhost:8983/solr/select?
q=cars
&fq=color:black
&fq=model:Lamborghini
&fq=year:[2014 TO *]
• By default, Solr resolves all of the filters before the main query. Each filter
query is looked up individually in Solr’s filterCache.
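When such a request is built in code rather than typed into a browser, each parameter value has to be URL-encoded, and every filter becomes its own fq parameter so that it can be cached on its own. A minimal sketch using only the JDK; the class and method names are made up for this illustration, and the base URL is the one from the example above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Builds a select URL with one q and any number of fq parameters. Illustration only. */
final class SelectUrlSketch {
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java runtime.
            throw new IllegalStateException(e);
        }
    }

    /** Each filter is sent as a separate fq parameter. */
    static String build(String baseUrl, String query, String... filters) {
        StringBuilder url = new StringBuilder(baseUrl).append("?q=").append(encode(query));
        for (String filter : filters) {
            url.append("&fq=").append(encode(filter));
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8983/solr/select",
                "cars", "color:black", "model:Lamborghini", "year:[2014 TO *]"));
    }
}
```

Sending three separate fq parameters instead of one combined filter means that a later query reusing only color:black can still benefit from the cached entry for that filter.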
• The following parameters are used for spatial search:
Parameter | Description
d | The radial distance, in kilometers.
pt | The center point, in the format "lat,lon" (latitude and longitude).
sfield | A spatially indexed field.
• geofilt: For example, to find all documents within five kilometers of a given
lat/lon point, you could enter
&q=*:*&fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5
• bbox: very similar to geofilt except it uses the bounding box of
the calculated circle. Here's a sample query:
&q=*:*&fq={!bbox sfield=store}&pt=45.15,-93.85&d=5
• The rectangular shape is faster to compute and so it's sometimes
used as an alternative to geofilt when it's acceptable to return
points outside of the radius.
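The difference between the two filters can be illustrated with plain geometry: geofilt keeps points within d kilometers of pt, while bbox keeps points inside the rectangle that encloses that circle, so points near the corners pass bbox but not geofilt. The sketch below is an assumption-laden illustration, not Solr's code: it uses the haversine formula with a mean Earth radius of 6371 km and a naive box that ignores the poles and the date line.

```java
/** Circle filter versus bounding-box filter around a center point. Illustration only. */
final class GeoFilterSketch {
    // Mean Earth radius in kilometers (an assumption of this sketch).
    static final double EARTH_RADIUS_KM = 6371.0;

    /** Great-circle distance between two lat/lon points, using the haversine formula. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** geofilt-style test: is the point within dKm of the center? */
    static boolean inCircle(double centerLat, double centerLon, double dKm, double lat, double lon) {
        return haversineKm(centerLat, centerLon, lat, lon) <= dKm;
    }

    /** bbox-style test: is the point inside the rectangle enclosing that circle? */
    static boolean inBoundingBox(double centerLat, double centerLon, double dKm, double lat, double lon) {
        double latDelta = Math.toDegrees(dKm / EARTH_RADIUS_KM);
        double lonDelta = Math.toDegrees(dKm / (EARTH_RADIUS_KM * Math.cos(Math.toRadians(centerLat))));
        return Math.abs(lat - centerLat) <= latDelta && Math.abs(lon - centerLon) <= lonDelta;
    }

    public static void main(String[] args) {
        // A point near a corner of the box around pt=45.15,-93.85 with d=5.
        double lat = 45.19, lon = -93.79;
        System.out.println("inBoundingBox: " + inBoundingBox(45.15, -93.85, 5, lat, lon));
        System.out.println("inCircle:      " + inCircle(45.15, -93.85, 5, lat, lon));
    }
}
```

The box test needs only two comparisons per point, whereas the circle test evaluates trigonometric functions, which matches the trade-off described above.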
• geodist: a distance function that takes three optional
parameters: (sfield,latitude,longitude). You can use the geodist
function to sort results by distance or to return the distance as the
document score.
• For example, to sort your results by ascending distance, enter
&q=*:*&fq={!geofilt}&sfield=store&pt=45.15,-93.85&d=50&sort=geodist() asc
• To return the distance as the document score, enter
&q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc
Faceting
• It’s easiest to understand what faceted search is through an example,
appropriately from CNET Reviews, the first website to use Solr even before it
had been contributed to Apache by CNET.
• Faceted search provides an effective way to allow users to refine search
results, continually drilling down until the desired items are found. The
benefits include
1. Superior feedback – users can see at a glance a summary of the search
results and how those results break down by different criteria.
2. No surprises or dead ends – users know how many results match before
they click. Values with zero counts are normally removed to reduce visual
noise and eliminate the possibility of a user accidentally selecting a
constraint that would lead to no results.
3. No selection hierarchy is imposed – users are generally free to add or
remove constraints in any order
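The drill-down behavior described above can be sketched over an in-memory list of documents: count how many results carry each value of a field, let the user pick one value as a constraint, and recount on the narrowed set. This is only an illustration of the idea with hypothetical names (FacetSketch, doc, facetCounts, filter) and made-up sample data; Solr computes facet counts from its index.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Facet counting and drill-down over in-memory documents. Illustration only. */
final class FacetSketch {
    /** Builds a document from alternating field names and values. */
    static Map<String, String> doc(String... fieldsAndValues) {
        Map<String, String> d = new LinkedHashMap<>();
        for (int i = 0; i + 1 < fieldsAndValues.length; i += 2) {
            d.put(fieldsAndValues[i], fieldsAndValues[i + 1]);
        }
        return d;
    }

    /** Counts results per value of the field; values with zero hits never appear. */
    static Map<String, Integer> facetCounts(List<Map<String, String>> docs, String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> d : docs) {
            String value = d.get(field);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Applies one constraint chosen by the user. */
    static List<Map<String, String>> filter(List<Map<String, String>> docs, String field, String value) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> d : docs) {
            if (value.equals(d.get(field))) {
                kept.add(d);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Map<String, String>> results = new ArrayList<>();
        results.add(doc("color", "black", "model", "A"));
        results.add(doc("color", "black", "model", "B"));
        results.add(doc("color", "red", "model", "A"));
        System.out.println("by color: " + facetCounts(results, "color"));
        System.out.println("black, by model: " + facetCounts(filter(results, "color", "black"), "model"));
    }
}
```

Since constraints are just successive filters, they can be added or removed in any order, which is the "no selection hierarchy" property listed above.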
Submitting Docs
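• Submitting using solrj (an example):
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Comma-separated Zookeeper hosts; SolrJ locates the Solr nodes through Zookeeper
String zookeeperQuorum = "192.168.10.1,192.168.10.2,192.168.10.3";
CloudSolrServer server = new CloudSolrServer(zookeeperQuorum);
server.setDefaultCollection("yourCollection");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "111");
doc.addField("intfield", 100);
doc.addField("StringField", "some data");
server.connect();
server.add(doc);     // send the document to the collection
server.commit();     // make it visible to searches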
• Note:
1. In this example we specify a collection, but we do not specify which shard
to submit to. SolrCloud routes each document to a shard by hashing its
unique key, so the shards end up roughly equal in size.
2. If you like, you may specify which shard to submit to.
3. In this example we do not need to know where the Solr instances are; this
is managed by Zookeeper.
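Why routing by the document id keeps shards roughly the same size can be seen with a toy router. This sketch is not SolrCloud's router: SolrCloud hashes the id with MurmurHash3 and assigns each shard a range of hash values, whereas the hypothetical ShardRoutingSketch below simply takes String.hashCode modulo the number of shards to show the principle.

```java
import java.util.Arrays;

/** Toy hash router: the same id always lands on the same shard. Illustration only. */
final class ShardRoutingSketch {
    /** Maps a document id to a shard index in the range [0, numShards). */
    static int shardFor(String docId, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(docId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int[] sizes = new int[3];
        for (int id = 0; id < 9000; id++) {
            sizes[shardFor(Integer.toString(id), 3)]++;
        }
        System.out.println("documents per shard: " + Arrays.toString(sizes));
    }
}
```

Because the shard is a pure function of the id, an update or delete for an existing id reaches the shard that already holds that document.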
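• Updating a doc using solrj (an example):
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", id);                   // need to specify the unique key of the doc to update
Map<String, String> importIDupdate = new HashMap<String, String>();
importIDupdate.put("set", "A1234567");    // "set" is fixed syntax: replace the field's value
doc.addField("importID", importIDupdate);
server.add(doc);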
• Import CSV files using curl:
• Suppose we have a CSV file in example/exampledocs/books.csv
• Example of using HTTP-POST to send the CSV data over the
network to the Solr server:
cd example/exampledocs
curl http://localhost:8983/solr/update/csv --data-binary @books.csv -H 'Content-type:text/plain; charset=utf-8'
• Suppose the collection is properly configured. Then you can
pass a rich document (Word, PDF, PPT, …) to Solr for indexing by using
the following API:
import java.io.File;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
up.addFile(new File(filePath), "application/xml; charset=UTF-8");
// literal.id provides the necessary unique id for the document being indexed
up.setParam("literal.id", solrId);
// uprefix=attr_ causes all generated fields that aren't defined in the schema
// to be prefixed with attr_ (which is a dynamic field that is stored)
up.setParam("uprefix", "attr_");
up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
server.request(up);
Deleting Docs
• Delete all documents in Solr:
http://host:8983/solr/core/update?stream.body=<delete><query>*:*</query></delete>
• Delete documents with id in a range:
http://host:8983/solr/update?stream.body=<delete><query>id:[200000001 TO 200000002]</query></delete>&commit=true
• To delete documents matching any of several queries, just add another
query; each <query> is applied on its own. (To require that a single document
match both conditions, combine them in one query with AND.)
http://host:8983/solr/update?stream.body=<delete><query>id:298253</query><query>entitytype:BlogEntry</query></delete>&commit=true
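The stream.body values above are small XML messages, so when they are assembled in code the query text should be XML-escaped before it is wrapped in the delete element. A minimal sketch with made-up names (DeleteMessageSketch, byQueries), using only the JDK; it only builds the message and does not send it.

```java
/** Builds the XML body of a delete-by-query message. Illustration only. */
final class DeleteMessageSketch {
    /** Escapes the characters that would otherwise break the XML message. */
    static String escapeXml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** One query element per query string, all inside a single delete element. */
    static String byQueries(String... queries) {
        StringBuilder xml = new StringBuilder("<delete>");
        for (String query : queries) {
            xml.append("<query>").append(escapeXml(query)).append("</query>");
        }
        return xml.append("</delete>").toString();
    }

    public static void main(String[] args) {
        System.out.println(byQueries("*:*"));
        System.out.println(byQueries("id:298253", "entitytype:BlogEntry"));
    }
}
```

If the message is placed in a URL as in the examples above, it additionally has to be URL-encoded.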
Import From DB (a test)
• First, I set up a collection for the test as follows:
• Go to the official website and download solr-4.6.0.tgz
• Copy solr-4.6.0.tgz to /home/ziv
• In the terminal, go to /home/ziv and unpack solr-4.6.0.tgz:
$ tar zxvf solr-4.6.0.tgz
• Now we have /home/ziv/solr-4.6.0
• Go to /home/ziv/solr-4.6.0/example
• Create a collection for the DB import test:
[Figure: listing of the example directory. example-DIH-test was created by me, copied from example-DIH, the official DataImportHandler example config.]
• Look into example-DIH-test
solr.xml: This is the primary configuration file Solr looks for when starting.
It specifies the list of "SolrCores" it should load, and the high-level
configuration options that should be used for all SolrCores.
• Look into the solr.xml
• Look into the db folder
conf/:
• This directory is mandatory and must contain your
solrconfig.xml and schema.xml.
• Any other optional configuration files would also be kept here.
data/:
• This directory is the default location where Solr will keep your index, and
is used by the replication scripts for dealing with snapshots.
• You can override this location in conf/solrconfig.xml.
• Solr will create this directory if it does not already exist.
• Look into the solr/db folder
lib/:
• This directory is optional.
• If it exists, Solr will load any Jars found in this directory and use them to
resolve any "plugins" specified in your solrconfig.xml or schema.xml (i.e.
Analyzers, Request Handlers, etc.).
• Alternatively you can use the <lib> syntax in conf/solrconfig.xml to direct
Solr to your plugins.
• Look into the solr/db/conf folder
schema.xml: schema definition
solrconfig.xml: path setup, UpdateHandler setup, RequestHandler setup, …
db-data-config.xml: specifies what is going to be imported from the DB to Solr
• Write the following content to solrconfig.xml
<lib dir="../../../../dist/" regex="solr-dataimporthandler-.*\.jar" />
<lib dir="../../../../dist/" regex="postgresql-\d.*\.jar" />
• Write the following content to db-data-config.xml
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/SolrTest"
              user="postgres"
              password="postgres"/>
  <document>
    <entity name="id"
            query="select id,features from solrtest">
    </entity>
  </document>
</dataConfig>
• Download the JDBC driver for PostgreSQL and put it in
/home/ziv/solr-4.6.0/dist/
• Now start this core:
$ java -Dsolr.solr.home="./example-DIH-test/solr/" -jar start.jar
• Execute dataimport
• Execution result:
• Query all data:
Thank you!

More Related Content

What's hot

Solr vs. Elasticsearch, Case by Case: Presented by Alexandre Rafalovitch, UN
Solr vs. Elasticsearch,  Case by Case: Presented by Alexandre Rafalovitch, UNSolr vs. Elasticsearch,  Case by Case: Presented by Alexandre Rafalovitch, UN
Solr vs. Elasticsearch, Case by Case: Presented by Alexandre Rafalovitch, UN
Lucidworks
 
Solr Anti - patterns
Solr Anti - patternsSolr Anti - patterns
Solr Anti - patterns
Rafał Kuć
 
Apache Solr 1.4 – Faster, Easier, and More Versatile than Ever
Apache Solr 1.4 – Faster, Easier, and More Versatile than EverApache Solr 1.4 – Faster, Easier, and More Versatile than Ever
Apache Solr 1.4 – Faster, Easier, and More Versatile than Ever
Lucidworks (Archived)
 
Solr 6 Feature Preview
Solr 6 Feature PreviewSolr 6 Feature Preview
Solr 6 Feature Preview
Yonik Seeley
 
Solr Anti-Patterns: Presented by Rafał Kuć, Sematext
Solr Anti-Patterns: Presented by Rafał Kuć, SematextSolr Anti-Patterns: Presented by Rafał Kuć, Sematext
Solr Anti-Patterns: Presented by Rafał Kuć, Sematext
Lucidworks
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
Saumitra Srivastav
 
Using Apache Solr
Using Apache SolrUsing Apache Solr
Using Apache Solr
pittaya
 
Webinar: What's New in Solr 7
Webinar: What's New in Solr 7 Webinar: What's New in Solr 7
Webinar: What's New in Solr 7
Lucidworks
 
Solr workshop
Solr workshopSolr workshop
Solr workshop
Yasas Senarath
 
Solr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by CaseSolr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by Case
Alexandre Rafalovitch
 
Solr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approachSolr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approach
Alexandre Rafalovitch
 
Créer et gérer une scratch org avec Visual Studio Code
Créer et gérer une scratch org avec Visual Studio CodeCréer et gérer une scratch org avec Visual Studio Code
Créer et gérer une scratch org avec Visual Studio Code
Thierry TROUIN ☁
 
Attacks against Microsoft network web clients
Attacks against Microsoft network web clients Attacks against Microsoft network web clients
Attacks against Microsoft network web clients Positive Hack Days
 
Comment utiliser Visual Studio Code pour travailler avec une scratch Org
Comment utiliser Visual Studio Code pour travailler avec une scratch OrgComment utiliser Visual Studio Code pour travailler avec une scratch Org
Comment utiliser Visual Studio Code pour travailler avec une scratch Org
Thierry TROUIN ☁
 
Learn Ajax here
Learn Ajax hereLearn Ajax here
Learn Ajax here
jarnail
 
Webinar: Simplifying Persistence for Java and MongoDB
Webinar: Simplifying Persistence for Java and MongoDBWebinar: Simplifying Persistence for Java and MongoDB
Webinar: Simplifying Persistence for Java and MongoDB
MongoDB
 
ShmooCon 2009 - (Re)Playing(Blind)Sql
ShmooCon 2009 - (Re)Playing(Blind)SqlShmooCon 2009 - (Re)Playing(Blind)Sql
ShmooCon 2009 - (Re)Playing(Blind)Sql
Chema Alonso
 
Web scraping using scrapy - zekeLabs
Web scraping using scrapy - zekeLabsWeb scraping using scrapy - zekeLabs
Web scraping using scrapy - zekeLabs
zekeLabs Technologies
 
IR SQLite Session #1
IR SQLite Session #1IR SQLite Session #1
IR SQLite Session #1
InfoRepos Technologies
 

What's hot (20)

Solr vs. Elasticsearch, Case by Case: Presented by Alexandre Rafalovitch, UN
Solr vs. Elasticsearch,  Case by Case: Presented by Alexandre Rafalovitch, UNSolr vs. Elasticsearch,  Case by Case: Presented by Alexandre Rafalovitch, UN
Solr vs. Elasticsearch, Case by Case: Presented by Alexandre Rafalovitch, UN
 
Solr Anti - patterns
Solr Anti - patternsSolr Anti - patterns
Solr Anti - patterns
 
Apache Solr
Apache SolrApache Solr
Apache Solr
 
Apache Solr 1.4 – Faster, Easier, and More Versatile than Ever
Apache Solr 1.4 – Faster, Easier, and More Versatile than EverApache Solr 1.4 – Faster, Easier, and More Versatile than Ever
Apache Solr 1.4 – Faster, Easier, and More Versatile than Ever
 
Solr 6 Feature Preview
Solr 6 Feature PreviewSolr 6 Feature Preview
Solr 6 Feature Preview
 
Solr Anti-Patterns: Presented by Rafał Kuć, Sematext
Solr Anti-Patterns: Presented by Rafał Kuć, SematextSolr Anti-Patterns: Presented by Rafał Kuć, Sematext
Solr Anti-Patterns: Presented by Rafał Kuć, Sematext
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Using Apache Solr
Using Apache SolrUsing Apache Solr
Using Apache Solr
 
Webinar: What's New in Solr 7
Webinar: What's New in Solr 7 Webinar: What's New in Solr 7
Webinar: What's New in Solr 7
 
Solr workshop
Solr workshopSolr workshop
Solr workshop
 
Solr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by CaseSolr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by Case
 
Solr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approachSolr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approach
 
Créer et gérer une scratch org avec Visual Studio Code
Créer et gérer une scratch org avec Visual Studio CodeCréer et gérer une scratch org avec Visual Studio Code
Créer et gérer une scratch org avec Visual Studio Code
 
Attacks against Microsoft network web clients
Attacks against Microsoft network web clients Attacks against Microsoft network web clients
Attacks against Microsoft network web clients
 
Comment utiliser Visual Studio Code pour travailler avec une scratch Org
Comment utiliser Visual Studio Code pour travailler avec une scratch OrgComment utiliser Visual Studio Code pour travailler avec une scratch Org
Comment utiliser Visual Studio Code pour travailler avec une scratch Org
 
Learn Ajax here
Learn Ajax hereLearn Ajax here
Learn Ajax here
 
Webinar: Simplifying Persistence for Java and MongoDB
Webinar: Simplifying Persistence for Java and MongoDBWebinar: Simplifying Persistence for Java and MongoDB
Webinar: Simplifying Persistence for Java and MongoDB
 
ShmooCon 2009 - (Re)Playing(Blind)Sql
ShmooCon 2009 - (Re)Playing(Blind)SqlShmooCon 2009 - (Re)Playing(Blind)Sql
ShmooCon 2009 - (Re)Playing(Blind)Sql
 
Web scraping using scrapy - zekeLabs
Web scraping using scrapy - zekeLabsWeb scraping using scrapy - zekeLabs
Web scraping using scrapy - zekeLabs
 
IR SQLite Session #1
IR SQLite Session #1IR SQLite Session #1
IR SQLite Session #1
 

Similar to 20150210 solr introdution

Solr search engine with multiple table relation
Solr search engine with multiple table relationSolr search engine with multiple table relation
Solr search engine with multiple table relation
Jay Bharat
 
Apache solr liferay
Apache solr liferayApache solr liferay
Apache solr liferay
Binesh Gummadi
 
IT talk SPb "Full text search for lazy guys"
IT talk SPb "Full text search for lazy guys" IT talk SPb "Full text search for lazy guys"
IT talk SPb "Full text search for lazy guys"
DataArt
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr WorkshopJSGB
 
Apache Solr for begginers
Apache Solr for begginersApache Solr for begginers
Apache Solr for begginers
Alexander Tokarev
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
Trey Grainger
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
Bertrand Delacretaz
 
Solr Powered Lucene
Solr Powered LuceneSolr Powered Lucene
Solr Powered Lucene
Erik Hatcher
 
Apache Solr
Apache SolrApache Solr
Apache Solr
Kevin Wenger
 
Solr Search Engine: Optimize Is (Not) Bad for You
Solr Search Engine: Optimize Is (Not) Bad for YouSolr Search Engine: Optimize Is (Not) Bad for You
Solr Search Engine: Optimize Is (Not) Bad for You
Sematext Group, Inc.
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
Erik Hatcher
 
Dev8d Apache Solr Tutorial
Dev8d Apache Solr TutorialDev8d Apache Solr Tutorial
Dev8d Apache Solr TutorialSourcesense
 
Solr introduction
Solr introductionSolr introduction
Solr introduction
Lap Tran
 
Solr As A SparkSQL DataSource
Solr As A SparkSQL DataSourceSolr As A SparkSQL DataSource
Solr As A SparkSQL DataSource
Spark Summit
 
Solr中国8月4日答疑交流v2
Solr中国8月4日答疑交流v2Solr中国8月4日答疑交流v2
Solr中国8月4日答疑交流v2longkeyy
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
How do Solr and Azure Search compare?
How do Solr and Azure Search compare?How do Solr and Azure Search compare?
How do Solr and Azure Search compare?
SearchStax
 
Solr
SolrSolr
Lifecycle of a Solr Search Request - Chris "Hoss" Hostetter, Lucidworks
Lifecycle of a Solr Search Request - Chris "Hoss" Hostetter, LucidworksLifecycle of a Solr Search Request - Chris "Hoss" Hostetter, Lucidworks
Lifecycle of a Solr Search Request - Chris "Hoss" Hostetter, Lucidworks
Lucidworks
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
Rahul Jain
 

Similar to 20150210 solr introdution (20)

Solr search engine with multiple table relation
Solr search engine with multiple table relationSolr search engine with multiple table relation
Solr search engine with multiple table relation
 
Apache solr liferay
Apache solr liferayApache solr liferay
Apache solr liferay
 
IT talk SPb "Full text search for lazy guys"
IT talk SPb "Full text search for lazy guys" IT talk SPb "Full text search for lazy guys"
IT talk SPb "Full text search for lazy guys"
 
20150210 Solr introduction

  • 2. • What is Solr • Solr VS RDBMs • SolrCloud Architecture • Docs, Fields & Schema Design • Searching • Faceting • Submitting Docs • Deleting Docs • Import From DB (a test)
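  • 3. • Solr is a full-text search system. Over a large collection of documents, it accepts keywords made of arbitrary terms combined with Boolean operators (AND, OR, NOT), searches the document content quickly, and returns results ranked by relevance score or grouped by document metadata, so they can be used for further statistics, analysis, and aggregation. Typical targets of full-text search include news, reports, journals, books, and website content.
      • Solr works by indexing: document content is first split into tokens, and these tokens are stored in an index file built on data structures such as hash tables or B+ trees, recording the document ID and the positions at which each token occurs. At query time, the system tokenizes the input string, looks each token up in the index, combines the results according to the Boolean conditions of the query, computes a weight for each result from factors such as the number of occurrences in the document, and finally returns the sorted results.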
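  • 4. • Some database systems, such as MS SQL and Oracle, ship with simple built-in full-text search. For more complex, customized full-text search they fall short, and a dedicated full-text search engine is needed. Among open-source projects, Lucene, with a ten-year history, is a good choice. Its drawback is that it is delivered as a library: you have to write Java code to use it, its feature set is intricate, and the learning curve is long, so it is not easy to get started with.
      • Solr is built on the Lucene library, which handles index building and search execution for full-text data, and exposes this as a web service over HTTP so that any programming language can call it. Solr provides powerful configuration files, so it can be set up for general full-text search without writing Java code; for special needs, extensions can be written against its plugin architecture.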
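  • 5. • Solr's main features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g. Word, PDF) handling, and spatial search. Solr has a highly scalable architecture and provides distributed search and index replication. Many large websites use Solr to power their search and navigation features.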
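  • 6. Solr has the following features:
      • Data schema defined in a configuration file, including numeric types, dynamic fields, and a unique key
      • Extensions to the Lucene query syntax
      • Faceted search and filtering to narrow results
      • Geospatial search
      • Text analysis (tokenization) and filtering (stemming, stop words) set up through configuration files
      • Configurable search result caches
      • Search performance optimization
      • XML system configuration files
      • An administration interface
      • System logs for monitoring
      • Fast incremental updates and index replication
      • Coordinated distributed search across multiple hosts
      • Source data for indexing in JSON, XML, CSV/delimited text, and binary formats
      • Database or XML data fetched from local disk or HTTP sources for indexing
      • Rich document (PDF, Word, HTML, etc.) parsing and indexing with Apache Tika
      • Support for multiple indexes
      • Multilingual text analysis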
  • 9. • Solr is not meant to entirely replace your RDBMS but rather complement it • One of the things that Solr does best is to answer a question such as: “What is a list of the most relevant documents or fields, that possibly match query ‘XYZ’? ” In this case, your “XYZs” might match a bunch of documents or records with fields that Solr may have tokenized, stored, queried and/or ranked to produce a list of result documents. • Solr also provides the ability to quickly filter results by facet fields enriching the search experience so that your users can narrow the list to find the right item or set of items based on faceted fields.
  • 10. • By contrast, the RDBMS in its classic implementation is meant to answer questions such as: exact match queries, e.g., “give me all records in my users table with a creation date after Oct 1, 2009”; or, reporting-related queries, like “what is the average file size of images uploaded to my photo site grouped by user and date” • There is one other thing the RDBMS does quite well: efficiently executing a series of inserts and updates for a transaction, rolling back if one of those operations failed (also known as ACID properties: Atomicity, Consistency, Isolation, Durability). • The best way to think about Solr is that it’s a quickly searchable view of your data. A well-designed application can use the best of both these approaches, utilizing Solr to help users find the most relevant documents and then use your RDBMS to query for more precise additional information to better present the results to the end user.
  • 12. [Diagram: a collection (logical) split into shard1, shard2 and shard3; the replicas of the shards are placed on the physical nodes 192.xxx.xxx.111, 192.xxx.xxx.222 and 192.xxx.xxx.333, and each shard has one shard leader; Zookeeper and the Config Set are shown alongside.]
      Solr terminology:
      • Collection: A complete logical index in a SolrCloud cluster. It is associated with a config set and is made up of one or more shards.
      • Config Set: A set of config files necessary for a collection to function properly. At minimum this will consist of solrconfig.xml and schema.xml, but depending on the contents of those two files, it may include other files.
  • 13. Solr terminology:
      • Shard: A logical piece (or slice) of a collection. Each shard is made up of one or more replicas. An election is held (by Zookeeper) to determine which replica is the leader.
      • Replica: One copy of a shard. One of the replicas will be elected to be the leader.
  • 14. Solr terminology:
      • Shard Leader: The shard replica that has won the leader election. Elections can happen at any time, but normally they are only triggered by events like a Solr instance going down. When documents are indexed, SolrCloud will forward them to the leader of the shard, and the leader will distribute them to all the shard replicas.
  • 15. Solr terminology:
      • Zookeeper: SolrCloud requires Zookeeper to handle leader elections. It is recommended that Zookeeper be standalone, installed separately from Solr. A majority of servers is needed to provide service (e.g., 5 Zookeeper servers are needed to allow for the failure of up to 2 servers at a time). Zookeeper can run on the same hardware as Solr, and many users do run it that way.
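The majority rule for Zookeeper mentioned above can be checked with a little arithmetic. The following is only a minimal sketch of that rule; the class and method names are made up for illustration and are not part of Solr or Zookeeper.

```java
// Sketch: majority-quorum arithmetic for a Zookeeper ensemble.
// Class and method names are hypothetical, chosen for this illustration.
final class ZkQuorumSketch {

    /** Smallest number of servers that forms a strict majority of the ensemble. */
    static int quorumSize(int ensembleSize) {
        if (ensembleSize < 1) {
            throw new IllegalArgumentException("ensemble must have at least one server");
        }
        return ensembleSize / 2 + 1;
    }

    /** Number of servers that may fail while a majority is still reachable. */
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 7; n++) {
            System.out.println(n + " server(s): quorum " + quorumSize(n)
                    + ", tolerated failures " + toleratedFailures(n));
        }
    }
}
```

Under this rule an even-sized ensemble tolerates no more failures than the next smaller odd one (4 servers tolerate 1 failure, the same as 3), which is why odd ensemble sizes are the usual choice.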
  • 17. [Screenshot of a query result with three callouts: "This is a query response", "This is a Solr doc", "These are fields defined in schema.xml".]
  • 18. • The content of schema.xml looks roughly like this:
        <schema name="your name" version="x.x">
          <types> </types>            <!-- field types -->
          <fields> </fields>          <!-- fields -->
          <uniqueKey> </uniqueKey>    <!-- a field -->
        </schema>
  • 19. • The attribute "name" is the name of this schema and is only used for display purposes.
      • version="x.x" is Solr's version number for the schema syntax and semantics. It should not normally be changed by applications.
      • The "uniqueKey" element names the field used to determine and enforce document uniqueness. It is not required, but if you leave it out, you had better have a good reason.
  • 20. • An example:
        <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
        <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
        <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
        <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
        <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
            <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
        </fieldType>
  • 21. • General properties:
        Property | Description | Values
        name | The name of the fieldType. It is strongly recommended that names consist of alphanumeric or underscore characters only and not start with a digit. |
        class | The class name that gets used to store and index the data for this type. |
        indexed | If true, then this field is searchable, sortable, and facetable. | true/false
        stored | If true, the actual value of the field can be retrieved by queries. | true/false
        required | Instructs Solr to reject any attempt to add a document which does not have a value for this field. This property defaults to false. | true/false
        multiValued | If true, indicates that a single document might contain multiple values for this field type. | true/false
  • 22. • General properties (continued):
        Property | Description | Values
        positionIncrementGap | For multivalued fields, specifies a distance between multiple values, which prevents spurious phrase matches. | integer
        autoGeneratePhraseQueries | For text fields. If true, Solr automatically generates phrase queries for adjacent terms. If false, terms must be enclosed in double quotes to be treated as phrases. | true/false
        docValues | If true, the value of the field will be put in a column-oriented DocValues structure (this is for a performance boost in sorting, faceting, highlighting). | true/false
        sortMissingFirst, sortMissingLast | If sortMissingLast="true", then a sort on this field will cause documents without the field to come after documents with the field, regardless of the requested sort order (asc or desc). | true/false
  • 23.
        <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
            <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
        </fieldType>
      • This is a general text field that has reasonable, generic cross-language defaults: it tokenizes with StandardTokenizer, removes stop words listed in the case-insensitive "stopwords.txt" (empty by default), and lowercases.
      • At query time only, it also applies synonyms.
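To make the index-time versus query-time difference concrete, here is a rough sketch of such an analysis chain in plain Java. It does not use Lucene's analyzers: the tokenizer is a crude split on non-alphanumeric characters, and the stop word and synonym tables are hypothetical stand-ins for stopwords.txt and synonyms.txt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Sketch of a text_general-like chain: tokenize, drop stop words, lowercase,
// and expand synonyms at query time only. Not Lucene code; names are hypothetical.
final class AnalysisChainSketch {

    // Hypothetical stand-ins for stopwords.txt and synonyms.txt.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "of");
    static final Map<String, List<String>> SYNONYMS =
            Map.of("tv", List.of("tv", "television"));

    static List<String> analyze(String text, boolean queryTime) {
        List<String> tokens = new ArrayList<>();
        // Crude tokenizer: split on anything that is not a letter or digit.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            // Lowercasing first gives the same result here as the
            // ignoreCase="true" lookups followed by the lowercase filter.
            String term = raw.toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(term)) {
                continue; // stop filter
            }
            if (queryTime && SYNONYMS.containsKey(term)) {
                tokens.addAll(SYNONYMS.get(term)); // synonym expansion, query side only
            } else {
                tokens.add(term);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Price of a TV", false));
        System.out.println(analyze("The Price of a TV", true));
    }
}
```

Calling analyze with queryTime set to false mimics the index analyzer, and with true the query analyzer, so the same input can be compared on both sides.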
  • 24. • Dynamic fields allow Solr to index fields that you did not explicitly define in your schema. This is useful if you discover you have forgotten to define one or more fields. (Note that changing the schema in Solr is very costly!)
      • A dynamic field is just like a regular field except it has a name with a wildcard in it.
      • An example:
        <dynamicField name="*_i" type="int" indexed="true" stored="true"/>
        <dynamicField name="*_is" type="int" indexed="true" stored="true" multiValued="true"/>
        <dynamicField name="*_s" type="string" indexed="true" stored="true" />
        <dynamicField name="*_ss" type="string" indexed="true" stored="true" multiValued="true"/>
      • For example, suppose your schema includes a dynamic field with a name of *_i. If you attempt to index a document with a cost_i field, but no explicit cost_i field is defined in the schema, then the cost_i field will have the field type and analysis defined for *_i.
      • In practice, we put dynamic fields in schema.xml for all data types for the greatest schema flexibility.
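The wildcard matching described above can be sketched in a few lines of Java. This is only an illustration of the idea, assuming patterns with a single leading or trailing "*"; it is not Solr's implementation, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: resolve a field name against dynamic field patterns.
// Not Solr code; the patterns mirror the dynamicField example above.
final class DynamicFieldSketch {

    static final Map<String, String> DYNAMIC_FIELDS = new LinkedHashMap<>();
    static {
        DYNAMIC_FIELDS.put("*_i", "int");
        DYNAMIC_FIELDS.put("*_is", "int (multiValued)");
        DYNAMIC_FIELDS.put("*_s", "string");
        DYNAMIC_FIELDS.put("*_ss", "string (multiValued)");
    }

    /** True if fieldName matches a pattern with one leading or trailing wildcard. */
    static boolean matches(String pattern, String fieldName) {
        if (pattern.startsWith("*")) {
            return fieldName.endsWith(pattern.substring(1));
        }
        if (pattern.endsWith("*")) {
            return fieldName.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return pattern.equals(fieldName);
    }

    /** Returns the type of the first matching pattern, or null if none matches. */
    static String resolveType(String fieldName) {
        for (Map.Entry<String, String> entry : DYNAMIC_FIELDS.entrySet()) {
            if (matches(entry.getKey(), fieldName)) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println("cost_i  -> " + resolveType("cost_i"));
        System.out.println("tags_ss -> " + resolveType("tags_ss"));
        System.out.println("cost    -> " + resolveType("cost"));
    }
}
```

With these four suffix patterns a field name can match at most one of them, so the order of the patterns does not matter in this sketch.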
  • 25. • An example:
        <copyField source="cat" dest="text"/>
        <copyField source="name" dest="text"/>
        <copyField source="manu" dest="text"/>
        <copyField source="features" dest="text"/>
      • A copyField copies one field to another at the time a document is added to the index. In the above example, the content of four fields (cat, name, manu, features) will be copied to the field text.
      • When to use copyField?
        1. If you want to provide a default search field that effectively searches several fields at the same time when a query comes to that default field.
        2. If you want to send the same data to two different fields at the same time.
      • An example:
        <!-- Create a string version of author for faceting -->
        <copyField source="author" dest="author_s"/>
  • 27. [Screenshot of the query form with callouts: "Input query syntax here", "Field asc|desc", "Range of results", "Fields to be returned", "Return format", "Search result".]
  • 28. • Basic query syntax:
  • 29. • fq: This parameter specifies a query that restricts the superset of documents that can be returned, without influencing the score.
      • It can be very useful for speeding up complex queries, since the queries specified with fq are cached independently of the main query. Caching means that the result of the same filter can be reused by a later query.
      • An example:
        http://localhost:8983/solr/select?q=cars&fq=color:black&fq=model:Lamborghini&fq=year:[2014 TO *]
      • By default, Solr resolves all of the filters before the main query. Each filter query is looked up individually in Solr's filterCache.
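When such a URL is built from code, the parameter values need URL encoding (the range query contains spaces and brackets). The sketch below assembles the example request using only the Java standard library; the class and method names are hypothetical, and it builds the URL without sending it.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Sketch: build a /select URL with one main query and several filter queries.
// Hypothetical helper, not part of solrj.
final class FilterQueryUrlSketch {

    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    static String buildSelectUrl(String selectUrl, String q, List<String> filterQueries) {
        StringBuilder url = new StringBuilder(selectUrl).append("?q=").append(encode(q));
        for (String fq : filterQueries) {
            // One fq parameter per filter, so each filter is a separate query.
            url.append("&fq=").append(encode(fq));
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildSelectUrl(
                "http://localhost:8983/solr/select",
                "cars",
                List.of("color:black", "model:Lamborghini", "year:[2014 TO *]")));
    }
}
```

Keeping each condition in its own fq parameter, rather than joining them into one string, matches the slide's point that filters are looked up individually.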
  • 30. • The following parameters are used for spatial search:
        Parameter | Description
        d | the radial distance, in kilometers
        pt | the center point, in the format "lat,lon" (latitude and longitude)
        sfield | a spatial indexed field
      • geofilt: For example, to find all documents within five kilometers of a given lat/lon point, you could enter
        &q=*:*&fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5
  • 31. • bbox: very similar to geofilt, except that it uses the bounding box of the calculated circle. Here's a sample query:
        &q=*:*&fq={!bbox sfield=store}&pt=45.15,-93.85&d=5
      • The rectangular shape is faster to compute, so it's sometimes used as an alternative to geofilt when it's acceptable to return points outside of the radius.
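The difference between the circle test (geofilt) and the bounding-box test (bbox) can be illustrated with a small geometric sketch. This is not how Solr computes either filter internally: the Earth radius constant, the simplified box computation (which ignores the poles and the date line) and all names are assumptions made for the illustration.

```java
// Sketch: circle test (geofilt-like) versus bounding-box test (bbox-like).
// Simplified geometry for illustration only; names are hypothetical.
final class GeoFilterSketch {

    static final double EARTH_RADIUS_KM = 6371.0; // assumed mean Earth radius

    /** Great-circle distance in kilometers, using the haversine formula. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.pow(Math.sin(dLat / 2), 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.pow(Math.sin(dLon / 2), 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** geofilt-like: the point must lie within dKm of the center. */
    static boolean withinCircle(double cLat, double cLon, double lat, double lon, double dKm) {
        return haversineKm(cLat, cLon, lat, lon) <= dKm;
    }

    /** bbox-like: the point must lie in the box that encloses the circle. */
    static boolean withinBoundingBox(double cLat, double cLon, double lat, double lon, double dKm) {
        double latDelta = Math.toDegrees(dKm / EARTH_RADIUS_KM);
        double lonDelta = latDelta / Math.cos(Math.toRadians(cLat));
        return Math.abs(lat - cLat) <= latDelta && Math.abs(lon - cLon) <= lonDelta;
    }

    public static void main(String[] args) {
        // A point near a corner of the box around pt=45.15,-93.85 with d=5.
        double lat = 45.19, lon = -93.79;
        System.out.println("in box:    " + withinBoundingBox(45.15, -93.85, lat, lon, 5));
        System.out.println("in circle: " + withinCircle(45.15, -93.85, lat, lon, 5));
    }
}
```

The box test needs only two comparisons per point, while the circle test needs trigonometry, which is the speed trade-off the slide describes; the price is that corner points such as the one in main pass the box test although they lie outside the radius.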
  • 32. • geodist: a distance function that takes three optional parameters: (sfield,latitude,longitude). You can use the geodist function to sort results by distance, or to return the distance as the score of each result.
      • For example, to sort your results by ascending distance, enter
        &q=*:*&fq={!geofilt}&sfield=store&pt=45.15,-93.85&d=50&sort=geodist() asc
      • To return the distance as the document score, enter
        &q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc
  • 34. • It’s easiest to understand what faceted search is through an example, appropriately from CNET Reviews, the first website to use Solr even before it had been contributed to Apache by CNET.
  • 35. • Faceted search provides an effective way to allow users to refine search results, continually drilling down until the desired items are found. The benefits include 1. Superior feedback – users can see at a glance a summary of the search results and how those results break down by different criteria. 2. No surprises or dead ends – users know how many results match before they click. Values with zero counts are normally removed to reduce visual noise and eliminate the possibility of a user accidentally selecting a constraint that would lead to no results. 3. No selection hierarchy is imposed – users are generally free to add or remove constraints in any order
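As a rough illustration of what facet counts are, the sketch below counts field values over the documents that match a search. It is a plain in-memory example, not Solr's faceting code; the documents and field names are invented for the illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: facet counts over an in-memory result set.
// Hypothetical data and names; Solr computes facets from its index instead.
final class FacetCountSketch {

    /** Counts how many of the matching docs carry each value of the given field. */
    static Map<String, Integer> facetCounts(List<Map<String, String>> matchingDocs, String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> doc : matchingDocs) {
            String value = doc.get(field);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, String>> matchingDocs = List.of(
                Map.of("id", "1", "color", "black", "brand", "A"),
                Map.of("id", "2", "color", "black", "brand", "B"),
                Map.of("id", "3", "color", "red", "brand", "A"));
        System.out.println("color: " + facetCounts(matchingDocs, "color"));
        System.out.println("brand: " + facetCounts(matchingDocs, "brand"));
    }
}
```

Because only matching documents are counted, values that would lead to zero results never show up, which corresponds to the "no dead ends" benefit listed above.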
  • 37. • Submitting using solrj (an example):
        import org.apache.solr.client.solrj.impl.CloudSolrServer;
        import org.apache.solr.common.SolrInputDocument;

        // Zookeeper hosts; the Solr instances are located through Zookeeper
        String zookeeperQuorum = "192.168.10.1,192.168.10.2,192.168.10.3";
        CloudSolrServer server = new CloudSolrServer(zookeeperQuorum);
        server.setDefaultCollection("yourCollection");

        // Build the document to index
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "111");
        doc.addField("intfield", 100);
        doc.addField("StringField", "some data");

        server.connect();
        server.add(doc);     // send the document to the collection
        server.commit();     // make it visible to searches
      • Note:
        1. In this example we specify a collection, but we do not specify which shard to submit to. Solr automatically balances documents between shards so that the sizes of the shards are roughly equal.
        2. If you like, you may specify which shard to submit to.
        3. In this example we do not need to know where the Solr instances are; this is managed by Zookeeper.
  • 38. • Updating a doc using solrj (an example):
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);                      // need to specify the unique key
        Map<String, String> importIDupdate = new HashMap<String, String>();
        importIDupdate.put("set", "A1234567");       // "set" is fixed syntax
        doc.addField("importID", importIDupdate);
  • 39. • Import CSV files using curl:
      • Suppose we have a CSV file in example/exampledocs/books.csv
      • Example of using HTTP POST to send the CSV data over the network to the Solr server:
        cd example/exampledocs
        curl http://localhost:8983/solr/update/csv --data-binary @books.csv -H 'Content-type:text/plain; charset=utf-8'
  • 40. • Suppose the collection is properly configured. Then you can pass a rich document (Word, PDF, PPT, …) to Solr for indexing by using the following API:
        ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
        // The second argument is the content type of the file being sent
        up.addFile(new File(filePath), "application/xml; charset=UTF-8");
        /* The literal.id param provides the necessary unique id for the document being indexed */
        up.setParam("literal.id", solrId);
        /* The uprefix=attr_ param causes all generated fields that aren't defined in the schema
         * to be prefixed with attr_ (which is a dynamic field that is stored) */
        up.setParam("uprefix", "attr_");
        up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        server.request(up);
  • 42. • Delete all documents in Solr:
        http://host:8983/solr/core/update?stream.body=<delete><query>*:*</query></delete>
      • Delete documents with id in a range:
        http://host:8983/solr/update?stream.body=<delete><query>id:[200000001 TO 200000002]</query></delete>&commit=true
      • To delete on more than one condition, add another query. Each <query> is applied on its own, so the request below deletes the documents that match either query; to require both conditions at once, combine them in a single query with AND.
        http://host:8983/solr/update?stream.body=<delete><query>id:298253</query><query>entitytype:BlogEntry</query></delete>&commit=true
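When delete requests like these are generated from code, the query text should be XML-escaped and the whole body URL-encoded. The following sketch builds the delete body and URL with the Java standard library only; the class and method names are hypothetical, and it builds the request without sending it.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch: build a delete-by-query body and the matching stream.body URL.
// Hypothetical helper, not part of solrj.
final class DeleteByQuerySketch {

    /** Escapes the characters that are special inside XML text. */
    static String escapeXml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** One <query> element per query; each one is a separate delete. */
    static String deleteBody(String... queries) {
        StringBuilder body = new StringBuilder("<delete>");
        for (String query : queries) {
            body.append("<query>").append(escapeXml(query)).append("</query>");
        }
        return body.append("</delete>").toString();
    }

    static String deleteUrl(String updateUrl, String... queries) {
        return updateUrl + "?stream.body="
                + URLEncoder.encode(deleteBody(queries), StandardCharsets.UTF_8)
                + "&commit=true";
    }

    public static void main(String[] args) {
        System.out.println(deleteBody("id:[200000001 TO 200000002]"));
        // A single query with AND deletes only documents matching both conditions.
        System.out.println(deleteUrl("http://host:8983/solr/update",
                "id:298253 AND entitytype:BlogEntry"));
    }
}
```

Passing two queries to deleteBody reproduces the two-element form from the slide, while a single query joined with AND narrows the delete to documents that satisfy both conditions.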
  • 44. • First, I set up a collection for the test as follows:
      • Go to the official website and download solr-4.6.0.tgz
      • Copy solr-4.6.0.tgz to /home/ziv
      • In the terminal, go to /home/ziv and extract solr-4.6.0.tgz:
        $ tar zxvf solr-4.6.0.tgz
      • Now we have /home/ziv/solr-4.6.0
  • 45. • Go to /home/ziv/solr-4.6.0/example
      • Create a collection for the DB import test:
        [Screenshot of the directory listing with callouts: "Created by me. Copied from example-DIH" and "dataImportHandler official example config".]
  • 46. • Look into example-DIH-test
      • solr.xml: This is the primary configuration file Solr looks for when starting. This file specifies the list of "SolrCores" it should load, and high-level configuration options that should be used for all SolrCores.
  • 47. • Look into the solr.xml
  • 48. • Look into the db folder
      • conf/: This directory is mandatory and must contain your solrconfig.xml and schema.xml. Any other optional configuration files would also be kept here.
      • data/: This directory is the default location where Solr will keep your index, and is used by the replication scripts for dealing with snapshots. You can override this location in conf/solrconfig.xml. Solr will create this directory if it does not already exist.
  • 49. • Look into the solr/db folder
      • lib/: This directory is optional. If it exists, Solr will load any jars found in this directory and use them to resolve any "plugins" specified in your solrconfig.xml or schema.xml (i.e., analyzers, request handlers, etc.). Alternatively, you can use the <lib> syntax in conf/solrconfig.xml to direct Solr to your plugins.
  • 50. • Look into the solr/db/conf folder
      • schema.xml: Schema definition
      • solrconfig.xml: Path setup, UpdateHandler setup, RequestHandler setup, …
      • db-data-config.xml: Specifies what is going to be imported from the DB to Solr
  • 51. • Write the following content to solrconfig.xml:
        <lib dir="../../../../dist/" regex="solr-dataimporthandler-.*\.jar" />
        <lib dir="../../../../dist/" regex="postgresql-\d.*\.jar" />
      • Write the following content to db-data-config.xml:
        <dataConfig>
          <dataSource type="JdbcDataSource"
                      driver="org.postgresql.Driver"
                      url="jdbc:postgresql://localhost/SolrTest"
                      user="postgres"
                      password="postgres"/>
          <document>
            <entity name="id" query="select id,features from solrtest">
            </entity>
          </document>
        </dataConfig>
  • 52. • Download the JDBC driver for PostgreSQL and put it in /home/ziv/solr-4.6.0/dist/
      • Now start this core:
        $ java -Dsolr.solr.home="./example-DIH-test/solr/" -jar start.jar
  • 55. • Query all data: