Building japanese full text search system by Solr

Building Japanese Full-Text Search System by Solr #openSUSE.Asia Summit2017 10/21
Document Search and Application to an Online Shopping Site
Syuta Hashimoto
opensuse-ja

Self Introduction
・Syuta Hashimoto　@hashimotosyuta
　I work on web products based on open source,
　e.g. online shopping sites, promotion sites, CMS.
・With openSUSE
　I have used openSUSE at home for 4 years.
　I love Geeko!

Main Topic
　Table of contents
１　What is Full-Text Search?
２　What is Solr?
３　Let’s use!
４　What is Index?
５　Structure and Role
６　Solr can search from RDBMS!
７　Facet is easy to count
８　Highlighter is easy to highlight, and more functions
※ Basic knowledge of RDBMS is assumed.

１　What is Full-Text Search?
Q: What is full-text search?
A: Searching the full text of a document! (maybe)
　and, above all, searching the full text of multiple files!
"Multiple files" is the important point in both "full-text search"
and "enterprise search".
・Point 1
　Full-text search usually comes in two types:
　・Serial Scan Type
　・Index Type　← today's topic

USE CASE
I want to search these files for the word "openSUSE"!

Full-text Search Type 1: "Serial Scan Type"
# grep -r 'openSUSE' files_A　(files ① ② ③)
# soffice "files_B/LibreOffice Writer.odt"　→ Ctrl + F　(file ④)
# soffice "files_B/LibreOffice Calc.ods"　→ Ctrl + F　(file ⑤)
# okular files_B/pdf.pdf　→ Ctrl + F　(file ⑥)
For example, you search the files one after another like this.

Full-text Search Type 1: "Serial Scan Type"
# grep -r 'hogehoge'
→ The "Serial Scan Type" searches for the word 'hogehoge' in the files
under the current directory.
ー Pros
・easy
ー Cons
・slow
・difficult to search rich text formats (e.g. Word)
・a lot of search noise
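As an illustration of the serial scan type, here is a minimal Python sketch of what a recursive grep does. It is not from the talk: the function name serial_scan and the sample files are made up for the demonstration, and it only handles plain text files.

```python
import os
import tempfile


def serial_scan(root, word):
    """Return the paths of files under root whose text contains word."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            # Every file is opened and read again on every search:
            # this is why the serial scan type is slow.
            with open(path, encoding="utf-8", errors="ignore") as f:
                if word in f.read():
                    hits.append(path)
    return sorted(hits)


# Hypothetical sample files, created only for this demonstration.
demo_dir = tempfile.mkdtemp()
samples = {
    "text1.txt": "I love openSUSE",
    "text2.txt": "see you at the conference",
}
for name, body in samples.items():
    with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as f:
        f.write(body)

found = serial_scan(demo_dir, "openSUSE")
print([os.path.basename(p) for p in found])  # ['text1.txt']
```

The cost grows with the total size of the files, because nothing is prepared in advance.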

Full-text Search Type 2: "Index Type"
# curl 'http://localhost:8983/solr/techproducts/select?indent=on&q=*:openSUSE&wt=json'　(① ← today's topic)
Build an index beforehand.
You can then search everything at once, using an index
that was made to be easy to search.

Full-text Search Type 2: "Index Type"
The "Index Type" builds an index of the words to be searched
in advance, and searches that index.
ー Pros
・fast
・can search rich text formats (e.g. Word),
　as long as the indexer can read them
・less search noise
ー Cons
・you have to build a search system
・you need to index the files you want to search

２　What is Solr?
About Solr
・An index-type full-text search system
・A subproject of Apache Lucene(™)
　→ Apache Lucene is a full-text search library.
　Solr uses this library, and Solr is open source too.
・Solr is accessed like a web API,
　so any kind of client can use it!
・There is a competing product
　called "Elasticsearch".

３　Let’s use!
Let's build it right away! (local setup)
１　Install a JVM. Java 1.8 or later is required.
　　(It is already installed on Leap 42.3.)
２　Download Solr
　　You can download Solr from the official Solr site. The current version is 7.0.1.
http://www.apache.org/dyn/closer.lua/lucene/solr/7.0.1
The zip file contains everything you need.
３　Extract the zip file
# unzip solr-7.0.1.zip
　　and move into the directory
# cd solr-7.0.1

Starting, Creating a Core, Indexing
４　# bin/solr start
←First, start Solr (no core and no index yet).
５　# bin/solr create -c mycore
←Create a core named "mycore".
６　# bin/post -c mycore /home/hashimoto/doc/*
←Index the files into "mycore".
"bin/post" indexes the files automatically.
・　・　・　(indexing logs are output ...)
That completes the setup.
※ The official Solr site has a tutorial too. (You can also try a cluster there.)

Important Words
・Core
A core is equivalent to an RDBMS schema.
A core holds the index format, the query settings and more.
Roughly speaking, it is the search engine itself.
・Schema definition
In Solr, the index format is called a schema.
It is like an RDBMS table.
・Index
The data produced by indexing the target files
according to the schema definition.

Solr has an "Admin UI" by default
After starting Solr, access http://localhost:8983/solr/ ...
and the Admin UI is displayed.

３　Let’s use!
"mycore" is registered properly.

３　Let’s use!
You can search from "Query" under "mycore".
① Select "Query".
② Enter the search word.
③ Execute.
④ The result is shown here.
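The same kind of search can also be sent over HTTP without the Admin UI. The following is a sketch only: it assumes the local address and the core name "mycore" used in this deck, the helper names build_select_url and search are made up, and search only works while Solr is running.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Default local address used throughout this deck.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(core, query):
    """Build the URL of a search request against the given core."""
    params = {"q": query, "indent": "on", "wt": "json"}
    return f"{SOLR_BASE}/{core}/select?{urlencode(params)}"


def search(core, query):
    """Send the query to a running Solr and return the matching documents."""
    with urlopen(build_select_url(core, query)) as response:
        body = json.load(response)
    return body["response"]["docs"]


url = build_select_url("mycore", "openSUSE")
print(url)
# http://localhost:8983/solr/mycore/select?q=openSUSE&indent=on&wt=json
```

Calling search("mycore", "openSUSE") would then return the list of matching documents from the JSON response.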

What is an index?
This is it.
An index maps each word to the names of the files
that contain it.

The contents of an index
(conceptual image)
So when you search for the word "openSUSE", Solr can answer
immediately that "text1.txt" and "LibreOffice Writer.odt" contain it.
WORD        FILES THAT CONTAIN THE WORD
openSUSE    text1.txt, LibreOffice Writer.odt
conference  text2.txt, pdf.pdf
・・・・・・
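The word-to-file mapping can be sketched in a few lines of Python. This is a simplified illustration, not Solr's actual data structure: the file contents are invented, and the text is simply split on spaces.

```python
from collections import defaultdict

# Hypothetical file contents, already extracted to plain text.
documents = {
    "text1.txt": "openSUSE is a Linux distribution",
    "text2.txt": "the conference starts tomorrow",
    "pdf.pdf": "conference schedule",
}


def build_index(docs):
    """Map each word to the sorted list of file names that contain it."""
    index = defaultdict(set)
    for filename, text in docs.items():
        for word in text.split():
            index[word].add(filename)
    return {word: sorted(files) for word, files in index.items()}


index = build_index(documents)

# A search is now a single dictionary lookup; no file is read again.
print(index["conference"])  # ['pdf.pdf', 'text2.txt']
```

The files are read once, when the index is built. That is the "in advance" work of the index type.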

Index definition = Schema
The definition of an index is called a schema.
A schema defines the following.
・Field
　The equivalent of a column in an RDBMS. Each field is assigned a field type.
　A text is broken into words and then registered.
・Field Type
　The definition of a field: whether it is numeric or a string, and
　whether or not to apply morphological analysis.
・There are also Dynamic Fields and Copy Fields.
　(They are omitted today.)

Indexing
Indexing means registering the content of the search target files
into fields, according to the field definitions.
By the way ...
When content is registered to a field, some processing is applied
to make searching easier.
(That processing is defined in the field type.)

What processing?
・For example, converting all letters to lowercase.
→ "linux", "Linux" and "LINUX" are all converted to "linux" (lowercase).
If the same conversion is applied to the search word, every "linux" can be hit.
・In Japanese, dividing the text on the basis of parts of speech.
「私は東京都で開催されるアジアサミットに行きます。」 (I am going to the Asia Summit held in Tokyo.)
→「私-は-東京-都-で-開催-さ-れる-アジア-サミット-に-行き-ます」
In this case the search word "東京" (Tokyo) is hit, but the search word "京都" (Kyoto) is not.
This reduces search noise when searching many files.
It is a deep technique
called "morphological analysis".
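A small Python sketch of both points, under clear assumptions: the helper names are made up, the English side only lowercases and splits on spaces, and the Japanese token list is copied by hand from the slide (a real morphological analyzer would produce it).

```python
def analyze_english(text):
    """Split on whitespace and lowercase each token."""
    return [token.lower() for token in text.split()]


def matches(query_tokens, indexed_tokens):
    """A query hits when every query token is among the indexed tokens."""
    return all(token in indexed_tokens for token in query_tokens)


# The same conversion is applied when indexing and when searching.
indexed = analyze_english("Linux and LINUX and linux")
print(matches(analyze_english("LiNuX"), indexed))  # True

# Token list copied from the slide, not computed here.
sentence = "私は東京都で開催されるアジアサミットに行きます。"
tokens = ["私", "は", "東京", "都", "で", "開催", "さ", "れる",
          "アジア", "サミット", "に", "行き", "ます"]

print("東京" in tokens, "京都" in tokens)      # True False (token match)
print("東京" in sentence, "京都" in sentence)  # True True (substring match)
```

The last line shows the noise that a plain substring search would produce: "京都" appears inside "東京都" even though the sentence is not about Kyoto.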

５　Structure and Role
Components figure
(Figure labels, search side: ① search, ② query, ③ result, ④ result)
(Figure labels, indexing side: ① registration, ② indexing)
Solr is accessed through its REST API.

Setup is finished! Enjoy a good search life!!
What? My shopping site has its
data in MySQL.
LIKE searches on item descriptions are too slow ...
Oh ...

DataImportHandler
In fact, Solr has a mechanism for searching data
from an RDBMS and other data sources.
From the viewpoint of full-text search, the expected use
is searching item descriptions on an online
shopping site.
But Solr also offers facet search and a
highlighter, so it is even more useful.

Components figure when using an RDBMS
(Figure labels, search side: ① search, ② query, ③ result, ④ result)
(Figure labels, indexing side: ① registration, ② indexing, with the RDBMS as the data source)

Logical structure
Solr treats each field of the schema as equivalent to a column
of the RDBMS and indexes it.
RDBMS
id  name      description
1   openSUSE  geeko is cute!!
Schema
Field name=id
Field name=name
Field name=description
Searching for "geeko" in description returns
the record whose name is "openSUSE".
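The column-to-field correspondence can be pictured with this Python sketch. It does not use Solr or DataImportHandler: an in-memory SQLite database stands in for MySQL, and the table name "items" and the helper names are assumptions made for the example.

```python
import sqlite3

# In-memory stand-in for the shop database; the table name is hypothetical.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, description TEXT)"
)
conn.execute("INSERT INTO items VALUES (1, 'openSUSE', 'geeko is cute!!')")

# One schema field per column, with the same names.
SCHEMA_FIELDS = ["id", "name", "description"]


def rows_to_documents(connection):
    """Turn each row into a document whose fields mirror the columns."""
    rows = connection.execute("SELECT id, name, description FROM items")
    return [{field: row[field] for field in SCHEMA_FIELDS} for row in rows]


def search_description(docs, word):
    """Return the names of documents whose description contains the word."""
    return [doc["name"] for doc in docs if word in doc["description"].split()]


documents = rows_to_documents(conn)
print(search_description(documents, "geeko"))  # ['openSUSE']
```

In the real setup the SQL and the field mapping live in the DataImportHandler settings file instead of in code.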

The settings are slightly troublesome
● Install the connector for RDBMS access.
→ Put the JDBC connector in "server/lib".
● Field definition
→ See the next page.
● Write the settings in solrconfig.xml (the settings file of the core)
・Load the DataImportHandler library
・Declare the use of DataImportHandler and its settings file *a
● Settings file for DataImportHandler (the file *a)
・RDBMS connection settings
・Correspondence between the fields and the SQL
This is an overview.
Please see other
documents for details.

Field Definition
The quick way is to define the schema in the Admin UI.
① Select "Schema".
② Choose "Add Field".
③ Fill in each setting and click "Add Field".

Setup is finished! Let’s import.
As usual, this is done through the REST API.
http://localhost:8983/solr/mycore/dataimport?command=full-import
("mycore" in the URL is our core.)
Incidentally, the URI "/dataimport" is defined by the
requestHandler setting in solrconfig.xml.
This request alone completes the import.
You can then search from the Admin UI.
For practical use, you also need to design delta imports and
the timing of imports.

Facet Search
This function groups the results and counts each group.
For example, to get the count of each type in this data:
id  name      description      type
1   docker    container        virtualization
2   emacs     multiple editor  editor
3   vim       multiple editor  editor
4   chrome    browser          browser
5   firefox   browser          browser
6   sleipnir  browser          browser
"facet_counts":{
  "facet_queries":{},
  "facet_fields":{
    "type":[
      "virtualization",1,
      "editor",2,
      "browser",3]},
  "facet_ranges":{},
  "facet_intervals":{},
  "facet_heatmaps":{}}

Facet Search REST API
Just add the facet search parameters to the query.
http://localhost:8983/solr/mycore/select?facet=on&facet.field=type&indent=on&q=*:*&wt=json
・facet=on
　Enables facet search
・facet.field=type
　Groups and counts by "type"
Of course, facet search can be combined with a normal
search.
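To make the counting concrete, here is a Python sketch that reproduces the facet counts for the sample data, reads the flat value/count list found under "facet_fields", and builds the request URL. The helper names facet_counts and pair_up are made up for this example, and the counting is done locally, not by Solr.

```python
from collections import Counter
from urllib.parse import urlencode

# The sample records from the facet search slide (type column only).
rows = [
    {"id": 1, "name": "docker", "type": "virtualization"},
    {"id": 2, "name": "emacs", "type": "editor"},
    {"id": 3, "name": "vim", "type": "editor"},
    {"id": 4, "name": "chrome", "type": "browser"},
    {"id": 5, "name": "firefox", "type": "browser"},
    {"id": 6, "name": "sleipnir", "type": "browser"},
]


def facet_counts(docs, field):
    """Group the documents by a field and count each group."""
    return Counter(doc[field] for doc in docs)


def pair_up(flat):
    """Turn a flat [value, count, value, count, ...] list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


counts = facet_counts(rows, "type")
print(dict(counts))  # {'virtualization': 1, 'editor': 2, 'browser': 3}

# The "type" list as it appears in the response on the slide.
from_response = pair_up(["virtualization", 1, "editor", 2, "browser", 3])

# safe="*:" keeps the match-all query readable as q=*:* in the URL.
params = {"q": "*:*", "facet": "on", "facet.field": "type", "wt": "json"}
url = "http://localhost:8983/solr/mycore/select?" + urlencode(params, safe="*:")
print(url)
```

Replacing "*:*" with a real search word is how a facet count is combined with a normal search.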

８　Highlighter is easy to highlight
Highlighter
Solr can return the highlighted text separately from the normal
result.
For example, search for "worldwide" in the
description of this data.
id: 1
name: openSUSE
description: The openSUSE project is a worldwide effort that promotes
the use of Linux everywhere. openSUSE creates one of
the world's best Linux distributions, working together in an
open, transparent and friendly manner as part of the
worldwide Free and Open Source Software community.

Searching with the highlighter returns ...
"highlighting":{
  "1":{
    "description":["The openSUSE project is a <em>worldwide</em> effort that promotes the use of "]}}
The original description, for comparison:
The openSUSE project is a worldwide effort that promotes the use of Linux
everywhere. openSUSE creates one of the world's best Linux distributions,
working together in an open, transparent and friendly manner as part of the
worldwide Free and Open Source Software community.
The word "worldwide" is surrounded by an <em>
tag, and the text around the word is retrieved with it.
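The shape of that result can be imitated with a short Python sketch. This is not Solr's highlighter: the function name highlight and the 40-character context width are assumptions, and only the first occurrence is marked.

```python
import re


def highlight(text, word, context=40):
    """Wrap the first occurrence of word in <em> and keep the nearby text."""
    match = re.search(re.escape(word), text)
    if match is None:
        return None
    start = max(0, match.start() - context)
    end = min(len(text), match.end() + context)
    return (
        text[start:match.start()]
        + "<em>" + match.group(0) + "</em>"
        + text[match.end():end]
    )


description = (
    "The openSUSE project is a worldwide effort that promotes "
    "the use of Linux everywhere."
)
snippet = highlight(description, "worldwide")
print(snippet)
```

The snippet starts with "The openSUSE project is a <em>worldwide</em> effort" and is cut off after the context window, like the fragment on the slide.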

Search with Highlighter REST API
As usual, just add query parameters.
http://localhost:8983/solr/mycore/select?hl=on&hl.fl=description&indent=on&q=description:worldwide&wt=json
・hl=on
　Turns the highlighter on
・hl.fl=description
　Specifies the description field for highlighting

Settings of the Highlighter
・The "searchComponent" section in solrconfig.xml.
・Several things must be set on the field.
a. Set "stored" to true, so that the original text is kept for retrieval.
b. Configure analysis in the field type.
The highlighter has several settings that can be combined.
You can use the defaults, but the settings give finer control.
hl.method
hl.qparser
hl.requireFieldMatch
hl.usePhraseHighlighter
etc...

And more functions
・Spatial
・Cloud
・Recommend

Today’s summary
１ Solr is an index-type full-text search system.
２ The field definition is called a "schema".
This decides the structure of an index.
３ Solr can search data from an RDBMS too.
４ Facet search and the highlighter are easy to use.
Good Search Life!!
Have a lot of fun...
