Let's Build Your Own Search Engine with Solr
   - The 1st Lucene Korean Analyzer Technical Seminar -


           Donghyeok Kang
       Principal Researcher, Maxst Co., Ltd.
         2013-04-12
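'02
An Efficient Incremental Clustering Technique for Large Text Databases - text mining

'09
VitaminMD - integrated site search
Open Source Software Contest - Lucene Korean analyzer

'12
Ponket - search and automatic coupon classification

'13
Augmented Reality Platform - location search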
PART I. Basics
PART II. Applications
PART I. Basics
An introduction to Solr
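• Built on Lucene
  –   A text search engine library
  –   Provides a variety of text analyzers
  –   Scoring algorithm
  –   Text highlighter
• Solr is the server version of Lucene
  –   Handles document indexing and queries through HTTP requests
  –   Uses caches for fast query performance
  –   Provides a web-based administration tool
  –   Schema and server configuration files
  –   Disjunction-max queries
  –   More-like-this plugin
  –   Distributed servers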
Install & Deploy
1. Go to http://lucene.apache.org/solr/.
2. Download the latest version (solr-4.2.1.zip).
3. Unzip the archive.
4. cd example
   java -jar start.jar
5. Check http://localhost:8983/solr/admin in a browser.

• When using Apache Tomcat, copy dist/solr-4.x.x.war into the webapps
  directory of Apache Tomcat.
Schema & configuration file
• schema.xml
   –   Structure of the data to be searched
   –   Field types (<types>) and fields (<fields>)
   –   Unique key, default search field, default operator (AND/OR)
   –   Field copying (copyField)
• solrconfig.xml
   – Parameters for indexing and querying
   – <requestHandler>
schema.xml
<types>
   <fieldType name="string" class="solr.StrField" sortMissingLast="true"
      omitNorms="true"/>
   …
</types>
<fields>
   <field name="id" type="string" indexed="true" stored="true"
      required="true" />
   …
</fields>

<uniqueKey>id</uniqueKey>
<defaultSearchField>text</defaultSearchField>
<copyField source="category" dest="text"/>
<copyField source="name" dest="text"/>
schema.xml types
<types>
   <fieldType name="string" class="solr.StrField" sortMissingLast="true"
      omitNorms="true"/>
   <fieldType name="boolean" class="solr.BoolField"
      sortMissingLast="true" omitNorms="true"/>
   <fieldType name="integer" class="solr.IntField" omitNorms="true"/>
   <fieldType name="long" class="solr.LongField" omitNorms="true"/>
   <fieldType name="float" class="solr.FloatField" omitNorms="true"/>
   <fieldType name="double" class="solr.DoubleField"
      omitNorms="true"/>
</types>
schema.xml fields
<fields>
   <field name="id" type="string" indexed="true" stored="true"
      required="true" />
   <field name="sku" type="textTight" indexed="true" stored="true"
      omitNorms="true"/>
   <field name="name" type="text" indexed="true" stored="true"/>
   <field name="nameSort" type="string" indexed="true"
      stored="false"/>
   <field name="text" type="text" indexed="true" stored="false"
      multiValued="true"/>
</fields>
Custom field types
 <fieldType name="text" class="solr.TextField"
       positionIncrementGap="100">
   <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory"
         synonyms="index_synonyms.txt"
         ignoreCase="true" expand="false"/>
       …
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory"
         synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      …
   </analyzer>
 </fieldType>
<fieldType name="text" class="solr.TextField"
       positionIncrementGap="100">
     <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" expand="false"
          synonyms="index_synonyms.txt" ignoreCase="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" catenateAll="0"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="1" catenateNumbers="1"
          splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory"
          protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     </analyzer>
     <analyzer type="query">
        …
     </analyzer>
   </fieldType>
<fieldType name="text" class="solr.TextField"
       positionIncrementGap="100">
    <analyzer type="index">
       …
    </analyzer>
    <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" expand="true"
         synonyms="synonyms.txt" ignoreCase="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
         words="stopwords.txt"/>
     <filter class="solr.WordDelimiterFilterFactory" catenateAll="0"
         generateWordParts="1" generateNumberParts="1"
         catenateWords="0" catenateNumbers="0"
         splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory"
         protected="protwords.txt"/>
     <filter class="solr.PorterStemFilterFactory"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>
Text analysis
1.   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
2.   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
             ignoreCase="true" expand="false"/>
3.   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
4.   <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
             catenateAll="0" splitOnCaseChange="1"/>
5.   <filter class="solr.LowerCaseFilterFactory"/>
6.   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>



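Input) I can fly with SolrSearch if you can.
1. I, can, fly, with, SolrSearch, if, you, can. (split on whitespace only, so the final period stays attached)
2. I, can, run, with, SolrSearch, if, you, can. ("fly=>run" in synonyms.txt)
3. I, can, run, SolrSearch, if, you, can. ("with" in stopwords.txt)
4. I, can, run, Solr, Search, if, you, can (split on case change and punctuation; because catenateWords="1", the joined form SolrSearch is also emitted)
5. i, can, run, solr, search, if, you, can (plus solrsearch)
6. i, can, run, solr, search, if, you, can (only duplicates at the same position are removed, so the repeated "can" remains)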
Indexing data
• http://localhost:8983/solr/update
• Sending XML to Solr
     <add>
      <doc>
       <field name="employeeId">05991</field>
       <field name="office">Bridgewater</field>
       <field name="skills">Perl</field>
       <field name="skills">Java</field>
      </doc>
      [<doc> ... </doc>[<doc> ... </doc>]]
     </add>


• Deleting documents by ID and by Query
     <delete><id>05991</id></delete>
     <delete><query>office:Bridgewater</query></delete>
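
The <add> message shown above can also be assembled in code before it is posted to /solr/update. The following is a minimal sketch using only the JDK; the class name SolrAddXml and its methods are hypothetical helpers invented for illustration, and the HTTP POST itself is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that builds a single-document <add> message for /solr/update. */
final class SolrAddXml {
    private final List<String[]> fields = new ArrayList<>();

    /** Adds one field; call again with the same name for a multi-valued field. */
    SolrAddXml field(String name, String value) {
        fields.add(new String[] {name, value});
        return this;
    }

    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Renders the collected fields as <add><doc>...</doc></add>. */
    String toXml() {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (String[] f : fields) {
            sb.append("<field name=\"").append(escape(f[0])).append("\">")
              .append(escape(f[1])).append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }
}
```

For example, new SolrAddXml().field("employeeId", "05991").field("skills", "Perl").field("skills", "Java").toXml() yields the same structure as the slide, with "skills" repeated once per value.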
DataImportHandler
• Imports data directly into Solr from XML or a relational DB
• solrconfig.xml
   <requestHandler name="/dataimport"
      class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
       <str name="config">data-config.xml</str>
      </lst>
    </requestHandler>

• http://localhost:8983/solr/dataimport
Direct database import
• data-config.xml
   <dataConfig>
   <dataSource driver="org.hsqldb.jdbcDriver"
      url="jdbc:hsqldb:/temp/example/ex" />
     <document name="products">
         <entity name="item" query="select * from item">
            <field column="ID" name="id" />
            <field column="NAME" name="name" />
            <field column="MANU" name="manu" />
            <field column="PRICE" name="price" />
         </entity>
     </document>
   </dataConfig>
Delta-import command
• Imports only the data created since the last import
• The time of the last import is recorded in conf/dataimport.properties
• http://localhost:8983/solr/dataimport?command=delta-import
• data-config.xml
  <dataConfig>
    <dataSource driver="org.hsqldb.jdbcDriver"
     url="jdbc:hsqldb:/temp/example/ex" user="sa" />
    <document name="products">
        <entity name="item" pk="ID"
              query="select * from item"
              deltaImportQuery="select * from item where
                    ID='${dih.delta.id}'"
              deltaQuery="select id from item where
                    last_modified &gt; '${dih.last_index_time}'">
        </entity>
    </document>
  </dataConfig>
Basic searching
• http://localhost:8983/solr/select?<query parameters>
   –   q: query string
   –   q.op: query operator (AND/OR)
   –   fl: list of fields to return
   –   sort: sort order (<field name> asc|desc)
   –   qt: query type (request handler defined in solrconfig.xml)
   –   start: offset of the first result (default: 0)
   –   rows: number of rows included in the result (default: 10)
   –   wt: response format (xml, json, javabin, python, etc.)
   –   hl: whether to enable highlighting
   –   hl.fl: fields to highlight
   q=video&fl=name,id
   q=video&sort=price desc
   q=video card&fl=name,id&hl=true&hl.fl=name,features
   q=video&fl=name,id&wt=json
   q=video&fl=name,id&start=20&rows=10
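
When such a query is sent from code rather than typed into a browser, spaces and commas in the parameter values have to be URL-encoded. The sketch below builds a /select URL with the JDK only; the class SelectUrl is a hypothetical helper, and the paging parameter is spelled rows, which is the name Solr expects.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that assembles a /select URL from query parameters. */
final class SelectUrl {
    /** Appends the parameters in insertion order, URL-encoding names and values. */
    static String build(String baseUrl, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(baseUrl);
        char sep = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(sep).append(encode(e.getKey())).append('=').append(encode(e.getValue()));
            sep = '&';
        }
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available on the JVM
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "video card");
        params.put("fl", "name,id");
        params.put("start", "20"); // skip the first 20 hits
        params.put("rows", "10");  // return 10 hits per page
        System.out.println(build("http://localhost:8983/solr/select", params));
    }
}
```

A LinkedHashMap keeps the parameters in the order they were added, which makes the resulting URL easy to compare with the hand-written examples.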
Request handlers
• solrconfig.xml
    <requestHandler name="standard" class="solr.SearchHandler"
      default="true">
      <!-- default values for query parameters -->
       <lst name="defaults">
        <str name="echoParams">explicit</str>
        <!--
        <int name="rows">10</int>
        <str name="fl">*</str>
        <str name="version">2.1</str>
         -->
       </lst>
    </requestHandler>
Dismax query
• Disjunction Max query
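   – Disjunction: the query is run against multiple fields
   – Max: the highest of the per-field scores, each weighted by its field boost, determines the score
• Query parameters
   – qf: list of fields and their boosts (e.g., subject^2.3 content tag^1.2)
   – mm, pf, ps, tie, bq, etc.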
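
How the "max" and the tie parameter interact can be shown with a few lines of arithmetic: the best field score is taken as is, and the remaining field scores are added after being multiplied by tie. The sketch below is only an illustration of that formula; the class DisMaxScore and the sample numbers are made up, and real scores come from Lucene's similarity calculation.

```java
/** Illustrates how a disjunction-max score combines per-field scores. */
final class DisMaxScore {
    /**
     * @param fieldScores per-field scores, assumed to be already multiplied by their qf boosts
     * @param tie         tie-breaker between 0 and 1; 0 keeps only the best field, 1 sums all fields
     * @return max(fieldScores) + tie * (sum of the other field scores)
     */
    static double combine(double[] fieldScores, double tie) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : fieldScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tie * (sum - max);
    }
}
```

With hypothetical field scores of 2.0 for subject and 0.5 for content and tie=0.01, the combined score is 2.0 + 0.01 * 0.5 = 2.005, so the best-matching field dominates while the other field only breaks ties.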
Dismax query handler
• solrconfig.xml
    <requestHandler name="dismax" class="solr.SearchHandler" >
      <lst name="defaults">
       <str name="defType">dismax</str>
       <str name="sort">regdttm desc</str>
       <float name="tie">0.01</float>
       <str name="qf">content^0.5 subject^2.0</str>
       <str name="pf">content^0.5 subject^2.0</str>
       <str name="fl">*</str>
       <bool name="hl">true</bool>
       <str name="hl.fl">subject content</str>
      </lst>
    </requestHandler>
More-like-this(MLT) search
• Related-document search
MLT query
• Query parameters
   –   mlt: enable MLT (true/false)
   –   mlt.count: number of MLT result rows (default: 5)
   –   mlt.fl: fields to use for the MLT search
   –   mlt.qf: field boosts
MLT handler
• solrconfig.xml
    <requestHandler name="mlt"
      class="solr.MoreLikeThisHandler">
       <lst name="defaults">
         <int name="rows">5</int>
         <str name="fl">qid,subject</str>
         <bool name="mlt">true</bool>
         <str name="mlt.fl">subject,content</str>
         <bool name="mlt.boost">true</bool>
         <str name="mlt.qf">content^0.5 subject^2.0</str>
       </lst>
    </requestHandler>
Solr cores
• Indexing multiple data sets
• Single core
   – http://localhost:8983/solr/select?q=lucene
• Multicore
   – http://localhost:8983/solr/news/select?q=ios
   – http://localhost:8983/solr/blog/select?q=android
• solr.xml
   <solr sharedLib="lib">
   <cores adminPath="/admin/cores">
      <core name="news" instanceDir="news" />
      <core name="blog" instanceDir="blog" />
   </cores>
   </solr>
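PART II. Applications
과민성장증후군 (irritable bowel syndrome)
과민/성장/증후군? (wrong split: "hypersensitive / growth / syndrome")
과민성/장/증후군 (correct split: "irritable / bowel / syndrome")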
• Building a medical terminology dictionary
 – compounds.dic: 937 entries
   가는근육잔섬유:가는,근육,잔,섬유
   가려움약:가려움,약
   가로무늬근:가로무늬,근
   고알도스테론증:고,알도스테론,증
   고요산혈증:고,요산,혈증
   과민성장증후군:과민성,장,증후,군
   근육위축가쪽경화증:근육,위축,가쪽,경화증
   뇌없음증:뇌,없음,증
   헛배부름:헛,배,부름
   판토록:판토록


 – extension.dic: 1,071 entries
   가드너,10000X
   가려움,10000X
   가로막,10000X
   가습기,10000X
   가와사키,10000X
   가족성,10000X
   가학성,10000X
   각기병,10000X
   각화,10000X
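
The compounds.dic entries above follow a simple "compound:part,part,..." layout, so loading them into a lookup table takes only a few lines. This is a minimal sketch of such a loader, not the analyzer's own code; the class CompoundDictionary is hypothetical, and reading the file from disk is left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Parses lines in the "compound:part1,part2,..." format into a lookup table. */
final class CompoundDictionary {
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> entries = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (line.isEmpty() || colon <= 0) {
                continue; // skip blank lines and lines without a compound before the colon
            }
            String compound = line.substring(0, colon);
            List<String> parts = Arrays.asList(line.substring(colon + 1).split(","));
            entries.put(compound, parts);
        }
        return entries;
    }
}
```

Given the line "과민성장증후군:과민성,장,증후,군", the table maps the compound to the four parts listed after the colon, which is the split the analyzer should prefer over a guessed one.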
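만성간염 (chronic hepatitis)
Compound noun decomposition
• Compound noun search
  – "만성간염" is decomposed into "만성" (chronic) and "간염" (hepatitis)
  – Search results: "B형간염" (hepatitis B), "A형간염" (hepatitis A), "간염" (hepatitis), "만성간염" (chronic hepatitis), because a document matches if it contains either part
• Korlucene modified so that the parts are combined with AND
  – Before the change



  – After the change



  – Reference) http://cafe.naver.com/korlucene/116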
Keyword suggestion
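     • Jamo decomposition: initial consonant (choseong), medial vowel (jungseong), final consonant (jongseong)
       e.g.) 기미 = ㄱ+ㅣ+ㅁ+ㅣ
     • Unicode Hangul syllable code table
       – Initial consonants: 19
       – Medial vowels: 21
       – Final consonants: 28 (including "no final consonant")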
KoreanJasoFilter.java
int a, b, c; // jamo indices, in order: initial (choseong) / medial (jungseong) / final (jongseong)
for (int i = 0; i < termLength; i++) {
   char ch = termBuffer[i];
   if (ch >= 0xAC00 && ch <= 0xD7A3) {
       // Decompose characters in the range "AC00:가" ~ "D7A3:힣"
       c = ch - 0xAC00;
       a = c / (21 * 28);
       c = c % (21 * 28);
       b = c / 28;
       c = c % 28;
       buffer.append(ChoSung[a]).append(JwungSung[b]);
       if (c != 0) // c != 0 means the syllable has a final consonant
           buffer.append(JongSung[c]);
   }
}
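
The fragment above depends on lookup tables (ChoSung, JwungSung, JongSung) that the slide does not show. The following standalone sketch fills them in under two assumptions that are not from the original: the output uses Hangul compatibility jamo (U+3131 to U+3163), and non-Hangul characters pass through unchanged. The class name JasoDecomposer is hypothetical, and the tables are written as Unicode escapes so the source file stays plain ASCII.

```java
/** Standalone sketch of the jamo decomposition idea used by KoreanJasoFilter. */
final class JasoDecomposer {
    // 19 initial consonants (choseong) as compatibility jamo, U+3131 ... U+314E.
    private static final String CHOSEONG =
        "\u3131\u3132\u3134\u3137\u3138\u3139\u3141\u3142\u3143\u3145"
      + "\u3146\u3147\u3148\u3149\u314A\u314B\u314C\u314D\u314E";
    // 27 final consonants (jongseong); index 0 in the syllable formula means "no final".
    private static final String JONGSEONG =
        "\u3131\u3132\u3133\u3134\u3135\u3136\u3137\u3139\u313A\u313B"
      + "\u313C\u313D\u313E\u313F\u3140\u3141\u3142\u3144\u3145\u3146"
      + "\u3147\u3148\u314A\u314B\u314C\u314D\u314E";
    // The 21 medial vowels (jungseong) are contiguous, starting at U+314F.
    private static final char FIRST_VOWEL = '\u314F';

    static String decompose(String term) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char ch = term.charAt(i);
            if (ch < 0xAC00 || ch > 0xD7A3) {
                out.append(ch); // assumption: non-Hangul characters are kept as they are
                continue;
            }
            int offset = ch - 0xAC00;               // position within the syllable block
            int initial = offset / (21 * 28);       // 0..18
            int medial = (offset % (21 * 28)) / 28; // 0..20
            int fin = offset % 28;                  // 0 = no final consonant, else 1..27
            out.append(CHOSEONG.charAt(initial)).append((char) (FIRST_VOWEL + medial));
            if (fin != 0) {
                out.append(JONGSEONG.charAt(fin - 1));
            }
        }
        return out.toString();
    }
}
```

Calling JasoDecomposer.decompose("감기") returns "ㄱㅏㅁㄱㅣ": 감 has offset 16 from U+AC00, which gives initial 0, medial 0 and final 16, and 기 has offset 560, which gives initial 0, medial 20 and no final.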
• schema.xml
   <fieldType name="jasoNgramFront" class="solr.TextField"
        positionIncrementGap="100">
       <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="KoreanJasoFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory"
             maxGramSize="50" minGramSize="1" side="front" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
       </analyzer>
       <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="KoreanJasoFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
   </fieldType>
• index analyzer : 감기
    1. JasoFilter:
         ㄱㅏㅁㄱㅣ
    2. NgramFilter
         ㄱ
         ㄱㅏ
         ㄱㅏㅁ
         ㄱㅏㅁㄱ
         ㄱㅏㅁㄱㅣ


•   query analyzer : 가
    1. JasoFilter:
         ㄱㅏ
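
The reason a partially typed query finds the full word is that every prefix of the decomposed term is stored as its own token at index time, while the query side is decomposed but not n-grammed. The sketch below reproduces that behaviour in plain Java; the class FrontEdgeNGrams is hypothetical and only imitates the front edge n-gram step, it does not call Lucene.

```java
import java.util.ArrayList;
import java.util.List;

/** Front edge n-grams, imitating an edge n-gram filter configured with side="front". */
final class FrontEdgeNGrams {
    /** Returns the prefixes of token whose length is between minGram and maxGram. */
    static List<String> grams(String token, int minGram, int maxGram) {
        List<String> result = new ArrayList<>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            result.add(token.substring(0, len));
        }
        return result;
    }

    /** A query token matches when it equals one of the indexed prefixes (sizes 1 to 50). */
    static boolean matches(String indexedToken, String queryToken) {
        return grams(indexedToken, 1, 50).contains(queryToken);
    }
}
```

For the indexed term "ㄱㅏㅁㄱㅣ" the method grams(term, 1, 50) returns the five prefixes listed above, so the query token "ㄱㅏ" produced from typing "가" matches, while a token such as "ㄴ" does not.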
jQuery autocomplete
• http://jqueryui.com/autocomplete/
               <script>
               $(function() {
                   var availableTags = [
                      "Clojure",
                      "Java",
                      "JavaScript",
                      "Scala",
                   ];
                   $("#tags").autocomplete({source: availableTags});
               });
               </script>
               <label for="tags">Tags: </label>
               <input id="tags" />
Ponket
• category schema.xml
  <fields>
    <field name="categoryid" type="int" indexed="true" stored="true"
  required="true" />
    <field name="title" type="string" indexed="false" stored="true"/>
    <field name="parentid" type="int" indexed="false" stored="true" />
    <field name="tag" type="text_ko" indexed="true" stored="true"
  termVectors="true" />
    <field name="regdttm" type="date" indexed="false" />
    <field name="moddttm" type="date" indexed="false" />
   </fields>
• catedeal schema.xml
   <fields>
    <field name="dealid" type="int" indexed="true" stored="true"
  required="true" />
    <field name="title" type="text_ko" indexed="true" stored="true"
        termVectors="true" />
    <field name="description" type="text_ko" indexed="true"
        termVectors="true" stored="true" />
    <field name="categoryid" type="int" indexed="false" stored="true" />
    <field name="regdttm" type="date" indexed="false" />
    <field name="moddttm" type="date" indexed="false" />
   </fields>
Category Tag Data
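• Category classification flow
   1. Start
   2. Fetch deal data
   3. Search the category index
   4. Does a matching category exist?
      – Yes: use that category
      – No: search the catedeal index, group the hits by categoryid and sum
        their scores, then select the category with the highest score
   5. Save to catedeal
   6. End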
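
The "grouping by categoryid and summing score" step of the flow can be sketched as a small voting routine: each similar deal found in catedeal votes for its own category with its search score, and the category with the largest total wins. The class CategoryVoter, the Hit type and the use of -1 for "no hits" are assumptions made for this illustration, not part of the Ponket code.

```java
import java.util.HashMap;
import java.util.Map;

/** Picks a category for a new deal by summing the scores of similar catedeal hits. */
final class CategoryVoter {
    /** One search hit: the category of an already classified deal and its search score. */
    static final class Hit {
        final int categoryId;
        final double score;

        Hit(int categoryId, double score) {
            this.categoryId = categoryId;
            this.score = score;
        }
    }

    /** Returns the categoryid with the highest summed score, or -1 when there are no hits. */
    static int bestCategory(Iterable<Hit> hits) {
        Map<Integer, Double> totals = new HashMap<>();
        for (Hit h : hits) {
            totals.merge(h.categoryId, h.score, Double::sum); // group by categoryid, sum scores
        }
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Integer, Double> e : totals.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }
}
```

With hypothetical hits (category 10, score 1.2), (category 20, score 2.0) and (category 10, score 1.1), category 10 wins with a total of 2.3 even though the single best hit belongs to category 20.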
Location search
• Uses Solr spatial search
• Schema configuration
<fieldType name="latlon" class="solr.LatLonType"
    subFieldSuffix="_coordinate"/>
<field name="location" type="latlon"
    indexed="true" stored="true"/>

• Data
<field name="location">37.775,-122.4232</field>
<field name="location">40.7143,-74.006</field>
• geofilt - The distance filter
   – pt: center point as latitude,longitude
   – sfield: the location field to search
   – d: distance (km)
  http://localhost:8983/solr/select?wt=json&...&q=*:*
  &fq={!geofilt pt=45.15,-93.85 sfield=location d=5}

• Search result
  "response":{"numFound":8,"start":0,"docs":[
     { "name":"Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133",
       "location":"45.17614,-93.87341"},
     { "name":"Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300",
       "location":"45.17614,-93.87341"}, …
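
To see why these documents pass a filter with pt=45.15,-93.85 and d=5, it helps to compute the great-circle distance by hand. The sketch below uses the haversine formula with a mean Earth radius of 6371 km; both choices are assumptions for illustration and are not claimed to be the exact calculation Solr performs, and the class GeoDistance is hypothetical.

```java
/** Great-circle distance check, illustrating what a radius filter has to decide. */
final class GeoDistance {
    private static final double EARTH_RADIUS_KM = 6371.0; // mean radius, an approximation

    /** Haversine distance in kilometres between two latitude/longitude points in degrees. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** True when the second point lies within dKm kilometres of the first. */
    static boolean within(double lat1, double lon1, double lat2, double lon2, double dKm) {
        return haversineKm(lat1, lon1, lat2, lon2) <= dKm;
    }
}
```

The point 45.17614,-93.87341 comes out at roughly 3.4 km from 45.15,-93.85, which is inside the 5 km radius, whereas the sample data points in San Francisco and New York are thousands of kilometres apart and would not match each other.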
• Java
  SolrQuery query = new SolrQuery()
     .setFilterQueries("{!geofilt pt=" + spatial.getLocation() +
        " sfield=location d=" + spatial.getDistance() + "}");
Search?
It's not hard.
As long as you have Solr...
Contacts
wolfkang@gmail.com
http://cafe.naver.com/korlucene Q&A board
http://facebook.com/kangdonghyeok
Thank you.

More Related Content

What's hot

APEX Behind the Scenes by Scott Spendolini
APEX Behind the Scenes by Scott SpendoliniAPEX Behind the Scenes by Scott Spendolini
APEX Behind the Scenes by Scott Spendolini
Enkitec
 

What's hot (20)

MongoDB and Node.js
MongoDB and Node.jsMongoDB and Node.js
MongoDB and Node.js
 
Full Text Search In PostgreSQL
Full Text Search In PostgreSQLFull Text Search In PostgreSQL
Full Text Search In PostgreSQL
 
Introduction to Apache solr
Introduction to Apache solrIntroduction to Apache solr
Introduction to Apache solr
 
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer SimonDocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
 
Basics of ssl
Basics of sslBasics of ssl
Basics of ssl
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
 
Solr workshop
Solr workshopSolr workshop
Solr workshop
 
Solr Introduction
Solr IntroductionSolr Introduction
Solr Introduction
 
Developing RESTful Web APIs with Python, Flask and MongoDB
Developing RESTful Web APIs with Python, Flask and MongoDBDeveloping RESTful Web APIs with Python, Flask and MongoDB
Developing RESTful Web APIs with Python, Flask and MongoDB
 
Parallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysParallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected Ways
 
Java EE 7 Batch processing in the Real World
Java EE 7 Batch processing in the Real WorldJava EE 7 Batch processing in the Real World
Java EE 7 Batch processing in the Real World
 
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise UsersApache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
 
OAuth 2.0 Security Reinforced
OAuth 2.0 Security ReinforcedOAuth 2.0 Security Reinforced
OAuth 2.0 Security Reinforced
 
Spark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean Wampler
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
 
Django
DjangoDjango
Django
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
 
APEX Behind the Scenes by Scott Spendolini
APEX Behind the Scenes by Scott SpendoliniAPEX Behind the Scenes by Scott Spendolini
APEX Behind the Scenes by Scott Spendolini
 
Understanding of Apache kafka metrics for monitoring
Understanding of Apache kafka metrics for monitoring Understanding of Apache kafka metrics for monitoring
Understanding of Apache kafka metrics for monitoring
 
An introduction to OAuth 2
An introduction to OAuth 2An introduction to OAuth 2
An introduction to OAuth 2
 

Similar to [제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자

Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
Erik Hatcher
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
Erik Hatcher
 
Solr integration in Magento Enterprise
Solr integration in Magento EnterpriseSolr integration in Magento Enterprise
Solr integration in Magento Enterprise
Tobias Zander
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
Tommaso Teofili
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
Erik Hatcher
 

Similar to [제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자 (20)

Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
 
Apache Solr Search Mastery
Apache Solr Search MasteryApache Solr Search Mastery
Apache Solr Search Mastery
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Solr integration in Magento Enterprise
Solr integration in Magento EnterpriseSolr integration in Magento Enterprise
Solr integration in Magento Enterprise
 
Solr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by CaseSolr vs. Elasticsearch - Case by Case
Solr vs. Elasticsearch - Case by Case
 
Solr02 fields
Solr02 fieldsSolr02 fields
Solr02 fields
 
Solr Anti-Patterns: Presented by Rafał Kuć, Sematext
Solr Anti-Patterns: Presented by Rafał Kuć, SematextSolr Anti-Patterns: Presented by Rafał Kuć, Sematext
Solr Anti-Patterns: Presented by Rafał Kuć, Sematext
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Rapid prototyping search applications with solr
Rapid prototyping search applications with solrRapid prototyping search applications with solr
Rapid prototyping search applications with solr
 
Drupal for ng_os
Drupal for ng_osDrupal for ng_os
Drupal for ng_os
 
Apache solr liferay
Apache solr liferayApache solr liferay
Apache solr liferay
 
Open Source Search: An Analysis
Open Source Search: An AnalysisOpen Source Search: An Analysis
Open Source Search: An Analysis
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Solr vs. Elasticsearch, Case by Case: Presented by Alexandre Rafalovitch, UN
Solr vs. Elasticsearch,  Case by Case: Presented by Alexandre Rafalovitch, UNSolr vs. Elasticsearch,  Case by Case: Presented by Alexandre Rafalovitch, UN
Solr vs. Elasticsearch, Case by Case: Presented by Alexandre Rafalovitch, UN
 
20150210 solr introdution
20150210 solr introdution20150210 solr introdution
20150210 solr introdution
 
IT talk SPb "Full text search for lazy guys"
IT talk SPb "Full text search for lazy guys" IT talk SPb "Full text search for lazy guys"
IT talk SPb "Full text search for lazy guys"
 
Searching for AI - Leveraging Solr for classic Artificial Intelligence tasks
Searching for AI - Leveraging Solr for classic Artificial Intelligence tasksSearching for AI - Leveraging Solr for classic Artificial Intelligence tasks
Searching for AI - Leveraging Solr for classic Artificial Intelligence tasks
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 

More from Donghyeok Kang (8)

Divi custom post type template
Divi custom post type templateDivi custom post type template
Divi custom post type template
 
My second word press plugin
My second word press pluginMy second word press plugin
My second word press plugin
 
My first word press plugin
My first word press pluginMy first word press plugin
My first word press plugin
 
Docker based web hosting
Docker based web hostingDocker based web hosting
Docker based web hosting
 
Flutter Beta but Better and Better
Flutter Beta but Better and BetterFlutter Beta but Better and Better
Flutter Beta but Better and Better
 
Java Annotation과 MyBatis로 나만의 ORM Framework을 만들어보자
Java Annotation과 MyBatis로 나만의 ORM Framework을 만들어보자Java Annotation과 MyBatis로 나만의 ORM Framework을 만들어보자
Java Annotation과 MyBatis로 나만의 ORM Framework을 만들어보자
 
워드프레스 플러그인 개발 입문
워드프레스 플러그인 개발 입문워드프레스 플러그인 개발 입문
워드프레스 플러그인 개발 입문
 
Curated News Platform
Curated News PlatformCurated News Platform
Curated News Platform
 

[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자

  • 1. Solr로 나만의 검색엔진을 만들어보자 - 제1회 루씬 한글분석기 기술세미나 - 강동혁 ㈜맥스트 책임연구원 2013-04-12
  • 2. ’02 대용량 텍스트 데이터베이스의 효율적인 점진적 클러스터링 기법 – 텍스트 마이닝 ‘09 VitaminMD – 사이트 통합검색 공개SW 공모대전 - 루씬한글분석기 ‘12 Ponket – 검색 및 쿠폰 자동분류 ‘13 Augmented Reality Platform - 위치검색
  • 3. PART I. 기초편 PART II. 활용편
  • 5. An introduction to Solr • Lucene 기반 – 텍스트 검색 엔진 라이브러리 – 다양한 텍스트 분석기 제공 – scoring 알고리즘 – text highlighter • Solr는 Lucene의 서버 버전 – 문서 색인 및 질의를 HTTP request로 처리 – 빠른 질의 성능을 위해 cache 사용 – 웹기반 운영툴 제공 – 스키마, 서버 설정 파일 – disjunction-max 질의 – more-like-this 플러그인 – 분산 서버
  • 6. Install & Deploy 1. http://lucene.apache.org/solr/ 접속한다. 2. 최신버전 (solr-4.2.1.zip) 다운로드한다. 3. 압축을 푼다. 4. cd example java –jar start.jar 5. http://localhost:8983/solr/admin 확인한다. • Apache Tomcat 이용 시 dist/solr-4.x.x.war 를 Apache Tomcat 의 webapps 에 복사한다.
  • 7. Schema & configuration file • schema.xml – 검색하고자 하는 데이터의 구조 – 데이터 필드 타입(<types>), 데이터 필드(<fields>) – unique identified, 기본 검색 필드, 기본 연산자(AND/OR) – 필드 복사 • solrconfig.xml – 색인, 질의를 위한 파라미터 – <requestHandler>
  • 8. schema.xml <types> <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/> … </type> <fields> <field name="id" type="string" indexed="true" stored="true“ required="true" /> … </fields> <uniqueKey>id</uniqueKey> <defaultSearchField>text</defaultSearchField> <copyField source="category" dest="text"/> <copyField source="name" dest="text"/>
  • 9. schema.xml types <types> <fieldType name="string" class="solr.StrField" sortMissingLast="true“ omitNorms="true"/> <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/> <fieldType name="integer" class="solr.IntField" omitNorms="true"/> <fieldType name="long" class="solr.LongField" omitNorms="true"/> <fieldType name="float" class="solr.FloatField" omitNorms="true"/> <fieldType name="double" class="solr.DoubleField" omitNorms="true“/> </type>
  • 10. schema.xml fields <fields> <field name="id" type="string" indexed="true" stored="true" required="true" /> <field name="sku" type="textTight" indexed="true" stored="true" omitNorms="true"/> <field name="name" type="text" indexed="true" stored="true"/> <field name="nameSort" type="string" indexed="true" stored="false"/> <field name="text" type="text" indexed="true" stored="false" multiValued="true"/> </fields>
  • 11. Custom field types <fieldType name="text" class="solr.TextField“ positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/> … </analyzer> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> … </analyzer> </fieldType>
  • 12. <fieldType name="text" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" expand="false“ synonyms="index_synonyms.txt" ignoreCase="true"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/> <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/> <filter class="solr.PorterStemFilterFactory"/> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> </analyzer> <analyzer type="query"> … </analyzer> </fieldType>
  • 13. <fieldType name="text" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> … </analyzer> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" expand="true“ synonyms="synonyms.txt" ignoreCase="true"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/> <filter class="solr.WordDelimiterFilterFactory" catenateAll="0“ generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/> <filter class="solr.PorterStemFilterFactory"/> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> </analyzer> </fieldType>
  • 14. Text analysis 1. <tokenizer class="solr.WhitespaceTokenizerFactory"/> 2. <filter class="solr.SynonymFilterFactory" synonyms=“synonyms.txt“ ignoreCase="true" expand="false"/> 3. <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/> 4. <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1“ generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0“ splitOnCaseChange="1"/> 5. <filter class="solr.LowerCaseFilterFactory"/> 6. <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> Input) I can fly with SolrSearch if you can. 1. I, can, fly, with, SolrSearch, if, you, can 2. I, can, run, with, SolrSearch, if, you, can (“fly=>run” in synonyms.txt) 3. I, can, run, SolrSearch, if, you, can (“with” in stopwords.txt) 4. I, can, run, Solr, Search, if, you, can 5. i, can, run, solr, search, if, you, can 6. i, can, run, solr, search, if, you
  • 15.
  • 16. Indexing data • http://localhost:8983/solr/update • Sending XML to Solr <add> <doc> <field name="employeeId">05991</field> <field name="office">Bridgewater</field> <field name="skills">Perl</field> <field name="skills">Java</field> </doc> [<doc> ... </doc>[<doc> ... </doc>]] </add> • Deleting documents by ID and by Query <delete><id>05991</id></delete> <delete><query>office:Bridgewater</query></delete>
  • 17. DataImportHandler • XML 또는 relational DB 에서 직접 Solr 로 data를 import • solrconfig.xml <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler" > <lst name="defaults"> <str name="config“>data-config.xml</str> </lst> </requestHandler> • http://localhost:8983/solr/dataimport
  • 18. Direct database import • data-config.xml <dataConfig> <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex“ /> <document name="products"> <entity name="item" query="select * from item"> <field column="ID" name="id" /> <field column="NAME" name="name" /> <field column="MANU" name="manu" /> <field column="PRICE" name="price" /> </entity> </document> </dataConfig>
  • 19. Delta-import command • 마지막 import 이후에 생성된 데이터만 import • conf/dataimport.properties 에 마지막 import 시간 • http://localhost:8983/solr/dataimport?command=delta- import
  • 20. • data-config.xml <dataConfig> <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" /> <document name="products"> <entity name="item" pk="ID" query="select * from item" deltaImportQuery="select * from item where ID='${dih.delta.id}'" deltaQuery="select id from item where last_modified &gt; '${dih.last_index_time}'"> </entity> </document> </dataConfig>
  • 21. Basic searching • http://localhost:8983/solr/select?<query parameters> – q: query string – q.op: 질의 연산자(AND/OR) – sort: 정렬 방식(필드명 asc[desc]) – qt: query type(solrconfig.xml) – start: 시작 index(default: 0) – row: 검색 결과에 포함되는 row 수(default: 10) – wt: response format(xml, json, javabin, python, etc) – hl: 하이라이팅 여부 – hl.fl: 하이라이팅 필드 q=video&fl=name,id q=video&sort=price desc q=video card&fl=name,id&hl=true&hl.fl=name,features q=video&fl=name,id&wt=json q=video&fl=name,id&start=20&row=10
  • 22. Request handlers • solrconfig.xml <requestHandler name="standard" class="solr.SearchHandler" default="true"> <!-- default values for query parameters --> <lst name="defaults"> <str name="echoParams">explicit</str> <!-- <int name="rows">10</int> <str name="fl">*</str> <str name="version">2.1</str> --> </lst> </requestHandler>
  • 23. Dismax query • Disjunction Max query – Disjunction: 질의가 여러 필드를 대상으로 수행 – Max: 검색 대상 필드들의 가중치에 따라 score결정 • Query parameters – qf: 필드 목록 및 가중치(예, subject^2.3 content tag^1.2) – mm, pf, ps, tie, bq 등
  • 24. Dismax query handler • solrconfig.xml <requestHandler name="dismax" class="solr.SearchHandler" > <lst name="defaults"> <str name="defType">dismax</str> <str name="sort">regdttm desc</str> <float name="tie">0.01</float> <str name="qf">content^0.5 subject^2.0</str> <str name="pf">content^0.5 subject^2.0</str> <str name="fl">*</str> <bool name="hl">true</bool> <str name="hl.fl">subject content</str> </lst> </requestHandler>
  • 26. MLT query
      • Query parameters
          – mlt: whether to enable MLT (true/false)
          – mlt.count: number of rows in the MLT result (default: 5)
          – mlt.fl: fields used for the MLT search
          – mlt.qf: field weights
  • 27. MLT handler
      • solrconfig.xml
      <requestHandler name="mlt" class="solr.MoreLikeThisHandler">
        <lst name="defaults">
          <int name="rows">5</int>
          <str name="fl">qid,subject</str>
          <bool name="mlt">true</bool>
          <str name="mlt.fl">subject,content</str>
          <bool name="mlt.boost">true</bool>
          <str name="mlt.qf">content^0.5 subject^2.0</str>
        </lst>
      </requestHandler>
  • 28. Solr cores
      • Index several data sets in one Solr instance
      • Single core
          – http://localhost:8983/solr/select?q=lucene
      • Multicore
          – http://localhost:8983/solr/news/select?q=ios
          – http://localhost:8983/solr/blog/select?q=android
      • solr.xml
      <solr sharedLib="lib">
        <cores adminPath="/admin/cores">
          <core name="news" instanceDir="news" />
          <core name="blog" instanceDir="blog" />
        </cores>
      </solr>
  • 35. • Building a medical-term dictionary
      – compounds.dic: 937 entries
          가는근육잔섬유:가는,근육,잔,섬유
          가려움약:가려움,약
          가로무늬근:가로무늬,근
          고알도스테론증:고,알도스테론,증
          고요산혈증:고,요산,혈증
          과민성장증후군:과민성,장,증후,군
          근육위축가쪽경화증:근육,위축,가쪽,경화증
          뇌없음증:뇌,없음,증
          헛배부름:헛,배,부름
          판토록:판토록
      – extension.dic: 1,071 entries
          가드너,10000X
          가려움,10000X
          가로막,10000X
          가습기,10000X
          가와사키,10000X
          가족성,10000X
          가학성,10000X
          각기병,10000X
          각화,10000X
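  • 37. Compound-noun decomposition
      • Compound-noun search
          – “만성간염” (chronic hepatitis) is split into “만성” (chronic) and “간염” (hepatitis)
          – Search results: “B형간염”, “A형간염”, “간염”, “만성간염”
      • Korlucene modified to support AND search
          – Before the change
          – After the change
          – See http://cafe.naver.com/korlucene/116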
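  • 38. Keyword suggestion
      • Jaso decomposition: initial consonant (choseong), medial vowel (jungseong), final consonant (jongseong)
          e.g. 기미 = ㄱ+ㅣ+ㅁ+ㅣ
      • Hangul syllable code table (Unicode)
          – Initial consonants: 19
          – Medial vowels: 21
          – Final consonants: 28 (including "no final consonant")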
  • 39. KoreanJasoFilter.java
      int a, b, c; // jaso indices: initial, medial, final
      for (int i = 0; i < termLength; i++) {
          char ch = termBuffer[i];
          // Decompose only characters in the range AC00 (가) to D7A3 (힣)
          if (ch >= 0xAC00 && ch <= 0xD7A3) {
              c = ch - 0xAC00;
              a = c / (21 * 28);
              c = c % (21 * 28);
              b = c / 28;
              c = c % 28;
              buffer.append(ChoSung[a]).append(JwungSung[b]);
              if (c != 0) // c != 0 means the syllable has a final consonant
                  buffer.append(JongSung[c]);
          }
      }
  • 40. • schema.xml
      <fieldType name="jasoNgramFront" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="KoreanJasoFilterFactory"/>
          <filter class="solr.EdgeNGramFilterFactory" maxGramSize="50" minGramSize="1" side="front" />
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="KoreanJasoFilterFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>
  • 41. • index analyzer: 감기
          1. JasoFilter: ㄱㅏㅁㄱㅣ
          2. NgramFilter: ㄱ, ㄱㅏ, ㄱㅏㅁ, ㄱㅏㅁㄱ, ㄱㅏㅁㄱㅣ
      • query analyzer: 가
          1. JasoFilter: ㄱㅏ
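
The two analyzer chains above can be reproduced outside Solr. The sketch below is a standalone approximation, not the actual `KoreanJasoFilter` or Solr's `EdgeNGramFilterFactory`: it decomposes Hangul syllables with the same arithmetic as the filter fragment and then generates front edge n-grams. The class name `JasoEdgeNGram`, the jamo tables, and the choice to pass non-Hangul characters through unchanged are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone approximation of the jaso + edge n-gram analysis (not the Solr filters).
class JasoEdgeNGram {
    // Compatibility jamo in Unicode syllable order: 19 initials, 21 medials, 28 finals.
    private static final String CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ";
    private static final String JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ";
    // Index 0 is a placeholder for "no final consonant" and is never emitted.
    private static final String JONG = " ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ";

    // Splits each Hangul syllable (U+AC00..U+D7A3) into its jamo.
    static String decompose(String term) {
        StringBuilder out = new StringBuilder();
        for (char ch : term.toCharArray()) {
            if (ch >= 0xAC00 && ch <= 0xD7A3) {
                int code = ch - 0xAC00;
                int cho = code / (21 * 28);
                int jung = (code % (21 * 28)) / 28;
                int jong = code % 28;
                out.append(CHO.charAt(cho)).append(JUNG.charAt(jung));
                if (jong != 0) {
                    out.append(JONG.charAt(jong));
                }
            } else {
                out.append(ch); // assumption: keep non-Hangul characters as they are
            }
        }
        return out.toString();
    }

    // Front edge n-grams: every prefix whose length is between minGram and maxGram.
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= Math.min(maxGram, token.length()); n++) {
            grams.add(token.substring(0, n));
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> indexed = edgeNGrams(decompose("감기"), 1, 50);
        String query = decompose("가");
        System.out.println(indexed + " contains " + query + ": " + indexed.contains(query));
    }
}
```

Because the query side is decomposed but not n-grammed, a partially typed syllable such as 가 matches the indexed prefix ㄱㅏ of 감기, which is what lets suggestions appear before a syllable is complete.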
  • 42. jQuery autocomplete
      • http://jqueryui.com/autocomplete/
      <script>
      $(function() {
        var availableTags = [ "Clojure", "Java", "JavaScript", "Scala" ];
        $("#tags").autocomplete({source: availableTags});
      });
      </script>
      <label for="tags">Tags: </label>
      <input id="tags" />
  • 44. • category schema.xml
      <fields>
        <field name="categoryid" type="int" indexed="true" stored="true" required="true" />
        <field name="title" type="string" indexed="false" stored="true"/>
        <field name="parentid" type="int" indexed="false" stored="true" />
        <field name="tag" type="text_ko" indexed="true" stored="true" termVectors="true" />
        <field name="regdttm" type="date" indexed="false" />
        <field name="moddttm" type="date" indexed="false" />
      </fields>
      • catedeal schema.xml
      <fields>
        <field name="dealid" type="int" indexed="true" stored="true" required="true" />
        <field name="title" type="text_ko" indexed="true" stored="true" termVectors="true" />
        <field name="description" type="text_ko" indexed="true" termVectors="true" stored="true" />
        <field name="categoryid" type="int" indexed="false" stored="true" />
        <field name="regdttm" type="date" indexed="false" />
        <field name="moddttm" type="date" indexed="false" />
      </fields>
  • 46. • Category classification flowchart
      Start → Fetch deal data → Search category → Category exists? (Yes / No)
      → Search catedeal → Group by categoryid and sum scores
      → Select the category with the highest score → Save catedeal → End
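
The "group by categoryid and sum scores, then select the highest" step of the flowchart can be sketched in plain Java as follows. This is an illustration under assumptions, not the speaker's code: the `Hit` class stands in for a catedeal search result carrying a `categoryid` and a relevance score, and `pickCategory` returns -1 when there are no hits.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of score-summing category selection (hypothetical types).
class CategoryVote {

    // Stand-in for one catedeal search hit.
    static final class Hit {
        final int categoryId;
        final double score;

        Hit(int categoryId, double score) {
            this.categoryId = categoryId;
            this.score = score;
        }
    }

    // Sums scores per categoryid and returns the id with the largest total, or -1 if empty.
    static int pickCategory(List<Hit> hits) {
        Map<Integer, Double> totals = new LinkedHashMap<>();
        for (Hit h : hits) {
            totals.merge(h.categoryId, h.score, Double::sum);
        }
        int best = -1;
        double bestTotal = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Integer, Double> e : totals.entrySet()) {
            if (e.getValue() > bestTotal) {
                bestTotal = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(10, 1.2));
        hits.add(new Hit(20, 2.0));
        hits.add(new Hit(10, 1.1));
        // Category 10 totals 2.3 and beats category 20 with 2.0.
        System.out.println(pickCategory(hits));
    }
}
```

Summing rather than taking the single best hit lets several moderately similar deals in one category outweigh one strong match in another.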
  • 47. Location search
      • Uses Solr spatial search
      • Schema configuration
      <fieldType name="latlon" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
      <field name="location" type="latlon" indexed="true" stored="true"/>
      • Data
      <field name="location">37.775,-122.4232</field>
      <field name="location">40.7143,-74.006</field>
  • 48. • geofilt – The distance filter
          – pt: latitude,longitude of the center point
          – sfield: the field to run the location search on
          – d: distance (km)
      http://localhost:8983/solr/select?wt=json&...&q=*:*&fq={!geofilt pt=45.15,-93.85 sfield=location d=5}
      • Search result
      "response":{"numFound":8,"start":0,"docs":[
        { "name":"Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133",
          "location":"45.17614,-93.87341"},
        { "name":"Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300",
          "location":"45.17614,-93.87341"},
        …
      • Java
      SolrQuery query = new SolrQuery().setFilterQueries(
          "{!geofilt pt=" + spatial.getLocation() + " sfield=location d=" + spatial.getDistance() + "}");
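
Conceptually, a distance filter keeps the documents whose location lies within `d` kilometres of `pt`. The sketch below, not taken from the slides, shows that check with the haversine great-circle formula. The class `GeoFilter`, its method names, and the mean Earth radius of 6371 km are assumptions for illustration; Solr's own distance calculation may differ in detail.

```java
// Illustrative distance check in the spirit of geofilt (not Solr code).
class GeoFilter {
    private static final double EARTH_RADIUS_KM = 6371.0; // assumed mean radius

    // Great-circle distance in km between two (latitude, longitude) points in degrees.
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // True when the document location is at most dKm away from the center point.
    static boolean within(double ptLat, double ptLon, double docLat, double docLon, double dKm) {
        return haversineKm(ptLat, ptLon, docLat, docLon) <= dKm;
    }

    public static void main(String[] args) {
        // Center point and document location taken from the example request and response.
        System.out.println(within(45.15, -93.85, 45.17614, -93.87341, 5.0));
    }
}
```

The document at 45.17614,-93.87341 is roughly 3.4 km from the center point 45.15,-93.85, which is why it passes the `d=5` filter in the example response.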