
[1st Lucene Korean Analyzer Technical Seminar] Let's Build Your Own Search Engine with Solr

  1. 1. Let's Build Your Own Search Engine with Solr - 1st Lucene Korean Analyzer Technical Seminar - Kang Donghyeok, Senior Researcher, Maxst Co., Ltd., 2013-04-12
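  2. 2. '02: An Efficient Incremental Clustering Technique for Large Text Databases (text mining) • '09: VitaminMD (site-wide integrated search); Open Source Software Contest (Lucene Korean analyzer) • '12: Ponket (search and automatic coupon classification) • '13: Augmented Reality Platform (location search)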
  3. 3. PART I. Basics • PART II. Applications
  4. 4. PART I. Basics
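  5. 5. An introduction to Solr • Built on Lucene – a text search engine library – provides a variety of text analyzers – scoring algorithms – text highlighter • Solr is the server version of Lucene – handles document indexing and queries through HTTP requests – uses caches for fast query performance – provides a web-based administration tool – schema and server configuration files – disjunction-max queries – more-like-this plugin – distributed servers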
  6. 6. Install & Deploy 1. Go to http://lucene.apache.org/solr/. 2. Download the latest version (solr-4.2.1.zip at the time of this talk). 3. Unzip the archive. 4. cd example && java -jar start.jar 5. Check http://localhost:8983/solr/admin. • When using Apache Tomcat, copy dist/solr-4.x.x.war into Tomcat's webapps directory.
  7. 7. Schema & configuration file • schema.xml – the structure of the data to be searched – field types (<types>) and fields (<fields>) – unique key, default search field, default operator (AND/OR) – field copying (copyField) • solrconfig.xml – parameters for indexing and querying – <requestHandler>
  8. 8. schema.xml <types> <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/> … </types> <fields> <field name="id" type="string" indexed="true" stored="true" required="true" /> … </fields> <uniqueKey>id</uniqueKey> <defaultSearchField>text</defaultSearchField> <copyField source="category" dest="text"/> <copyField source="name" dest="text"/>
  9. 9. schema.xml types <types> <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/> <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/> <fieldType name="integer" class="solr.IntField" omitNorms="true"/> <fieldType name="long" class="solr.LongField" omitNorms="true"/> <fieldType name="float" class="solr.FloatField" omitNorms="true"/> <fieldType name="double" class="solr.DoubleField" omitNorms="true"/> </types>
  10. 10. schema.xml fields<fields> <field name="id" type="string" indexed="true" stored="true" required="true" /> <field name="sku" type="textTight" indexed="true" stored="true" omitNorms="true"/> <field name="name" type="text" indexed="true" stored="true"/> <field name="nameSort" type="string" indexed="true" stored="false"/> <field name="text" type="text" indexed="true" stored="false" multiValued="true"/></fields>
  11. 11. Custom field types <fieldType name="text" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/> … </analyzer> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> … </analyzer> </fieldType>
  12. 12. <fieldType name="text" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" expand="false" synonyms="index_synonyms.txt" ignoreCase="true"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/> <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/> <filter class="solr.PorterStemFilterFactory"/> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> </analyzer> <analyzer type="query"> … </analyzer> </fieldType>
  13. 13. <fieldType name="text" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> … </analyzer> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" expand="true" synonyms="synonyms.txt" ignoreCase="true"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/> <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/> <filter class="solr.PorterStemFilterFactory"/> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> </analyzer> </fieldType>
  14. 14. Text analysis 1. <tokenizer class="solr.WhitespaceTokenizerFactory"/> 2. <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/> 3. <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/> 4. <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/> 5. <filter class="solr.LowerCaseFilterFactory"/> 6. <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> Input) I can fly with SolrSearch if you can. 1. I, can, fly, with, SolrSearch, if, you, can 2. I, can, run, with, SolrSearch, if, you, can ("fly=>run" in synonyms.txt) 3. I, can, run, SolrSearch, if, you, can ("with" in stopwords.txt) 4. I, can, run, Solr, Search, SolrSearch, if, you, can (catenateWords="1" also emits the joined form) 5. i, can, run, solr, search, solrsearch, if, you, can 6. i, can, run, solr, search, solrsearch, if, you, can (RemoveDuplicatesTokenFilter drops a token only when an identical token sits at the same position, so the second "can" stays)
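
To make the order of the filters in the text analysis slide concrete, here is a minimal Java sketch that mimics steps 1 to 5 with plain string operations. It is a toy model, not Lucene code: the class and method names are hypothetical, punctuation is stripped up front for simplicity, and the catenateWords and RemoveDuplicates behaviour is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy model of the analysis chain; it does not use any Lucene or Solr classes. */
final class AnalysisChainSketch {

    /** synonyms and stopwords are expected to hold lower-case keys. */
    static List<String> analyze(String input, Map<String, String> synonyms, Set<String> stopwords) {
        List<String> out = new ArrayList<String>();
        // Step 1: split on whitespace.
        for (String raw : input.trim().split("\\s+")) {
            // Simplification: drop punctuation here (the real chain does this in step 4).
            String token = raw.replaceAll("[^\\p{L}\\p{N}]", "");
            if (token.isEmpty()) {
                continue;
            }
            // Step 2: replace a token by its synonym (ignoreCase, expand=false).
            String key = token.toLowerCase(Locale.ROOT);
            if (synonyms.containsKey(key)) {
                token = synonyms.get(key);
            }
            // Step 3: remove stop words (ignoreCase).
            if (stopwords.contains(token.toLowerCase(Locale.ROOT))) {
                continue;
            }
            // Step 4: split where a lower-case letter is followed by an upper-case one.
            for (String part : token.split("(?<=\\p{Ll})(?=\\p{Lu})")) {
                // Step 5: lower-case every part.
                out.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }
}
```

Calling `analyze` on the slide's input sentence with the synonym `fly -> run` and the stop word `with` yields `i, can, run, solr, search, if, you, can`.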
  15. 15. Indexing data • http://localhost:8983/solr/update • Sending XML to Solr <add> <doc> <field name="employeeId">05991</field> <field name="office">Bridgewater</field> <field name="skills">Perl</field> <field name="skills">Java</field> </doc> [<doc> ... </doc>[<doc> ... </doc>]] </add> • Deleting documents by ID and by query <delete><id>05991</id></delete> <delete><query>office:Bridgewater</query></delete>
  16. 16. DataImportHandler • Imports data into Solr directly from XML or a relational database • solrconfig.xml <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler" > <lst name="defaults"> <str name="config">data-config.xml</str> </lst> </requestHandler> • http://localhost:8983/solr/dataimport
  17. 17. Direct database import • data-config.xml <dataConfig> <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" /> <document name="products"> <entity name="item" query="select * from item"> <field column="ID" name="id" /> <field column="NAME" name="name" /> <field column="MANU" name="manu" /> <field column="PRICE" name="price" /> </entity> </document> </dataConfig>
  18. 18. Delta-import command • Imports only the data added or changed since the last import • The time of the last import is stored in conf/dataimport.properties • http://localhost:8983/solr/dataimport?command=delta-import
  19. 19. • data-config.xml <dataConfig> <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" /> <document name="products"> <entity name="item" pk="ID" query="select * from item" deltaImportQuery="select * from item where ID=${dih.delta.id}" deltaQuery="select id from item where last_modified &gt; '${dih.last_index_time}'"> </entity> </document> </dataConfig>
  20. 20. Basic searching • http://localhost:8983/solr/select?<query parameters> – q: query string – q.op: query operator (AND/OR) – sort: sort order (<field name> asc|desc) – qt: query type (request handler defined in solrconfig.xml) – start: start index (default: 0) – rows: number of rows included in the result (default: 10) – wt: response format (xml, json, javabin, python, etc.) – hl: whether to highlight – hl.fl: fields to highlight • Examples: q=video&fl=name,id • q=video&sort=price desc • q=video card&fl=name,id&hl=true&hl.fl=name,features • q=video&fl=name,id&wt=json • q=video&fl=name,id&start=20&rows=10
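
Query strings such as `q=video card` contain characters that must be URL-encoded before they are sent. The following Java sketch, using only the standard library, assembles a select URL from a parameter map. The class and method names are hypothetical, and the base URL is the example host from the slides.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

/** Builds a /select URL from query parameters; a sketch, not a Solr client. */
final class SelectUrlSketch {

    static String buildSelectUrl(String baseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl).append("/select");
        char separator = '?';
        try {
            for (Map.Entry<String, String> p : params.entrySet()) {
                // Form-encode both name and value (a space becomes '+').
                url.append(separator)
                   .append(URLEncoder.encode(p.getKey(), "UTF-8"))
                   .append('=')
                   .append(URLEncoder.encode(p.getValue(), "UTF-8"));
                separator = '&';
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM.
            throw new IllegalStateException(e);
        }
        return url.toString();
    }
}
```

Passing a `LinkedHashMap` keeps the parameters in insertion order, which makes the resulting URL easier to read in logs.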
  21. 21. Request handlers• solrconfig.xml <requestHandler name="standard" class="solr.SearchHandler" default="true"> <!-- default values for query parameters --> <lst name="defaults"> <str name="echoParams">explicit</str> <!-- <int name="rows">10</int> <str name="fl">*</str> <str name="version">2.1</str> --> </lst> </requestHandler>
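  22. 22. Dismax query • Disjunction Max query – Disjunction: the query is run against several fields at once – Max: each field's score is scaled by its weight, and the highest-scoring field determines the document score (the other fields contribute only through tie) • Query parameters – qf: field list with weights (e.g. subject^2.3 content tag^1.2) – mm, pf, ps, tie, bq, etc.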
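
The way the maximum and the tie parameter combine can be written down in a few lines. This Java sketch follows the disjunction-max rule used by Lucene (best field score plus tie times the sum of the remaining field scores); the class name is hypothetical and the per-field scores are assumed to be already boosted by qf.

```java
/** Combines per-field scores the way a disjunction-max query does. */
final class DisMaxScoreSketch {

    /**
     * @param fieldScores score of the same term in each field, already multiplied by its qf weight
     * @param tie         0.0 = pure maximum, 1.0 = plain sum of all fields
     */
    static double score(double[] fieldScores, double tie) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : fieldScores) {
            max = Math.max(max, s);
            sum += s;
        }
        // The best field counts fully; every other field counts at a fraction of its score.
        return max + tie * (sum - max);
    }
}
```

With field scores 2.0 and 0.5, a tie of 0.01 gives 2.005, so the weaker field only breaks ties between documents whose best field scores are equal.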
  23. 23. Dismax query handler• solrconfig.xml <requestHandler name="dismax" class="solr.SearchHandler" > <lst name="defaults"> <str name="defType">dismax</str> <str name="sort">regdttm desc</str> <float name="tie">0.01</float> <str name="qf">content^0.5 subject^2.0</str> <str name="pf">content^0.5 subject^2.0</str> <str name="fl">*</str> <bool name="hl">true</bool> <str name="hl.fl">subject content</str> </lst> </requestHandler>
  24. 24. More-like-this (MLT) search • Related-document search
  25. 25. MLT query • Query parameters – mlt: whether to run MLT (true/false) – mlt.count: number of MLT result rows (default: 5) – mlt.fl: fields used for the MLT search – mlt.qf: field weights
  26. 26. MLT handler • solrconfig.xml <requestHandler name="mlt" class="solr.MoreLikeThisHandler"> <lst name="defaults"> <int name="rows">5</int> <str name="fl">qid,subject</str> <bool name="mlt">true</bool> <str name="mlt.fl">subject,content</str> <bool name="mlt.boost">true</bool> <str name="mlt.qf">content^0.5 subject^2.0</str> </lst> </requestHandler>
  27. 27. Solr cores • Indexing several data sets • Single core – http://localhost:8983/solr/select?q=lucene • Multicore – http://localhost:8983/solr/news/select?q=ios – http://localhost:8983/solr/blog/select?q=android • solr.xml <solr sharedLib="lib"> <cores adminPath="/admin/cores"> <core name="news" instanceDir="news" /> <core name="blog" instanceDir="blog" /> </cores> </solr>
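  28. 28. PART II. Applications
  29. 29. 과민성장증후군 (irritable bowel syndrome)
  30. 30. 과민/성장/증후군? (wrong split: "hypersensitive / growth / syndrome")
  31. 31. 과민성/장/증후군 (correct split: "irritable / bowel / syndrome")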
  32. 32. • Building a medical-term dictionary – compounds.dic: 937 entries 가는근육잔섬유:가는,근육,잔,섬유 가려움약:가려움,약 가로무늬근:가로무늬,근 고알도스테론증:고,알도스테론,증 고요산혈증:고,요산,혈증 과민성장증후군:과민성,장,증후,군 근육위축가쪽경화증:근육,위축,가쪽,경화증 뇌없음증:뇌,없음,증 헛배부름:헛,배,부름 판토록:판토록 – extension.dic: 1,071 entries 가드너,10000X 가려움,10000X 가로막,10000X 가습기,10000X 가와사키,10000X 가족성,10000X 가학성,10000X 각기병,10000X 각화,10000X
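  33. 33. 만성간염 (chronic hepatitis)
  34. 34. Compound noun decomposition • Compound noun search – "만성간염" is decomposed into "만성" (chronic) and "간염" (hepatitis) – Search results: "B형간염", "A형간염", "간염", "만성간염" (any document containing one of the parts matches) • Korlucene modified to support AND search – before the modification – after the modification – See http://cafe.naver.com/korlucene/116
  35. 35. Keyword suggestion • Jamo decomposition: initial consonant (choseong), medial vowel (jungseong), final consonant (jongseong). Example: 기미 = ㄱ+ㅣ+ㅁ+ㅣ • Unicode Hangul syllable table – initial consonants: 19 – medial vowels: 21 – final consonants: 28 (including "no final")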
  36. 36. KoreanJasoFilter.java: int a, b, c; /* jamo indexes: initial, medial, final, in that order */ for (int i = 0; i < termLength; i++) { char ch = termBuffer[i]; if (ch >= 0xAC00 && ch <= 0xD7A3) { /* decompose only precomposed Hangul syllables, U+AC00 (가) to U+D7A3 (힣) */ c = ch - 0xAC00; a = c / (21 * 28); c = c % (21 * 28); b = c / 28; c = c % 28; buffer.append(ChoSung[a]).append(JwungSung[b]); if (c != 0) /* a nonzero c means the syllable has a final consonant */ buffer.append(JongSung[c]); } }
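
The index arithmetic in the KoreanJasoFilter excerpt can be tried outside Solr. The class below is a self-contained Java sketch of the same decomposition, not the seminar's filter. The class name is hypothetical, the jamo tables are assumed to be the Hangul compatibility jamo in standard Unicode order, and, unlike the excerpt, characters outside the Hangul syllable range are passed through unchanged.

```java
/** Splits precomposed Hangul syllables into compatibility jamo. */
final class JasoDecomposerSketch {

    // Jamo are written as Unicode escapes so the file compiles under any source encoding.
    // 19 initial consonants, in syllable-table order.
    private static final String CHO =
        "\u3131\u3132\u3134\u3137\u3138\u3139\u3141\u3142\u3143\u3145"
      + "\u3146\u3147\u3148\u3149\u314A\u314B\u314C\u314D\u314E";
    // 27 final consonants; final index 0 in the formula means "no final".
    private static final String JONG =
        "\u3131\u3132\u3133\u3134\u3135\u3136\u3137\u3139\u313A\u313B"
      + "\u313C\u313D\u313E\u313F\u3140\u3141\u3142\u3144\u3145\u3146"
      + "\u3147\u3148\u314A\u314B\u314C\u314D\u314E";
    // The 21 medial vowels are contiguous, starting at U+314F.
    private static final char FIRST_JUNG = '\u314F';

    static String decompose(String term) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char ch = term.charAt(i);
            if (ch < 0xAC00 || ch > 0xD7A3) {
                out.append(ch); // assumption: keep non-Hangul characters as they are
                continue;
            }
            int offset = ch - 0xAC00;
            int cho = offset / (21 * 28);         // initial consonant index
            int jung = (offset % (21 * 28)) / 28; // medial vowel index
            int jong = offset % 28;               // final consonant index, 0 = none
            out.append(CHO.charAt(cho)).append((char) (FIRST_JUNG + jung));
            if (jong != 0) {
                out.append(JONG.charAt(jong - 1));
            }
        }
        return out.toString();
    }
}
```

For example, `decompose("감기")` returns `ㄱㅏㅁㄱㅣ` and `decompose("기미")` returns `ㄱㅣㅁㅣ`.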
  37. 37. • schema.xml <fieldType name="jasoNgramFront" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="KoreanJasoFilterFactory"/> <filter class="solr.EdgeNGramFilterFactory" maxGramSize="50" minGramSize="1" side="front" /> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="KoreanJasoFilterFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType>
  38. 38. • index analyzer: 감기 1. JasoFilter: ㄱㅏㅁㄱㅣ 2. NgramFilter: ㄱ, ㄱㅏ, ㄱㅏㅁ, ㄱㅏㅁㄱ, ㄱㅏㅁㄱㅣ • query analyzer: 가 1. JasoFilter: ㄱㅏ
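
Suggestion works because the index side stores every prefix of the jamo string while the query side stores only the jamo string of what has been typed so far. The Java sketch below shows that prefix generation and the resulting match; the names are hypothetical and it only imitates what a front edge n-gram filter produces.

```java
import java.util.ArrayList;
import java.util.List;

/** Imitates front edge n-grams over an already decomposed jamo string. */
final class EdgeNGramSketch {

    /** Returns the prefixes of token with lengths minGram..maxGram (capped at the token length). */
    static List<String> frontEdgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<String>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    /** A typed prefix matches when it equals one of the indexed prefixes. */
    static boolean suggests(String indexedJaso, String typedJaso) {
        return frontEdgeNGrams(indexedJaso, 1, 50).contains(typedJaso);
    }
}
```

With the indexed jamo string `ㄱㅏㅁㄱㅣ` (감기), the typed jamo string `ㄱㅏ` (가) matches, whereas `ㅁㄱ` does not because it is not a prefix.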
  39. 39. jQuery autocomplete • http://jqueryui.com/autocomplete/ <script> $(function() { var availableTags = [ "Clojure", "Java", "JavaScript", "Scala" ]; $("#tags").autocomplete({source: availableTags}); }); </script> <label for="tags">Tags: </label> <input id="tags" />
  40. 40. Ponket
  41. 41. • category schema.xml <fields> <field name="categoryid" type="int" indexed="true" stored="true" required="true" /> <field name="title" type="string" indexed="false" stored="true"/> <field name="parentid" type="int" indexed="false" stored="true" /> <field name="tag" type="text_ko" indexed="true" stored="true" termVectors="true" /> <field name="regdttm" type="date" indexed="false" /> <field name="moddttm" type="date" indexed="false" /> </fields> • catedeal schema.xml <fields> <field name="dealid" type="int" indexed="true" stored="true" required="true" /> <field name="title" type="text_ko" indexed="true" stored="true" termVectors="true" /> <field name="description" type="text_ko" indexed="true" termVectors="true" stored="true" /> <field name="categoryid" type="int" indexed="false" stored="true" /> <field name="regdttm" type="date" indexed="false" /> <field name="moddttm" type="date" indexed="false" /> </fields>
  42. 42. Category Tag Data
  43. 43. • Category classification flowchart: Start → fetch deal data → search category → category found? Yes: save catedeal. No: search catedeal → group by categoryid and sum the scores → select the category with the highest score → save catedeal → End
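
The "group by categoryid, sum the scores, pick the highest" step of the flowchart is essentially a weighted vote among the most similar deals. A minimal Java sketch of that step follows; the `Hit` type and all names are hypothetical stand-ins for whatever the catedeal search returns.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Picks a category by summing the scores of similar deals per categoryid. */
final class CategoryVoteSketch {

    /** One search hit from the catedeal index (hypothetical shape). */
    static final class Hit {
        final int categoryId;
        final double score;

        Hit(int categoryId, double score) {
            this.categoryId = categoryId;
            this.score = score;
        }
    }

    /** Returns the categoryid with the highest summed score, or -1 when there are no hits. */
    static int pickCategory(List<Hit> hits) {
        Map<Integer, Double> totals = new LinkedHashMap<Integer, Double>();
        for (Hit h : hits) {
            totals.merge(h.categoryId, h.score, Double::sum); // group by categoryid, sum scores
        }
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Integer, Double> e : totals.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }
}
```

Summing rather than taking the single best hit lets several moderately similar deals in one category outvote one strong hit in another.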
  44. 44. Location search • Uses Solr spatial search • Schema configuration <fieldType name="latlon" class="solr.LatLonType" subFieldSuffix="_coordinate"/> <field name="location" type="latlon" indexed="true" stored="true"/> • Data <field name="location">37.775,-122.4232</field> <field name="location">40.7143,-74.006</field>
  45. 45. • geofilt – the distance filter – pt: latitude,longitude of the center point – sfield: the location field to search – d: distance (km) http://localhost:8983/solr/select?wt=json&...&q=*:* &fq={!geofilt pt=45.15,-93.85 sfield=location d=5} • Search result "response":{"numFound":8,"start":0,"docs":[ { "name":"Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133", "location":"45.17614,-93.87341"}, { "name":"Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300", "location":"45.17614,-93.87341"}, … • Java SolrQuery query = new SolrQuery(). setFilterQueries("{!geofilt pt="+spatial.getLocation()+ " sfield=location d="+spatial.getDistance()+"}");
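
Conceptually, geofilt keeps the documents whose great-circle distance from pt is at most d kilometers. The Java sketch below reproduces that check with the haversine formula so the numbers in the example can be verified by hand. The class name is hypothetical, the Earth radius is rounded to 6371 km, and Solr's own implementation may differ in detail.

```java
/** Great-circle distance check in the spirit of the geofilt filter. */
final class GeoFiltSketch {

    private static final double EARTH_RADIUS_KM = 6371.0; // rounded mean radius (assumption)

    /** Haversine distance in kilometers between two latitude/longitude points given in degrees. */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** True when the document point lies within d kilometers of the center point. */
    static boolean within(double ptLat, double ptLon, double docLat, double docLon, double d) {
        return distanceKm(ptLat, ptLon, docLat, docLon) <= d;
    }
}
```

The point 45.17614,-93.87341 from the search result lies roughly 3.4 km from pt=45.15,-93.85, which is why it passes the d=5 filter, while 40.7143,-74.006 is far outside it.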
  46. 46. Search?
  47. 47. It's not hard, as long as you have Solr…
  48. 48. Contacts • wolfkang@gmail.com • Q&A: http://cafe.naver.com/korlucene • http://facebook.com/kangdonghyeok
  49. 49. Thank you.