Nori: The Official Elasticsearch Plugin for Korean Language Analysis

Jim Ferenczi • Kiju Kim
Feb 22, 2019

!2
Kiju Kim
Sr. Support Engineer
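• "Korean, Chinese, and Japanese Search with Elasticsearch 6.2"
• "Multilingual Log Classification with Elasticsearch Machine Learning"
• "Which Korean Analyzer Should You Use?"
• Multilingual input method for Company S's system air conditioners
• Syntactic parsing using neural networks and dependency grammar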

Jim Ferenczi
Principal Software Engineer
• Elasticsearch
• Lucene PMC

King Sejong the Great, 1443
Image source: https://ko.wikipedia.org/wiki/훈민정음_언해

!6
Analyzers
Elasticsearch has 40+ Language Analyzers
Original Text ("The language of our country differs from that of China"):
PUT test/_doc/1
{
  "message": "나라의 말이 중국과 달라"
}
Standard Analyzer:
{
  "token" : "나라의",
  …
},
{
  "token" : "말이",
  …
},
…
Query:
GET test/_search
{
  "query": {
    "match": {
      "message": "나라"
    }
  }
}
No Hit!
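
The missing hit comes down to exact term comparison: a match query only finds a document when one of the analyzed query terms equals an indexed term. A minimal Python sketch of that comparison, assuming plain whitespace splitting as a rough stand-in for what the Standard Analyzer does with this sentence (the function names are illustrative, not Elasticsearch APIs):

```python
def whitespace_terms(text):
    """Rough stand-in for the Standard Analyzer on this sentence: split on spaces."""
    return text.split()


def matches(index_terms, query_terms):
    """A match query hits when at least one query term equals an indexed term."""
    return any(term in index_terms for term in query_terms)


document = "나라의 말이 중국과 달라"
index_terms = whitespace_terms(document)   # ['나라의', '말이', '중국과', '달라']
query_terms = whitespace_terms("나라")      # ['나라']

# '나라' is not equal to '나라의' (noun plus particle), so the query finds nothing.
print(matches(index_terms, query_terms))   # False
```

An analyzer that strips the particle and indexes 나라 on its own would make the same comparison succeed.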

!7
Open Source Korean Analyzers
나라의 말이 중국과 달라 => 나라, 말, 중국, 다르
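Seunjeon
• Authors: 유영호 (Yungho Yu), 이용운 (Yongwoon Lee)
• License: Apache 2.0
• A Korean morphological analyzer that runs on the JVM, built on mecab-ko-dic.
• Provides Java and Scala interfaces by default.
Arirang
• Authors: 이수명, 정호욱
• The project develops a Lucene KoreanAnalyzer, with the goal of increasing the use of Lucene in Korea.
Open-korean-text
• Author: 유호현
• Open-source Korean text processor (Official Fork of twitter-korean-text)
• A Korean text processor written in Scala. It currently supports text normalization, morphological analysis, and stemming.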

!8
Blog Post on Korean Analyzers
https://www.elastic.co/kr/blog/using-korean-analyzers

!9
Find a Way to Properly Support Korean!

!10
Journey to Nori
Timeline: Dec 2017 • Dec 2017 • Feb 2018 • Apr 2018 • Aug 2018
Milestones:
• Surveyed existing open source Korean analyzers
• Decided to develop a new Korean analyzer!
• First POC of Nori
• Machine Learning started to support Korean (6.2)
• Merged the initial prototype of Nori into Lucene: https://issues.apache.org/jira/browse/LUCENE-8231
• Announced Nori with Elasticsearch 6.4!

!12
"Writing characters in front of Confucius" (Korean proverb: showing off one's knowledge in front of a master)

!13
Nori, the genesis
• MeCab
  • An open source text segmentation library
  • Language agnostic
  • Dictionaries for Japanese (ChaSen, IPADIC, …)
• Lucene Kuromoji Analyzer
  • Based on MeCab and IPADIC
  • Implements the segmentation part of the MeCab library in Java
  • Viterbi

!14
Nori, the genesis
• mecab-ko-dic:
  • Created by Yongwoon Lee and Yungho Yu
  • Apache 2 License
  • A morphological dictionary for Korean language using MeCab
  • More than 800,000 entries:
    • 놀이,1781,3534,750,NNG,*,F,놀이,*,*,*,*
  • 3815 left ids, 2690 right ids (connection costs 3815*2690)
  • Used by Seunjeon
  • 200MB uncompressed
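
To make the entry format concrete, here is a small Python sketch that splits the sample line into named fields. The field names are an assumption based on the usual MeCab dictionary column order (surface, left id, right id, word cost, then feature columns). They do not come from the slides, and `Entry` and `parse_entry` are hypothetical helpers.

```python
import csv
from typing import NamedTuple


class Entry(NamedTuple):
    surface: str      # the word as it appears in text
    left_id: int      # context id used when connecting to the previous token
    right_id: int     # context id used when connecting to the next token
    word_cost: int    # cost of choosing this entry
    pos_tag: str      # part-of-speech tag, e.g. NNG
    features: tuple   # remaining feature columns, kept as raw strings


def parse_entry(line: str) -> Entry:
    """Parse one comma-separated dictionary line (assumed column order)."""
    fields = next(csv.reader([line]))
    return Entry(
        surface=fields[0],
        left_id=int(fields[1]),
        right_id=int(fields[2]),
        word_cost=int(fields[3]),
        pos_tag=fields[4],
        features=tuple(fields[5:]),
    )


entry = parse_entry("놀이,1781,3534,750,NNG,*,F,놀이,*,*,*,*")
print(entry.surface, entry.left_id, entry.right_id, entry.word_cost, entry.pos_tag)
```

The two ids index into the connection cost matrix, which is why the slide lists the number of left and right ids next to the entry count.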

!15
Nori, Binary Dictionary
• Finite State Transducer (FST):
  • Prefix and infix compression for Hangul and Hanja
  • UTF-16 encoding
  • 5.4MB
• Connection costs:
  • 3815 left ids, 2690 right ids (connection costs 3815*2690)
  • One short (16 bits) per cell
  • 20MB loaded in a direct byte buffer outside of the heap
• Feature encoding:
  • Custom binary encoding
  • 9 bytes per entry (7MB total)
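
The sizes quoted above can be checked with a few lines of arithmetic. This Python sketch only uses numbers from the slides. The entry count is given as "more than 800,000", so the last figure is a lower bound.

```python
LEFT_IDS = 3815
RIGHT_IDS = 2690
BYTES_PER_CELL = 2            # one 16-bit short per connection cost

cells = LEFT_IDS * RIGHT_IDS
matrix_bytes = cells * BYTES_PER_CELL
print(cells)                           # 10262350
print(round(matrix_bytes / 1e6, 1))    # 20.5 (MB), the "20MB" on the slide

ENTRIES = 800_000             # "more than 800,000" entries
BYTES_PER_ENTRY = 9
feature_bytes = ENTRIES * BYTES_PER_ENTRY
print(round(feature_bytes / 1e6, 1))   # 7.2 (MB), the "7MB total" on the slide
```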

!16
Nori, morphological analysis
• Viterbi algorithm
• Lattice:
  • Find the best possible segmentation
  • Backtrace when only one path is alive
• Example:
  • 21세기 세종계획
  • 21 + 세기 (century) + 세종 (Sejong) + 계획 (Plan)
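
A minimal Python sketch of lattice search with the Viterbi algorithm, applied to the example above. It is not Nori's implementation: the lexicon, tags, and every cost are invented for illustration. Whitespace is handled by segmenting each chunk separately, and the backtrace runs once at the end of each chunk rather than as soon as a single path is alive.

```python
# Toy lexicon: surface -> (POS tag, word cost). Every number here is invented.
LEXICON = {
    "21": ("SN", 1),
    "세기": ("NNG", 2),
    "세종": ("NNP", 2),
    "계획": ("NNG", 2),
    "세": ("NNG", 6),
    "기": ("NNG", 6),
    "종": ("NNG", 6),
    "계": ("NNG", 6),
    "획": ("NNG", 6),
}
# Toy connection costs between the tags of adjacent tokens (also invented).
CONNECTION = {("SN", "NNG"): 0, ("NNP", "NNG"): 0, ("NNG", "NNG"): 1}
DEFAULT_CONNECTION = 3


def segment(chunk):
    """Return the cheapest segmentation of a chunk that contains no whitespace."""
    n = len(chunk)
    # best[end][tag] = (path cost, start of last token, tag of previous token)
    best = [{} for _ in range(n + 1)]
    best[0]["BOS"] = (0, None, None)
    for start in range(n):
        for end in range(start + 1, n + 1):
            entry = LEXICON.get(chunk[start:end])
            if entry is None:
                continue
            tag, word_cost = entry
            # Extend every path that ends at `start` with this lexicon entry.
            for prev_tag, (prev_cost, _, _) in best[start].items():
                if prev_tag == "BOS":
                    connection = 0
                else:
                    connection = CONNECTION.get((prev_tag, tag), DEFAULT_CONNECTION)
                cost = prev_cost + connection + word_cost
                if tag not in best[end] or cost < best[end][tag][0]:
                    best[end][tag] = (cost, start, prev_tag)
    if not best[n]:
        raise ValueError(f"no segmentation found for {chunk!r}")
    # Backtrace from the cheapest state at the end of the chunk.
    end, tag = n, min(best[n], key=lambda t: best[n][t][0])
    tokens = []
    while end > 0:
        _, start, prev_tag = best[end][tag]
        tokens.append(chunk[start:end])
        end, tag = start, prev_tag
    return tokens[::-1]


def tokenize(text):
    """Segment each whitespace-separated chunk independently (a simplification)."""
    return [token for chunk in text.split() for token in segment(chunk)]


print(tokenize("21세기 세종계획"))  # ['21', '세기', '세종', '계획']
```

Keeping one best path per (position, tag) pair is enough because, in this model, the cost of anything that follows depends only on the tag of the last token.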

!18
Nori, user dictionary
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary_rules": ["c++", "C샤프", "세종", "세종시 세종 시"]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}
Additional nouns (NNG)
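
The rules in the request above have two shapes: a single word adds a noun, and a word followed by more words ("세종시 세종 시") adds a compound together with its decomposition. The Python sketch below illustrates how such rules can be read and how the three `decompound_mode` values differ. `parse_rule` and `emit` are hypothetical helpers written for this example, not part of Nori or Elasticsearch.

```python
def parse_rule(rule):
    """Split one rule into (surface form, decomposition)."""
    surface, *parts = rule.split()
    # A rule without extra words is a simple noun: it decomposes into itself.
    return surface, parts or [surface]


rules = ["c++", "C샤프", "세종", "세종시 세종 시"]
user_dict = dict(parse_rule(rule) for rule in rules)


def emit(surface, mode="mixed"):
    """Tokens produced for a user dictionary entry under a decompound mode."""
    parts = user_dict[surface]
    is_compound = parts != [surface]
    if not is_compound or mode == "none":
        return [surface]              # keep the word as a single token
    if mode == "discard":
        return parts                  # only the decomposed parts
    return [surface] + parts          # "mixed": the compound and its parts


print(emit("세종시", "mixed"))    # ['세종시', '세종', '시']
print(emit("세종시", "discard"))  # ['세종', '시']
print(emit("c++", "mixed"))       # ['c++']
```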

!19
Nori, custom dictionary
• If your domain-specific vocabulary is big (several thousand entries), rebuilding the original dictionary with your extra rules can:
  • Lower the memory usage compared to the user dictionary approach
  • Speed up the creation of the analyzer/tokenizer
  • Speed up the analysis
• How to create a distribution that uses a custom dictionary:
  • https://github.com/jimczi/nori/blob/master/how-to-custom-dict.asciidoc

!20
Additional filters
• nori_part_of_speech token filter:
  • Removes tokens that match a set of part-of-speech tags
  • List of part of speech tags:
    • http://lucene.apache.org/core/7_7_0/analyzers-nori/org/apache/lucene/analysis/ko/POS.Tag.html
  • Defaults to:
"stoptags": [
  "E",
  "IC",
  "J",
  "MAG", "MAJ", "MM",
  "SP", "SSC", "SSO", "SC", "SE",
  "XPN", "XSA", "XSN", "XSV",
  "UNA", "NA", "VSV"
]
Part of speech filter
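
The effect of the filter can be sketched in Python as a set lookup on each token's tag. The stop tags below are the defaults listed above. The tagged tokens are a hand-written assumption about how the earlier sample sentence might be segmented, not real tokenizer output.

```python
DEFAULT_STOPTAGS = {
    "E", "IC", "J",
    "MAG", "MAJ", "MM",
    "SP", "SSC", "SSO", "SC", "SE",
    "XPN", "XSA", "XSN", "XSV",
    "UNA", "NA", "VSV",
}

# Hypothetical (token, tag) pairs for "나라의 말이 중국과 달라".
tagged_tokens = [
    ("나라", "NNG"), ("의", "J"),
    ("말", "NNG"), ("이", "J"),
    ("중국", "NNP"), ("과", "J"),
    ("다르", "VA"), ("아", "E"),
]


def remove_stoptags(tokens, stoptags=DEFAULT_STOPTAGS):
    """Drop every token whose part-of-speech tag is in the stop set."""
    return [token for token, tag in tokens if tag not in stoptags]


print(remove_stoptags(tagged_tokens))  # ['나라', '말', '중국', '다르']
```

With particles (J) and endings (E) removed, what remains is the 나라, 말, 중국, 다르 output shown on the open source analyzers slide.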

!21
Additional filters
• nori_readingform token filter:
• Example: 鄕歌 => 향가 (hyang-ga)
Hanja to Hangul
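
Conceptually, the filter replaces a Hanja token with its Hangul reading and leaves other tokens alone. A tiny Python sketch, assuming a one-entry reading table in place of the readings stored in the real dictionary:

```python
# Illustrative reading table; in Nori the readings come from the dictionary.
READINGS = {"鄕歌": "향가"}


def reading_form(tokens):
    """Replace each token by its reading when one is known."""
    return [READINGS.get(token, token) for token in tokens]


print(reading_form(["鄕歌", "나무"]))  # ['향가', '나무']
```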

!22
Test the analysis
GET _analyze
{
  "tokenizer": "nori_tokenizer",
  "text": "뿌리가 깊은 나무는",
  "attributes" : ["posType", "leftPOS", "rightPOS", "morphemes", "reading"],
  "explain": true
}
Debug ?

!23
Test the analysis
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "nori_tokenizer",
      "tokens": [
        {
          "token": "뿌리",
          "start_offset": 0,
          "end_offset": 2,
          "type": "word",
          "position": 0,
          "leftPOS": "NNG(General Noun)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "NNG(General Noun)"
        }, …
Debug ?
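
When debugging, it often helps to reduce the explain output to one line per token. The Python sketch below does that for a response shaped like the excerpt above. The `response` literal holds only the single token shown on the slide, and `summarize` is a hypothetical helper.

```python
# The excerpt from the slide as a Python dict (JSON null becomes None).
response = {
    "detail": {
        "custom_analyzer": True,
        "charfilters": [],
        "tokenizer": {
            "name": "nori_tokenizer",
            "tokens": [
                {
                    "token": "뿌리",
                    "start_offset": 0,
                    "end_offset": 2,
                    "type": "word",
                    "position": 0,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": None,
                    "posType": "MORPHEME",
                    "reading": None,
                    "rightPOS": "NNG(General Noun)",
                },
            ],
        },
    }
}


def summarize(resp):
    """Return (token, leftPOS, posType) for every tokenizer token."""
    tokens = resp["detail"]["tokenizer"]["tokens"]
    return [(t["token"], t["leftPOS"], t["posType"]) for t in tokens]


for row in summarize(response):
    print(*row, sep="\t")
```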

!24
Nori, future
• Keep N best segmentations
• Upgrade to the latest mecab-ko-dic
• Find longest token in the user dictionary
• Community:
  • Lucene Jira: https://issues.apache.org/jira/
  • Elasticsearch Github: https://github.com/elastic/elasticsearch
  • Your idea here!
New features and community
