Jim Ferenczi • Kiju Kim
Feb 22, 2019
Nori: The Official Elasticsearch Plugin for Korean Language Analysis
!2
Kiju Kim
Sr. Support Engineer
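• "Searching Korean, Chinese, and Japanese with Elasticsearch 6.2"
• "Multilingual log classification using Elasticsearch Machine Learning"
• "Which Korean analyzer should you use?"
• Multilingual input method for Company S's system air conditioners
• Syntactic parsing using neural networks and dependency grammar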
Jim Ferenczi
Principal Software Engineer
• Elasticsearch
• Lucene PMC
!4
The year is 1443
King Sejong the Great, 1443
Image source: https://ko.wikipedia.org/wiki/훈민정음_언해
!6
Analyzers
Elasticsearch has 40+ Language Analyzers
Original Text:
PUT test/_doc/1
{
  "message": "나라의 말이 중국과 달라"
}

Standard Analyzer:
{
  "token" : "나라의",
  …
},
{
  "token" : "말이",
  …
},
…

Query:
GET test/_search
{
  "query": {
    "match": {
      "message": "나라"
    }
  }
}

No Hit!
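The miss on this slide can be reproduced with a small sketch. It assumes that, for this sentence, the standard analyzer behaves like a plain whitespace split (a simplification), and it uses a hypothetical in-memory inverted index rather than Elasticsearch itself.

```python
from collections import defaultdict


def whitespace_terms(text):
    # Simplified stand-in for the standard analyzer on this Korean sentence:
    # particles stay attached to the nouns they follow.
    return text.split()


def build_index(docs):
    # Map each indexed term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in whitespace_terms(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    # A match query hits only if a query term equals an indexed term exactly.
    hits = set()
    for term in whitespace_terms(query):
        hits |= index.get(term, set())
    return hits


docs = {1: "나라의 말이 중국과 달라"}
index = build_index(docs)

no_hit = search(index, "나라")       # the bare noun was never indexed
exact_hit = search(index, "나라의")  # only the surface form with its particle matches
print(no_hit, exact_hit)
```

The noun 나라 only becomes searchable once the analyzer splits 나라의 into the noun and its particle, which is what a morphological analyzer provides.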
!7
Open Source Korean Analyzers
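나라의 말이 중국과 달라 ("The language of our country differs from that of China") => 나라, 말, 중국, 다르
Seunjeon
• Authors: 유영호 (Yungho Yu), 이용운 (Yongwoon Lee)
• License: Apache 2.0
• A Korean morphological analyzer that runs on the JVM, built on mecab-ko-dic.
• Provides Java and Scala interfaces by default.
Arirang
• Authors: 이수명, 정호욱
• License: Apache 2.0
• This project develops a Lucene KoreanAnalyzer, with the aim of increasing the adoption of Lucene in Korea.
Open-korean-text
• Author: 유호현
• License: Apache 2.0
• Open source Korean text processor (Official Fork of twitter-korean-text)
• A Korean text processor written in Scala. It currently supports text normalization, morphological analysis, and stemming.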
!8
Blog Post on Korean Analyzers
https://www.elastic.co/kr/blog/using-korean-analyzers
!9
Find a Way to Properly Support Korean!
!10
Journey to Nori
• Dec 2017: Surveyed existing open source Korean analyzers
• Dec 2017: First POC of Nori
• Feb 2018: Machine Learning started to support Korean (6.2)
• Apr 2018: Merged the initial prototype of Nori into Lucene
• Aug 2018: Announced Nori with Elasticsearch 6.4!
https://issues.apache.org/jira/browse/LUCENE-8231
Decided to develop a new Korean analyzer!
!11
Nori
"놀이" (Korean for "play")
!12
"Writing characters in front of Confucius" (a Korean proverb about showing off in front of an expert)
!13
Nori, the genesis
• MeCab
• An open source text segmentation library
• Language agnostic
• Dictionaries for Japanese (ChaSen, IPADIC, …)
• Lucene Kuromoji Analyzer
• Based on MeCab and IPADIC
• Implements the segmentation part of the MeCab library in Java
• Viterbi
!14
Nori, the genesis
• mecab-ko-dic:
• Created by Yongwoon Lee and Yungho Yu
• Apache 2 License
• A morphological dictionary for the Korean language using MeCab
• More than 800,000 entries:
• 놀이,1781,3534,750,NNG,*,F,놀이,*,*,*,*
• 3815 left ids, 2690 right ids (connection costs 3815*2690)
• Used by Seunjeon
• 200MB uncompressed
!15
Nori, Binary Dictionary
• Finite State Transducer (FST):
• Prefix and infix compression for Hangul and Hanja
• UTF-16 encoding
• 5.4MB
• Connection costs:
• 3815 left ids, 2690 right ids (connection costs 3815*2690)
• One short (16 bits) per cell
• 20MB loaded in a direct byte buffer outside of the heap
• Feature encoding:
• Custom binary encoding
• 9 bytes per entry (7MB total)
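The sizes on this slide can be checked with a little arithmetic. The sketch below uses only figures from the slides; the entry count is taken as the stated lower bound of 800,000, so the feature size is approximate.

```python
LEFT_IDS = 3815
RIGHT_IDS = 2690
BYTES_PER_COST = 2  # one short (16 bits) per cell

# Connection cost matrix: one cell per (left id, right id) pair.
cells = LEFT_IDS * RIGHT_IDS
matrix_bytes = cells * BYTES_PER_COST
matrix_mib = matrix_bytes / 2**20

# Feature encoding: 9 bytes per dictionary entry.
ENTRIES = 800_000  # "more than 800,000", used here as a lower bound
feature_bytes = ENTRIES * 9

print(cells)                        # 10262350
print(matrix_bytes)                 # 20524700
print(round(matrix_mib, 1))         # 19.6
print(feature_bytes / 1_000_000)    # 7.2
```

Both results are consistent with the roughly 20MB and 7MB quoted above.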
!16
Nori, morphological analysis
• Viterbi algorithm
• Lattice:
• Find the best possible segmentation
• Backtrace when only one path is alive
• Example:
• 21세기 세종계획
• 21 + 세기 (century) + 세종 (Sejong) + 계획 (Plan)
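A toy version of the lattice search may help make the example concrete. Everything below is an illustrative assumption: the dictionary, the word costs, and the connection costs are invented, each whitespace-separated chunk is segmented on its own, and the backtrace runs once at the end of the chunk rather than as soon as a single path remains alive. It is a sketch of the Viterbi idea, not Nori's implementation.

```python
# Toy dictionary: surface form -> (part-of-speech tag, word cost).
# All costs are invented for illustration; lower is better.
DICTIONARY = {
    "21": ("SN", 100),
    "세기": ("NNG", 200),
    "세": ("NNG", 900),
    "기": ("NNG", 900),
    "세종": ("NNP", 200),
    "종": ("NNG", 900),
    "계획": ("NNG", 200),
    "계": ("NNG", 900),
    "획": ("NNG", 900),
}

# Toy connection costs between the previous tag and the next tag.
CONNECTION = {("SN", "NNG"): 10, ("NNG", "NNP"): 50, ("NNP", "NNG"): 10}
DEFAULT_CONNECTION = 100


def viterbi_chunk(chunk):
    """Return the lowest-cost segmentation of one chunk as (surface, tag) pairs."""
    n = len(chunk)
    # best[i][tag] = (total cost, start offset of last token, previous tag)
    best = [dict() for _ in range(n + 1)]
    best[0]["BOS"] = (0, None, None)
    for start in range(n):
        for prev_tag, (prev_cost, _, _) in best[start].items():
            for end in range(start + 1, n + 1):
                entry = DICTIONARY.get(chunk[start:end])
                if entry is None:
                    continue
                tag, word_cost = entry
                connection = CONNECTION.get((prev_tag, tag), DEFAULT_CONNECTION)
                cost = prev_cost + word_cost + connection
                if tag not in best[end] or cost < best[end][tag][0]:
                    best[end][tag] = (cost, start, prev_tag)
    if not best[n]:
        raise ValueError(f"no segmentation found for {chunk!r}")
    # Backtrace from the cheapest state at the end of the chunk.
    tag = min(best[n], key=lambda t: best[n][t][0])
    tokens, end = [], n
    while end > 0:
        _, start, prev_tag = best[end][tag]
        tokens.append((chunk[start:end], tag))
        end, tag = start, prev_tag
    return tokens[::-1]


def segment(text):
    # Segment each whitespace-separated chunk independently (a simplification).
    tokens = []
    for chunk in text.split():
        tokens.extend(viterbi_chunk(chunk))
    return tokens


result = segment("21세기 세종계획")
print(result)
```

With these toy costs the cheapest path is 21 + 세기 + 세종 + 계획, matching the slide; single-syllable alternatives such as 세 + 기 lose because their word and connection costs add up to more.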
!17
!18
Nori, user dictionary
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary_rules": ["c++", "C샤프", "세종", "세종시 세종 시"]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}
Additional nouns (NNG)
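The rules in the request above come in two shapes: a single noun, or a compound followed by its whitespace-separated parts (as in "세종시 세종 시"). The sketch below is a hypothetical helper, not Nori code. It parses that format and shows which tokens each decompound_mode would keep for a compound, following the documented meaning of none, discard, and mixed. Positions and offsets are ignored.

```python
RULES = ["c++", "C샤프", "세종", "세종시 세종 시"]


def parse_user_rule(rule):
    """Split one rule into (surface form, list of decomposition parts)."""
    fields = rule.split()
    if not fields:
        raise ValueError("empty user dictionary rule")
    return fields[0], fields[1:]


def emitted_tokens(surface, parts, decompound_mode):
    """Tokens kept for one dictionary entry under the given decompound_mode."""
    if not parts or decompound_mode == "none":
        return [surface]                 # simple noun, or compound kept whole
    if decompound_mode == "discard":
        return list(parts)               # only the parts
    if decompound_mode == "mixed":
        return [surface] + list(parts)   # the compound and its parts
    raise ValueError(f"unknown decompound_mode: {decompound_mode}")


parsed = [parse_user_rule(rule) for rule in RULES]
compound = parsed[3]
mixed = emitted_tokens(*compound, "mixed")
discard = emitted_tokens(*compound, "discard")
print(parsed)
print(mixed, discard)
```

With "mixed", as configured on the slide, a search for either 세종시 or 세종 can match a document containing 세종시.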
!19
Nori, custom dictionary
• If your domain-specific vocabulary is large (several thousand entries), rebuilding the original dictionary with your extra rules can:
• Lower the memory usage compared to the user dictionary approach
• Speed up the creation of the analyzer/tokenizer
• Speed up the analysis
• How to create a distribution that uses a custom dictionary:
• https://github.com/jimczi/nori/blob/master/how-to-custom-dict.asciidoc
!20
Additional filters
• nori_part_of_speech token filter:
• Removes tokens that match a set of part-of-speech tags
• List of part of speech tags:
• http://lucene.apache.org/core/7_7_0/analyzers-nori/org/apache/lucene/analysis/ko/POS.Tag.html
• Defaults to:
"stoptags": [
  "E",
  "IC",
  "J",
  "MAG", "MAJ", "MM",
  "SP", "SSC", "SSO", "SC", "SE",
  "XPN", "XSA", "XSN", "XSV",
  "UNA", "NA", "VSV"
]
Part of speech filter
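The effect of the filter can be sketched in a few lines. The stop tags are the defaults listed above. The tagged tokens are hand-assigned for illustration (nouns as NNG, particles as J) rather than real tokenizer output, and filter_by_pos is a hypothetical helper.

```python
DEFAULT_STOPTAGS = frozenset([
    "E", "IC", "J",
    "MAG", "MAJ", "MM",
    "SP", "SSC", "SSO", "SC", "SE",
    "XPN", "XSA", "XSN", "XSV",
    "UNA", "NA", "VSV",
])


def filter_by_pos(tokens, stoptags=DEFAULT_STOPTAGS):
    """Drop every (surface, tag) pair whose tag is in the stop set."""
    return [(surface, tag) for surface, tag in tokens if tag not in stoptags]


# Hand-tagged tokens for "나라의 말이" (illustrative assumption).
tokens = [("나라", "NNG"), ("의", "J"), ("말", "NNG"), ("이", "J")]
kept = filter_by_pos(tokens)
print(kept)
```

Only the two nouns survive, which is why a query for 나라 can now match the document.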
!21
Additional filters
• nori_readingform token filter:
• Example: 鄕歌 => 향가 (hyang-ga)
Hanja to Hangul
!22
Test the analysis
GET _analyze
{
  "tokenizer": "nori_tokenizer",
  "text": "뿌리가 깊은 나무는",
  "attributes" : ["posType", "leftPOS", "rightPOS", "morphemes", "reading"],
  "explain": true
}
Debug ?
!23
Test the analysis
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "nori_tokenizer",
      "tokens": [
        {
          "token": "뿌리",
          "start_offset": 0,
          "end_offset": 2,
          "type": "word",
          "position": 0,
          "leftPOS": "NNG(General Noun)",
          "morphemes": null,
          "posType": "MORPHEME",
          "reading": null,
          "rightPOS": "NNG(General Noun)"
        }, …
Debug ?
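When debugging, it is often enough to see each token with its part of speech. The sketch below assumes the explain response has already been parsed into a Python dict with the shape shown above. The summarize_tokens helper is hypothetical, and the sample holds only the single token visible on the slide.

```python
def summarize_tokens(response):
    """Return (token, leftPOS, posType) for each tokenizer token in an explain response."""
    tokens = response["detail"]["tokenizer"]["tokens"]
    return [(t["token"], t.get("leftPOS"), t.get("posType")) for t in tokens]


# First token of the response shown on the slide; the rest is elided there.
sample_response = {
    "detail": {
        "custom_analyzer": True,
        "charfilters": [],
        "tokenizer": {
            "name": "nori_tokenizer",
            "tokens": [
                {
                    "token": "뿌리",
                    "start_offset": 0,
                    "end_offset": 2,
                    "type": "word",
                    "position": 0,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": None,
                    "posType": "MORPHEME",
                    "reading": None,
                    "rightPOS": "NNG(General Noun)",
                }
            ],
        },
    }
}

summary = summarize_tokens(sample_response)
print(summary)
```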
!24
Nori, future
• Keep N best segmentations
• Upgrade to the latest mecab-ko-dic
• Find longest token in the user dictionary
• Community:
• Lucene Jira: https://issues.apache.org/jira/
• Elasticsearch GitHub: https://github.com/elastic/elasticsearch
• Your idea here !
New features and community
!25
Thank you
