October 13-16, 2016 • Austin, TX
An Introduction to NLP4L: 

Natural Language Processing Tool for Apache Lucene
Tomoko Uchida
Consultant, Rondhuit Co. Ltd.
Presentation slides at Lucene/Solr Revolution 2015.

Published in: Technology

An Introduction to NLP4L

  1. October 13-16, 2016 • Austin, TX
  2. An Introduction to NLP4L: Natural Language Processing Tool for Apache Lucene • Tomoko Uchida, Consultant, Rondhuit Co. Ltd.
  3. Who am I • Tomoko Uchida (@moco_beta) • Luke (Lucene Toolbox) collaborator (2015 ~) • https://github.com/DmitryKey/luke • The best-known tool for debugging and learning Lucene/Solr and Elasticsearch indexes :-)
  4. Agenda • What’s NLP4L? • How NLP improves search experience • Calculate probabilities using ShingleFilter • Transliteration (Application for HMM) • NLP4L Framework (coming soon)
  8. What’s NLP4L? • GOAL: Improve Lucene users’ search experience • FEATURES: Use of a Lucene index as a corpus database; Lucene API front-end written in Scala • NLP4L provides: preprocessors for existing ML tools; ML algorithms and applications (e.g. Transliteration)
  9. Agenda • What’s NLP4L? • How NLP improves search experience • Calculate probabilities using ShingleFilter • Transliteration (Application for HMM) • NLP4L Framework (coming soon)
  10. Evaluation Measures • [Figure: overlapping "target" and "result" sets; tp = in both, fp = in result only, fn = in target only, tn = in neither] • precision = tp / (tp + fp) • recall = tp / (tp + fn)
  14. Recall, Precision • [Figure: "target" and "result" sets with regions tp, fp, fn, tn] • precision = tp / (tp + fp) • recall = tp / (tp + fn)
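The two formulas on the Evaluation Measures slides can be tried out with plain Scala sets. This is only a minimal sketch: the object name `EvaluationMeasures` and the document ids are made up for illustration and are not part of NLP4L.

```scala
// Hypothetical helper, not part of NLP4L: precision and recall from two sets of document ids.
object EvaluationMeasures {
  /** tp / (tp + fp): the share of returned documents that are relevant. */
  def precision[A](target: Set[A], result: Set[A]): Double =
    if (result.isEmpty) 0.0 else (target & result).size.toDouble / result.size

  /** tp / (tp + fn): the share of relevant documents that were returned. */
  def recall[A](target: Set[A], result: Set[A]): Double =
    if (target.isEmpty) 0.0 else (target & result).size.toDouble / target.size
}

// Example: four relevant documents, three returned, two of them relevant.
val target = Set(1, 2, 3, 4)
val result = Set(3, 4, 5)
println(EvaluationMeasures.precision(target, result)) // 2 of 3 returned are relevant
println(EvaluationMeasures.recall(target, result))    // 2 of 4 relevant were returned
```

Returning 0.0 for an empty set is a convention chosen here to avoid division by zero; other conventions are possible.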
  19. Solution • To improve recall: n-gram, synonym dictionary, etc. (e.g. Transliteration) • To improve precision: facet (filter query) (e.g. Named Entity Extraction) • To improve recall and precision: Ranking Tuning
  20. Gradual precision improvement • q=watch • [Figure: target and result sets]
  21. Gradual precision improvement • filter by Gender=Men's • [Figure: target and result sets]
  22. Gradual precision improvement • filter by Gender=Men's • filter by Price=100-150 • [Figure: target and result sets]
  23. Structured Documents
      ID | product | price | gender
      1 | CURREN New Men's Date Stainless Steel Military Sport Quartz Wrist Watch | 8.92 | Men's
      2 | Suiksilver The Gamer Watch | 87.99 | Men's
  24. Unstructured Documents
      ID | article
      1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels.
      2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants.
  25. Make them Structured • Named Entity Extraction (NEE) extracts interesting words.
      ID | article | person | org | loc
      1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels. | David Cameron | EU | Brussels
      2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants. | | EU | UK, Britain
  26. Agenda • What’s NLP4L? • How NLP improves search experience • Calculate probabilities using ShingleFilter • Transliteration (Application for HMM) • NLP4L Framework (coming soon)
  30. Language Model • LM represents the fluency of language • The N-gram model is the most widely used LM • Calculation example for a 2-gram: P(apple | an) = totalTermFreq("word2g", "an apple") / totalTermFreq("word", "an")
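To make the 2-gram calculation concrete, here is a small sketch that counts unigrams and bigrams with plain Scala collections instead of a Lucene index. It assumes the two `totalTermFreq` values on the slide are the bigram and unigram counts from a shingle field and a word field. The object name `BigramModel` and the lowercased toy sentences (the same three sentences the deck uses later as its tagging corpus) are illustrative choices, not NLP4L code.

```scala
// Hypothetical stand-in for the two totalTermFreq lookups: plain in-memory counts.
object BigramModel {
  // Toy corpus, lowercased and split on spaces (punctuation dropped for simplicity).
  val corpus: Seq[Seq[String]] = Seq(
    "alice ate an apple",
    "mike likes an orange",
    "an apple is red"
  ).map(_.split(" ").toSeq)

  // Plays the role of totalTermFreq("word", w).
  val unigramCounts: Map[String, Int] =
    corpus.flatten.groupBy(identity).map { case (w, ws) => w -> ws.size }

  // Plays the role of totalTermFreq("word2g", "w1 w2").
  val bigramCounts: Map[String, Int] =
    corpus
      .flatMap(_.sliding(2).filter(_.size == 2).map(_.mkString(" ")))
      .groupBy(identity)
      .map { case (b, bs) => b -> bs.size }

  /** Maximum-likelihood estimate of P(word | prev) = count(prev word) / count(prev). */
  def probability(prev: String, word: String): Double = {
    val denominator = unigramCounts.getOrElse(prev, 0)
    if (denominator == 0) 0.0
    else bigramCounts.getOrElse(s"$prev $word", 0).toDouble / denominator
  }
}

// "an" occurs 3 times and "an apple" twice, so the estimate is 2/3.
println(BigramModel.probability("an", "apple"))
```

No smoothing is applied, so unseen bigrams get probability 0.0; a real language model would normally smooth these estimates.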
  31. Part-of-Speech Tagging • Our corpus for training:
      Alice/NNP ate/VB an/AT apple/NNP ./.
      Mike/NNP likes/VB an/AT orange/NNP ./.
      An/AT apple/NNP is/VB red/JJ ./.
      Tags: NNP = proper noun, singular; VB = verb; AT = article; JJ = adjective; . = period
  32. Hidden Markov Model
  33. Hidden Markov Model • Series of Words
  34. Hidden Markov Model • Series of Part-of-Speech
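  37. HMM state diagram
      Initial probabilities: NNP 0.667, VB 0.0, . 0.0, JJ 0.0, AT 0.333
      Transition probabilities (as implied by the training corpus): AT → NNP 1.0, JJ → . 1.0, NNP → . 0.4, NNP → VB 0.6, VB → AT 0.667, VB → JJ 0.333
      Emission probabilities: NNP: alice 0.2, apple 0.4, mike 0.2, orange 0.2 • VB: ate 0.333, is 0.333, likes 0.333 • AT: an 1.0 • JJ: red 1.0 • .: . 1.0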
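The numbers in the state diagram can be reproduced from the three tagged training sentences by simple relative-frequency counting. The sketch below uses plain Scala collections; it is not how NLP4L stores its HMM (the demo later in the deck uses a Lucene index), and the object name `HmmEstimation` and the `START` marker are hypothetical.

```scala
// Hypothetical illustration: maximum-likelihood HMM parameters from the tagged toy corpus.
object HmmEstimation {
  // (word, tag) pairs, lowercased, one sequence per training sentence.
  val corpus: Seq[Seq[(String, String)]] = Seq(
    Seq("alice" -> "NNP", "ate" -> "VB", "an" -> "AT", "apple" -> "NNP", "." -> "."),
    Seq("mike" -> "NNP", "likes" -> "VB", "an" -> "AT", "orange" -> "NNP", "." -> "."),
    Seq("an" -> "AT", "apple" -> "NNP", "is" -> "VB", "red" -> "JJ", "." -> ".")
  )

  /** Turns (condition, outcome) events into relative frequencies per condition. */
  private def normalize(events: Seq[(String, String)]): Map[String, Map[String, Double]] =
    events.groupBy(_._1).map { case (cond, es) =>
      cond -> es.groupBy(_._2).map { case (out, os) => out -> os.size.toDouble / es.size }
    }

  // P(first tag of a sentence)
  val initial: Map[String, Double] =
    normalize(corpus.map(s => "START" -> s.head._2))("START")

  // P(next tag | current tag), counted over adjacent tag pairs within each sentence
  val transition: Map[String, Map[String, Double]] =
    normalize(corpus.flatMap(s => s.map(_._2).sliding(2).collect { case Seq(a, b) => a -> b }))

  // P(word | tag)
  val emission: Map[String, Map[String, Double]] =
    normalize(corpus.flatten.map { case (word, tag) => tag -> word })
}

println(HmmEstimation.initial)            // NNP about 0.667, AT about 0.333
println(HmmEstimation.transition("NNP"))  // VB 0.6, "." 0.4
println(HmmEstimation.emission("NNP"))    // alice 0.2, apple 0.4, mike 0.2, orange 0.2
```

Tags that never start a sentence (VB, JJ, ".") are simply absent from `initial`, which corresponds to the 0.0 entries on the slide.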
  38. Agenda • What’s NLP4L? • How NLP improves search experience • Calculate probabilities using ShingleFilter • Transliteration (Application for HMM) • NLP4L Framework (coming soon)
  39. Transliteration • Transliteration is the process of transcribing letters or words from one alphabet into another to facilitate comprehension and pronunciation for non-native speakers.
  40. Transliteration • Examples of transliteration from English to Japanese: computer → コンピューター, server → サーバー, internet → インターネット, mouse → マウス, information → インフォメーション
  41. It helps improve recall • You search for the English word "mouse"
  42. It helps improve recall • but you get マウス (= mouse), highlighted in Japanese
  43. Training data in NLP4L • train_data/alpha_katakana.txt:
      academy,アカデミー
      accent,アクセント
      access,アクセス
      accident,アクシデント
      acrobat,アクロバット
      action,アクション
      adapter,アダプター
      africa,アフリカ
      airbus,エアバス
      alaska,アラスカ
      alcohol,アルコール
      allergy,アレルギー
  44. Training data in NLP4L • train_data/alpha_katakana.txt is aligned into train_data/alpha_katakana_aligned.txt:
      アaカcaデdeミーmy
      アaクcセceンnトt
      アaクcセceスss
      アaクcシciデdeンnトt
      アaクcロroバッbaトt
      アaクcショtioンn
      アaダdaプpターter
      アaフfリriカca
      エaアirバbuスs
      アaラlaスsカka
      アaルlコーcohoルl
      アaレlleルrギーgy
  45. Demo: Transliteration
      nlp4l> :load examples/trans_katakana_alpha.scala

      import scala.io.Source
      import scala.util.matching.Regex

      // `index` is defined elsewhere in the script (not shown on the slide)
      val indexer = new HmmModelIndexer(index)
      val file = Source.fromFile("train_data/alpha_katakana_aligned.txt", "UTF-8")
      // one Katakana chunk, followed by its alphabet chunk, followed by the rest of the line
      val pattern: Regex = """([\u30A0-\u30FF]+)([a-zA-Z]+)(.*)""".r

      // split an aligned line into a list of (katakana, alphabet) pairs
      def align(result: List[(String, String)], str: String): List[(String, String)] = {
        str match {
          case pattern(a, b, c) => align(result :+ ((a, b)), c)
          case _ => result
        }
      }

      // create hmm model index
      file.getLines.foreach { line: String =>
        val doc = align(List.empty[(String, String)], line)
        indexer.addDocument(doc)
      }
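To see what the `align` helper in the demo produces, the function can be run on its own without NLP4L. The sketch below copies the regular expression from the slide, assuming its character class is the Katakana Unicode block `\u30A0-\u30FF` (the backslashes are missing in the slide text). The wrapper object `KatakanaAlignment` is added here only to keep the example self-contained.

```scala
import scala.util.matching.Regex

// Stand-alone copy of the demo's alignment step; the wrapper object is hypothetical.
object KatakanaAlignment {
  // Group 1: a run of Katakana, group 2: a run of Latin letters, group 3: the rest of the line.
  val pattern: Regex = """([\u30A0-\u30FF]+)([a-zA-Z]+)(.*)""".r

  /** Peels (katakana, alphabet) pairs off the front of `str` until nothing matches. */
  @annotation.tailrec
  def align(result: List[(String, String)], str: String): List[(String, String)] =
    str match {
      case pattern(a, b, c) => align(result :+ ((a, b)), c)
      case _ => result
    }
}

// One line of train_data/alpha_katakana_aligned.txt ("academy"):
println(KatakanaAlignment.align(Nil, "アaカcaデdeミーmy"))
// List((ア,a), (カ,ca), (デ,de), (ミー,my))
```

Each pair is one (Katakana chunk, alphabet chunk) observation; the demo passes the resulting list to `indexer.addDocument`.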
  46. Demo: Transliteration
      Input | Prediction | Right Answer
      アルゴリズム | algorism | algorithm
      プログラム | program | (OK)
      ケミカル | chaemmical | chemical
      ダイニング | dining | (OK)
      コミッター | committer | (OK)
      エントリー | entree | entry
  47. Gathering loan words • [Flow diagram, steps ① to ⑥] crawl → gather Katakana-Alphabet string pairs (e.g. アルゴリズム, algorithm) → transliterate the Katakana (アルゴリズム → algorism) → calculate the edit distance between the transliteration and the alphabet string → store the pair of strings if the edit distance is small enough → synonyms.txt
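The "edit distance is small enough" check can be sketched with a standard Levenshtein distance. The slides do not say which distance or threshold NLP4L uses, so both are assumptions here, as are the names `LoanwordFilter`, `editDistance`, `isSynonymCandidate`, and the default `maxDistance` of 2.

```scala
// Hypothetical sketch of the filtering step; the threshold is an assumed value.
object LoanwordFilter {
  /** Levenshtein distance: minimum number of single-character insertions, deletions, substitutions. */
  def editDistance(a: String, b: String): Int = {
    // Row for the empty prefix of `a`: turning "" into b.take(j) costs j insertions.
    val firstRow: Array[Int] = (0 to b.length).toArray
    val lastRow = a.indices.foldLeft(firstRow) { (prev, i) =>
      val cur = new Array[Int](b.length + 1)
      cur(0) = i + 1
      for (j <- b.indices) {
        val cost = if (a(i) == b(j)) 0 else 1
        cur(j + 1) = math.min(math.min(cur(j) + 1, prev(j + 1) + 1), prev(j) + cost)
      }
      cur
    }
    lastRow(b.length)
  }

  /** Keep a crawled (Katakana, English) pair if the predicted transliteration is close to the English word. */
  def isSynonymCandidate(predicted: String, found: String, maxDistance: Int = 2): Boolean =
    editDistance(predicted, found) <= maxDistance
}

// The predicted "algorism" vs. the crawled "algorithm": one substitution plus one insertion.
println(LoanwordFilter.editDistance("algorism", "algorithm"))        // 2
println(LoanwordFilter.isSynonymCandidate("algorism", "algorithm"))  // true with the assumed threshold
```

A fixed threshold treats short and long words alike; normalizing the distance by word length would be one alternative design.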
  48. Agenda • What’s NLP4L? • How NLP improves search experience • Calculate probabilities using ShingleFilter • Transliteration (Application for HMM) • NLP4L Framework (coming soon)
  49. NLP4L Framework • A pluggable framework that improves the search experience (mainly for Lucene-based search systems). • Reference implementations of plug-ins and corpora are provided. • Uses NLP/ML technologies to output models, dictionaries and indexes. • Since NLP/ML are not perfect, an interface that lets users examine the output dictionaries themselves is provided as well.
  53. [Architecture diagram] Solr, ES, Mahout, Spark • Data Source: Corpus (text data, Lucene index), Query Log, Access Log • Dictionaries: Suggestion (auto complete), Did you mean?, synonyms.txt, userdic.txt, keyword attachment maintenance • Model files, Tagged Corpus, Document Vectors • TermExtractor, Transliteration, NEE, Classification, Document Vectors, Language Detection, Learning to Rank, Personalized Search
  57. example: Keyword Attachment • UI prototype for NLP4L Framework (Lucia): https://github.com/NLP4L/lucia • Information about the associated Solr collection (core) • NLP/ML task (processor) chain described in HOCON (Human-Optimized Config Object Notation)
  58. example: Keyword Attachment • Keywords extracted from all documents, e.g. Named Entities by OpenNLP
  59. example: Keyword Attachment • Information about the associated Solr document (unique key, etc.) • Keywords extracted from this document • Solr field name for each keyword
  60. example: Keyword Attachment • Check the keywords and remove wrong / inappropriate entries
  61. example: Keyword Attachment • Sync (attach) all keywords to Solr documents (by partial update command)
  62. example: Keyword Attachment • Solr document (before keywords are attached)
  63. example: Keyword Attachment • Solr document (after keywords are attached)
  64. example: Keyword Attachment • If you delete keyword(s) that have already been attached to Solr documents, the keyword(s) will also be removed from the Solr index the next time the “synch” action is executed.
  65. Keyword Attachment Application • “Keyword attachment” is a general format that enables the following functions: Learning to Rank, Personalized Search, Named Entity Extraction, Document Classification • [Diagram: attaching a keyword to a Lucene doc increases its boost]
  66. Before Learning to Rank • [Figure: target documents within the result list at ranks 1, 2, 3, …, 50, 100, 500, …]
  67. After Learning to Rank • [Figure: target documents within the result list at ranks 1, 2, 3, …, 50, 100, 500, …]
  68. Learning to Rank • The program learns, from the access log and other sources, that the score of document d for a query q should be larger than the normal score(q,d) • [Diagram: Lucene doc d with attached queries q, q, …] • https://en.wikipedia.org/wiki/Learning_to_rank
  69. Personalized Search • q=apple • computer … • [Figure: target documents within the result list at ranks 1, 2, 3, …, 50, 100, 500, …]
  70. Personalized Search • q=apple • fruit … • [Figure: target documents within the result list at ranks 50, 100, 500, …, 1, 2, 3, …]
  71. Personalized Search • [Diagram: Lucene doc d1 with q1u1, q2u2; Lucene doc d2 with q2u1, q1u2] • The program learns, from the access log and other sources, that the score of document d for a query q by user u should be larger than the normal score(q,d) • Since Lucene does not let you specify score(q,d,u), you have to specify score(qu,d) instead. • Because the number of q-u combinations can be enormous, limit the data to high-order queries or divide fields by user.
  72. example: Generating Synonyms (loanwords) • Execute a job that generates pairs of Katakana and corresponding English words from a corpus
  73. example: Generating Synonyms (loanwords) • Make adjustments to the auto-generated pairs (candidate synonyms) via the web UI
  74. example: Generating Synonyms (loanwords) • Exported pairs can be used in SynonymFilter • synonyms_loadwords_ja.txt:
      acacia,アカシア
      academy,アカデミー
      acatenango,アカテナンゴ
      access,アクセス
      accident,アクシデント
      action,アクション
      active,アクティブ
      activision,アクティビジョン
      acton,アクトン
      actor,アクター
      ……
  75. Join and Code with Us! https://github.com/NLP4L • Contact us at koji at apache dot org for the details.
  76. Thank you!
