An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

NLP4L slides for Scala by the Bay / Big Data Scala 2015. Published in: Technology.
  1. 1. An Introduction to NLP4L: a Natural Language Processing tool for Apache Lucene. Koji Sekiguchi @kojisays, Founder & CEO, RONDHUIT
  2. 2. Agenda • What's NLP4L? • How NLP improves the search experience • Count the number of words in a Lucene index • Application: Transliteration • Future Plans 2
  3. 3. Agenda • What's NLP4L? • How NLP improves the search experience • Count the number of words in a Lucene index • Application: Transliteration • Future Plans 3
  4. 4. What's NLP4L? 4
  5. 5. What's NLP4L? • GOAL • Improve Lucene users' search experience • FEATURES • Use of a Lucene index as a Corpus Database • Lucene API front-end written in Scala • NLP4L provides • Preprocessors for existing ML tools • Provision of ML algorithms and applications (e.g. Transliteration) 5
  6. 6. What's NLP4L? • GOAL • Improve Lucene users' search experience • FEATURES • Use of a Lucene index as a Corpus Database • Lucene API front-end written in Scala • NLP4L provides • Preprocessors for existing ML tools • Provision of ML algorithms and applications (e.g. Transliteration) 6
  7. 7. What's NLP4L? • GOAL • Improve Lucene users' search experience • FEATURES • Use of a Lucene index as a Corpus Database • Lucene API front-end written in Scala • NLP4L provides • Preprocessors for existing ML tools • Provision of ML algorithms and applications (e.g. Transliteration) 7
  8. 8. What's Lucene? (diagram: documents 1: "Alice ate an apple." 2: "Mike likes an orange." 3: "An apple is red." are indexed into an inverted index: alice 1; an 1, 2, 3; apple 1, 3; ate 1; is 3; likes 2; mike 2; orange 2; red 3; searching for "apple" looks it up in the index) Lucene is a high-performance, full-featured text search engine library written entirely in Java. 8
  9. 9. Agenda • What's NLP4L? • How NLP improves the search experience • Count the number of words in a Lucene index • Application: Transliteration • Future Plans 9
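The inverted index on the slide can be sketched in a few lines of plain Scala (a toy model for illustration; Lucene's real data structures are far more involved):

```scala
// Build a toy inverted index over the slide's three documents.
val docs = Map(
  1 -> "Alice ate an apple.",
  2 -> "Mike likes an orange.",
  3 -> "An apple is red."
)

// term -> sorted list of document ids containing it
val inverted: Map[String, List[Int]] =
  docs.toList
    .flatMap { case (id, text) =>
      // lowercase and split on non-word characters, one posting per distinct term
      text.toLowerCase.split("\\W+").filter(_.nonEmpty).distinct.map(term => (term, id))
    }
    .groupBy(_._1)
    .map { case (term, postings) => term -> postings.map(_._2).sorted }

inverted("apple") // List(1, 3): "searching" is a lookup in this map
inverted("an")    // List(1, 2, 3)
```

Searching for "apple" is then just a map lookup, which is why the inverted index is the core of full-text search.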
  10. 10. Evaluation Measures 10
  11. 11. Evaluation Measures (diagram: the target set) 11
  12. 12. Evaluation Measures (diagram: target and result sets) 12
  13. 13. Evaluation Measures (diagram: tp, fp, fn, tn regions) 13
  14. 14. Evaluation Measures (diagram: the positive region) 14
  15. 15. Evaluation Measures (diagram: the negative region of the result) 15
  16. 16. Evaluation Measures (diagram: true positive, true negative) 16
  17. 17. Evaluation Measures (diagram: false positive, false negative) 17
  18. 18. Evaluation Measures (diagram: target and result; tp, fp, fn, tn) precision = tp / (tp + fp), recall = tp / (tp + fn) 18
  19. 19. Recall, Precision: precision = tp / (tp + fp), recall = tp / (tp + fn) 19
  20. 20. Recall, Precision (diagram: target and result; tp, fp, fn, tn): precision = tp / (tp + fp), recall = tp / (tp + fn) 20
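The two measures on the slides can be computed directly from the confusion counts; a minimal sketch (helper names are illustrative, not NLP4L API):

```scala
// tp = relevant docs retrieved, fp = irrelevant docs retrieved,
// fn = relevant docs missed. Names here are illustrative only.
def precision(tp: Int, fp: Int): Double = tp.toDouble / (tp + fp)
def recall(tp: Int, fn: Int): Double = tp.toDouble / (tp + fn)

val p = precision(tp = 8, fp = 2) // 0.8: 8 of the 10 retrieved docs are relevant
val r = recall(tp = 8, fn = 4)    // 0.667: 8 of the 12 relevant docs were retrieved
```

Filtering typically raises precision (fewer fp) while synonym expansion raises recall (fewer fn), which is exactly the trade-off the next slides discuss.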
  21. 21. Solution (table): n-gram, synonym dictionary, etc. improves recall; facet (filter query) improves precision; Ranking Tuning improves recall, precision 21
  22. 22. Solution (table): n-gram, synonym dictionary, etc. improves recall; facet (filter query) improves precision; Ranking Tuning improves recall, precision 22
  23. 23. Solution (table): n-gram, synonym dictionary, etc. (e.g. Transliteration) improves recall; facet (filter query) improves precision; Ranking Tuning improves recall, precision 23
  24. 24. Solution (table): n-gram, synonym dictionary, etc. (e.g. Transliteration) improves recall; facet (filter query) (e.g. Named Entity Extraction) improves precision; Ranking Tuning improves recall, precision 24
  25. 25. Gradual precision improvement: q=watch (diagram: target and result) 25
  26. 26. Gradual precision improvement: filter by Gender=Men's (diagram: target and result) 26
  27. 27. Gradual precision improvement: filter by Gender=Men's, then filter by Price=100-150 (diagram: target and result) 27
  28. 28. Structured Documents (table: ID, product, price, gender): 1, CURREN New Men's Date Stainless Steel Military Sport Quartz Wrist Watch, 8.92, Men's; 2, Suiksilver The Gamer Watch, 87.99, Men's 28
  29. 29. Unstructured Documents ID article 1 David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels. 2 He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants. 29
  30. 30. Make them Structured (table: ID, article, person, org, loc): 1, "David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels." person: David Cameron; org: EU; loc: Brussels. 2, "He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants." org: EU; loc: UK, Britain. NEE[1] extracts interesting words. [1] Named Entity Extraction 30
  31. 31. Manual Tagging using brat 31
  32. 32. Agenda • What's NLP4L? • How NLP improves the search experience • Count the number of words in a Lucene index • Application: Transliteration • Future Plans 32
  33. 33. A small Corpus val CORPUS = Array( "Alice ate an apple.", "Mike likes an orange.", "An apple is red." ) 33
  34. 34. val CORPUS = Array( "Alice ate an apple.", "Mike likes an orange.", "An apple is red." ) val index = "/tmp/index-simple" def schema(): Schema = { val builder = AnalyzerBuilder() builder.withTokenizer("standard") builder.addTokenFilter("lowercase") val analyzer = builder.build val fieldTypes = Map( "text" -> FieldType(analyzer, true, true, true, true) ) val analyzerDefault = analyzer Schema(analyzerDefault, fieldTypes) } def doc(text: String): Document = { Document(Set( Field("text", text) ) ) } val writer = IWriter(index, schema) CORPUS.foreach(text => writer.write(doc(text))) writer.close index_simple.scala 34
  35. 35. val CORPUS = Array( "Alice ate an apple.", "Mike likes an orange.", "An apple is red." ) val index = "/tmp/index-simple" def schema(): Schema = { val builder = AnalyzerBuilder() builder.withTokenizer("standard") builder.addTokenFilter("lowercase") val analyzer = builder.build val fieldTypes = Map( "text" -> FieldType(analyzer, true, true, true, true) ) val analyzerDefault = analyzer Schema(analyzerDefault, fieldTypes) } def doc(text: String): Document = { Document(Set( Field("text", text) ) ) } val writer = IWriter(index, schema) CORPUS.foreach(text => writer.write(doc(text))) writer.close index_simple.scala 35 text data
  36. 36. val CORPUS = Array( "Alice ate an apple.", "Mike likes an orange.", "An apple is red." ) val index = "/tmp/index-simple" def schema(): Schema = { val builder = AnalyzerBuilder() builder.withTokenizer("standard") builder.addTokenFilter("lowercase") val analyzer = builder.build val fieldTypes = Map( "text" -> FieldType(analyzer, true, true, true, true) ) val analyzerDefault = analyzer Schema(analyzerDefault, fieldTypes) } def doc(text: String): Document = { Document(Set( Field("text", text) ) ) } val writer = IWriter(index, schema) CORPUS.foreach(text => writer.write(doc(text))) writer.close index_simple.scala 36 Lucene index directory
  37. 37. val CORPUS = Array( "Alice ate an apple.", "Mike likes an orange.", "An apple is red." ) val index = "/tmp/index-simple" def schema(): Schema = { val builder = AnalyzerBuilder() builder.withTokenizer("standard") builder.addTokenFilter("lowercase") val analyzer = builder.build val fieldTypes = Map( "text" -> FieldType(analyzer, true, true, true, true) ) val analyzerDefault = analyzer Schema(analyzerDefault, fieldTypes) } def doc(text: String): Document = { Document(Set( Field("text", text) ) ) } val writer = IWriter(index, schema) CORPUS.foreach(text => writer.write(doc(text))) writer.close index_simple.scala 37 schema definition
  38. 38. val CORPUS = Array( "Alice ate an apple.", "Mike likes an orange.", "An apple is red." ) val index = "/tmp/index-simple" def schema(): Schema = { val builder = AnalyzerBuilder() builder.withTokenizer("standard") builder.addTokenFilter("lowercase") val analyzer = builder.build val fieldTypes = Map( "text" -> FieldType(analyzer, true, true, true, true) ) val analyzerDefault = analyzer Schema(analyzerDefault, fieldTypes) } def doc(text: String): Document = { Document(Set( Field("text", text) ) ) } val writer = IWriter(index, schema) CORPUS.foreach(text => writer.write(doc(text))) writer.close index_simple.scala 38 create Lucene document
  39. 39. val CORPUS = Array( "Alice ate an apple.", "Mike likes an orange.", "An apple is red." ) val index = "/tmp/index-simple" def schema(): Schema = { val builder = AnalyzerBuilder() builder.withTokenizer("standard") builder.addTokenFilter("lowercase") val analyzer = builder.build val fieldTypes = Map( "text" -> FieldType(analyzer, true, true, true, true) ) val analyzerDefault = analyzer Schema(analyzerDefault, fieldTypes) } def doc(text: String): Document = { Document(Set( Field("text", text) ) ) } val writer = IWriter(index, schema) CORPUS.foreach(text => writer.write(doc(text))) writer.close index_simple.scala 39 open a writer
  40. 40. val CORPUS = Array( "Alice ate an apple.", "Mike likes an orange.", "An apple is red." ) val index = "/tmp/index-simple" def schema(): Schema = { val builder = AnalyzerBuilder() builder.withTokenizer("standard") builder.addTokenFilter("lowercase") val analyzer = builder.build val fieldTypes = Map( "text" -> FieldType(analyzer, true, true, true, true) ) val analyzerDefault = analyzer Schema(analyzerDefault, fieldTypes) } def doc(text: String): Document = { Document(Set( Field("text", text) ) ) } val writer = IWriter(index, schema) CORPUS.foreach(text => writer.write(doc(text))) writer.close index_simple.scala 40 write documents
  41. 41. val CORPUS = Array( "Alice ate an apple.", "Mike likes an orange.", "An apple is red." ) val index = "/tmp/index-simple" def schema(): Schema = { val builder = AnalyzerBuilder() builder.withTokenizer("standard") builder.addTokenFilter("lowercase") val analyzer = builder.build val fieldTypes = Map( "text" -> FieldType(analyzer, true, true, true, true) ) val analyzerDefault = analyzer Schema(analyzerDefault, fieldTypes) } def doc(text: String): Document = { Document(Set( Field("text", text) ) ) } val writer = IWriter(index, schema) CORPUS.foreach(text => writer.write(doc(text))) writer.close index_simple.scala 41 close writer
  42. 42. val CORPUS = Array( "Alice ate an apple.", "Mike likes an orange.", "An apple is red." ) val index = "/tmp/index-simple" def schema(): Schema = { val builder = AnalyzerBuilder() builder.withTokenizer("standard") builder.addTokenFilter("lowercase") val analyzer = builder.build val fieldTypes = Map( "text" -> FieldType(analyzer, true, true, true, true) ) val analyzerDefault = analyzer Schema(analyzerDefault, fieldTypes) } def doc(text: String): Document = { Document(Set( Field("text", text) ) ) } val writer = IWriter(index, schema) CORPUS.foreach(text => writer.write(doc(text))) writer.close index_simple.scala 42 As for code snippets used in my talk, please look at: https://github.com/NLP4L/meetups/tree/master/20150818
  43. 43. Getting word counts alice 1 an 1, 2, 3 apple 1, 3 ate 1 is 3 likes 2 mike 2 orange 2 red 3 43
  44. 44. Getting word counts alice 1 an 1, 2, 3 apple 1, 3 ate 1 is 3 likes 2 mike 2 orange 2 red 3 val reader = RawReader(index) reader.sumTotalTermFreq("text") // -> 12 reader.field("text").get.terms.size // -> 9 reader.totalTermFreq("text", "an") // -> 3 reader.close getting_word_counts.scala 44
  45. 45. Getting word counts alice 1 an 1, 2, 3 apple 1, 3 ate 1 is 3 likes 2 mike 2 orange 2 red 3 val reader = RawReader(index) reader.sumTotalTermFreq("text") // -> 12 reader.field("text").get.terms.size // -> 9 reader.totalTermFreq("text", "an") // -> 3 reader.close getting_word_counts.scala 45
  46. 46. Getting word counts alice 1 an 1, 2, 3 apple 1, 3 ate 1 is 3 likes 2 mike 2 orange 2 red 3 val reader = RawReader(index) reader.sumTotalTermFreq("text") // -> 12 reader.field("text").get.terms.size // -> 9 reader.totalTermFreq("text", "an") // -> 3 reader.close getting_word_counts.scala 46
  47. 47. Getting top terms alice 1 an 1, 2, 3 apple 1, 3 ate 1 is 3 likes 2 mike 2 orange 2 red 3 val reader = RawReader(index) reader.topTermsByDocFreq("text") reader.topTermsByTotalTermFreq("text") // -> // (term, docFreq, totalTermFreq) // (an,3,3) // (apple,2,2) // (likes,1,1) // (is,1,1) // (orange,1,1) // (mike,1,1) // (ate,1,1) // (red,1,1) // (alice,1,1) reader.close getting_word_counts.scala 47
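The reader calls above return the same numbers you get by counting tokens directly; a plain-Scala cross-check (approximating the "standard" tokenizer + "lowercase" filter by splitting on non-word characters):

```scala
// Count tokens over the corpus the way the index statistics report them.
val CORPUS = Array(
  "Alice ate an apple.",
  "Mike likes an orange.",
  "An apple is red."
)
val tokens = CORPUS.toList.flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
val counts = tokens.groupBy(identity).map { case (t, occ) => t -> occ.length }

tokens.length // 12, cf. reader.sumTotalTermFreq("text")
counts.size   // 9,  cf. reader.field("text").get.terms.size
counts("an")  // 3,  cf. reader.totalTermFreq("text", "an")
```

The point of NLP4L's RawReader is that these statistics come for free from the Lucene index, with no second pass over the raw text.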
  48. 48. What's ShingleFilter? • ShingleFilter = Word n-gram TokenFilter. WhitespaceTokenizer then ShingleFilter (N=2): "Lucene is a popular software" is tokenized as Lucene/is/a/popular/software, and shingled as Lucene is / is a / a popular / popular software 48
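ShingleFilter's word bigrams can be sketched with Scala's sliding windows (plain collections, not the Lucene TokenStream API):

```scala
// Word bigrams ("shingles" with N=2) via a sliding window of size 2.
val words = "Lucene is a popular software".split(" ").toList
val bigrams = words.sliding(2).map(_.mkString(" ")).toList
// bigrams == List("Lucene is", "is a", "a popular", "popular software")
```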
  49. 49. Language Model • An LM represents the fluency of language • The N-gram model is the most widely used LM • Calculation example for 2-gram 49
  50. 50. val index = "/tmp/index-lm" val CORPUS = Array( "Alice ate an apple.", "Mike likes an orange.", "An apple is red." ) def schema(): Schema = { val builder = AnalyzerBuilder() builder.withTokenizer("standard") builder.addTokenFilter("lowercase") val analyzer = builder.build builder.addTokenFilter("shingle", "minShingleSize", "2", "maxShingleSize", "2", "outputUnigrams", "false") val analyzer2g = builder.build val fieldTypes = Map( "word" -> FieldType(analyzer, true, true, true, true), "word2g" -> FieldType(analyzer2g, true, true, true, true) ) val analyzerDefault = analyzer Schema(analyzerDefault, fieldTypes) } // create a language model index val writer = IWriter(index, schema()) def addDocument(doc: String): Unit = { writer.write(Document(Set( Field("word", doc), Field("word2g", doc) ))) } CORPUS.foreach(addDocument(_)) writer.close() language_model.scala 1/2 50
  51. 51. val index = "/tmp/index-lm" val CORPUS = Array( "Alice ate an apple.", "Mike likes an orange.", "An apple is red." ) def schema(): Schema = { val builder = AnalyzerBuilder() builder.withTokenizer("standard") builder.addTokenFilter("lowercase") val analyzer = builder.build builder.addTokenFilter("shingle", "minShingleSize", "2", "maxShingleSize", "2", "outputUnigrams", "false") val analyzer2g = builder.build val fieldTypes = Map( "word" -> FieldType(analyzer, true, true, true, true), "word2g" -> FieldType(analyzer2g, true, true, true, true) ) val analyzerDefault = analyzer Schema(analyzerDefault, fieldTypes) } // create a language model index val writer = IWriter(index, schema()) def addDocument(doc: String): Unit = { writer.write(Document(Set( Field("word", doc), Field("word2g", doc) ))) } CORPUS.foreach(addDocument(_)) writer.close() language_model.scala 1/2 51 schema definition
  52. 52. val reader = RawReader(index) // P(apple|an) = C(an apple) / C(an) val count_an_apple = reader.totalTermFreq("word2g", "an apple") val count_an = reader.totalTermFreq("word", "an") val prob_apple_an = count_an_apple.toFloat / count_an.toFloat // P(orange|an) = C(an orange) / C(an) val count_an_orange = reader.totalTermFreq("word2g", "an orange") val prob_orange_an = count_an_orange.toFloat / count_an.toFloat reader.close language_model.scala 2/2 52
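The same conditional probabilities can be computed without the index, counting unigrams and bigrams directly over the corpus (a sketch of the math, not the NLP4L API):

```scala
// 2-gram conditional probabilities counted directly over the corpus.
val CORPUS = Array(
  "Alice ate an apple.",
  "Mike likes an orange.",
  "An apple is red."
)
val sentences = CORPUS.map(_.toLowerCase.split("\\W+").filter(_.nonEmpty).toList)
val unigrams = sentences.toList.flatten
val bigrams  = sentences.toList.flatMap(_.sliding(2).map(_.mkString(" ")))

def c1(w: String): Int = unigrams.count(_ == w)
def c2(b: String): Int = bigrams.count(_ == b)

// P(apple|an) = C(an apple) / C(an) = 2 / 3
val probAppleGivenAn = c2("an apple").toDouble / c1("an")
// P(orange|an) = C(an orange) / C(an) = 1 / 3
val probOrangeGivenAn = c2("an orange").toDouble / c1("an")
```

In language_model.scala the unigram counts come from the "word" field and the bigram counts from the "word2g" shingle field, but the arithmetic is the same.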
  53. 53. Alice/NNP ate/VB an/AT apple/NNP ./. Mike/NNP likes/VB an/AT orange/NNP ./. An/AT apple/NNP is/VB red/JJ ./. NNP Proper noun, singular VB Verb AT Article JJ Adjective . period Part-of-Speech Tagging 53 Our Corpus for training
  54. 54. Hidden Markov Model 54
  55. 55. Hidden Markov Model 55 Series of Words
  56. 56. Hidden Markov Model 56 Series of Part-of-Speech
  57. 57. Hidden Markov Model 57
  58. 58. Hidden Markov Model 58
  59. 59. HMM state diagram (initial probabilities: NNP 0.667, VB 0.0, . 0.0, JJ 0.0, AT 0.333; transition probabilities on the arcs: 1.0, 1.0, 0.4, 0.6, 0.667, 0.333; emission probabilities: alice 0.2, apple 0.4, mike 0.2, orange 0.2, ate 0.333, is 0.333, likes 0.333, an 1.0, red 1.0, . 1.0) 59
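The numbers in the state diagram are maximum-likelihood estimates from the tagged corpus. A plain-Scala sketch that reproduces a few of them (the `<s>` start state is a modeling convention added here, not part of the corpus):

```scala
// Maximum-likelihood HMM parameter estimates from the tagged corpus.
val CORPUS = Array(
  "Alice/NNP ate/VB an/AT apple/NNP ./.",
  "Mike/NNP likes/VB an/AT orange/NNP ./.",
  "An/AT apple/NNP is/VB red/JJ ./."
)

// Each sentence as (word, tag) pairs.
val tagged: List[List[(String, String)]] = CORPUS.toList.map {
  _.split("\\s+").toList.map { t =>
    val parts = t.split("/")
    (parts(0).toLowerCase, parts(1))
  }
}

// Transition probability P(to | from), with "<s>" prepended to each sentence.
val transitions = tagged.flatMap { sent =>
  ("<s>" :: sent.map(_._2)).sliding(2).map(p => (p(0), p(1)))
}
def pTrans(from: String, to: String): Double =
  transitions.count(_ == (from, to)).toDouble / transitions.count(_._1 == from)

// Emission probability P(word | tag).
val pairs = tagged.flatten
def pEmit(tag: String, word: String): Double =
  pairs.count(p => p._2 == tag && p._1 == word).toDouble / pairs.count(_._2 == tag)

pTrans("<s>", "NNP") // 0.667: two of the three sentences start with NNP
pTrans("AT", "NNP")  // 1.0: every article is followed by a proper noun here
pEmit("NNP", "apple") // 0.4: apple is 2 of the 5 NNP tokens
```

HmmModelIndexer stores exactly these counts in the Lucene index, so the model can be estimated at query time without re-reading the corpus.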
  60. 60. val index = "/tmp/index-hmm" val CORPUS = Array( "Alice/NNP ate/VB an/AT apple/NNP ./.", "Mike/NNP likes/VB an/AT orange/NNP ./.", "An/AT apple/NNP is/VB red/JJ ./." ) val indexer = HmmModelIndexer(index) CORPUS.foreach{ text => val pairs = text.split("\\s+") val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))} indexer.addDocument(doc) } indexer.close() // execute part-of-speech tagging on an unknown text val model = HmmModel(index) val tagger = HmmTagger(model) tagger.tokens("alice likes an apple .") hmm.scala 60
  61. 61. val index = "/tmp/index-hmm" val CORPUS = Array( "Alice/NNP ate/VB an/AT apple/NNP ./.", "Mike/NNP likes/VB an/AT orange/NNP ./.", "An/AT apple/NNP is/VB red/JJ ./." ) val indexer = HmmModelIndexer(index) CORPUS.foreach{ text => val pairs = text.split("\\s+") val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))} indexer.addDocument(doc) } indexer.close() // execute part-of-speech tagging on an unknown text val model = HmmModel(index) val tagger = HmmTagger(model) tagger.tokens("alice likes an apple .") hmm.scala 61 text data (they are tagged!)
  62. 62. val index = "/tmp/index-hmm" val CORPUS = Array( "Alice/NNP ate/VB an/AT apple/NNP ./.", "Mike/NNP likes/VB an/AT orange/NNP ./.", "An/AT apple/NNP is/VB red/JJ ./." ) val indexer = HmmModelIndexer(index) CORPUS.foreach{ text => val pairs = text.split("\\s+") val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))} indexer.addDocument(doc) } indexer.close() // execute part-of-speech tagging on an unknown text val model = HmmModel(index) val tagger = HmmTagger(model) tagger.tokens("alice likes an apple .") hmm.scala 62 write-open Lucene index
  63. 63. val index = "/tmp/index-hmm" val CORPUS = Array( "Alice/NNP ate/VB an/AT apple/NNP ./.", "Mike/NNP likes/VB an/AT orange/NNP ./.", "An/AT apple/NNP is/VB red/JJ ./." ) val indexer = HmmModelIndexer(index) CORPUS.foreach{ text => val pairs = text.split("\\s+") val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))} indexer.addDocument(doc) } indexer.close() // execute part-of-speech tagging on an unknown text val model = HmmModel(index) val tagger = HmmTagger(model) tagger.tokens("alice likes an apple .") hmm.scala 63 tagged texts are indexed here
  64. 64. val index = "/tmp/index-hmm" val CORPUS = Array( "Alice/NNP ate/VB an/AT apple/NNP ./.", "Mike/NNP likes/VB an/AT orange/NNP ./.", "An/AT apple/NNP is/VB red/JJ ./." ) val indexer = HmmModelIndexer(index) CORPUS.foreach{ text => val pairs = text.split("\\s+") val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))} indexer.addDocument(doc) } indexer.close() // execute part-of-speech tagging on an unknown text val model = HmmModel(index) val tagger = HmmTagger(model) tagger.tokens("alice likes an apple .") hmm.scala 64 make an HmmModel from Lucene index
  65. 65. val index = "/tmp/index-hmm" val CORPUS = Array( "Alice/NNP ate/VB an/AT apple/NNP ./.", "Mike/NNP likes/VB an/AT orange/NNP ./.", "An/AT apple/NNP is/VB red/JJ ./." ) val indexer = HmmModelIndexer(index) CORPUS.foreach{ text => val pairs = text.split("\\s+") val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))} indexer.addDocument(doc) } indexer.close() // execute part-of-speech tagging on an unknown text val model = HmmModel(index) val tagger = HmmTagger(model) tagger.tokens("alice likes an apple .") hmm.scala 65 get HmmTagger from HmmModel
  66. 66. val index = "/tmp/index-hmm" val CORPUS = Array( "Alice/NNP ate/VB an/AT apple/NNP ./.", "Mike/NNP likes/VB an/AT orange/NNP ./.", "An/AT apple/NNP is/VB red/JJ ./." ) val indexer = HmmModelIndexer(index) CORPUS.foreach{ text => val pairs = text.split("\\s+") val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))} indexer.addDocument(doc) } indexer.close() // execute part-of-speech tagging on an unknown text val model = HmmModel(index) val tagger = HmmTagger(model) tagger.tokens("alice likes an apple .") hmm.scala 66 use HmmTagger to annotate unknown sentence
  67. 67. val index = "/tmp/index-hmm" val CORPUS = Array( "Alice/NNP ate/VB an/AT apple/NNP ./.", "Mike/NNP likes/VB an/AT orange/NNP ./.", "An/AT apple/NNP is/VB red/JJ ./." ) val indexer = HmmModelIndexer(index) CORPUS.foreach{ text => val pairs = text.split("\\s+") val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))} indexer.addDocument(doc) } indexer.close() // execute part-of-speech tagging on an unknown text val model = HmmModel(index) val tagger = HmmTagger(model) tagger.tokens("alice likes an apple .") hmm.scala NLP4L has hmm_postagger.scala in the examples directory. It uses the Brown corpus for HMM training. 67
  68. 68. Agenda • What's NLP4L? • How NLP improves the search experience • Count the number of words in a Lucene index • Application: Transliteration • Future Plans 68
  69. 69. Transliteration • Transliteration is the process of transcribing letters or words from one alphabet to another to facilitate comprehension and pronunciation for non-native speakers. Examples of transliteration from English to Japanese: computer コンピューター, server サーバー, internet インターネット, mouse マウス, information インフォメーション 69
  70. 70. It helps improve recall: you search for the English word "mouse" 70
  71. 71. It helps improve recall: but you get マウス (=mouse) highlighted in Japanese 71
  72. 72. Training data in NLP4L アaカcaデdeミーmy アaクcセceンnトt アaクcセceスss アaクcシciデdeンnトt アaクcロroバッbaトt アaクcショtioンn アaダdaプpターter アaフfリriカca エaアirバbuスs アaラlaスsカka アaルlコーcohoルl アaレlleルrギーgy train_data/alpha_katakana.txt train_data/alpha_katakana_aligned.txt 72 academy,アカデミー accent,アクセント access,アクセス accident,アクシデント acrobat,アクロバット action,アクション adapter,アダプター africa,アフリカ airbus,エアバス alaska,アラスカ alcohol,アルコール allergy,アレルギー
  73. 73. Demo: Transliteration Input Prediction Right Answer アルゴリズム algorism algorithm プログラム program (OK) ケミカル chaemmical chemical ダイニング dining (OK) コミッター committer (OK) エントリー entree entry nlp4l> :load examples/trans_katakana_alpha.scala 73
  74. 74. Gathering loan words (flow diagram, steps ①-⑥): crawl; gather Katakana-Alphabet string pairs (アルゴリズム, algorithm); Transliteration (アルゴリズム → algorism); calculate edit distance; store the pair of strings in synonyms.txt if the edit distance is small enough 74
  75. 75. Gathering loan words (flow diagram, steps ①-⑥): crawl; gather Katakana-Alphabet string pairs (アルゴリズム, algorithm); Transliteration (アルゴリズム → algorism); calculate edit distance; store the pair of strings in synonyms.txt if the edit distance is small enough. Got 1,800+ records of synonym knowledge from jawiki 75
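The "calculate edit distance" step can be sketched with the standard dynamic-programming Levenshtein distance (a textbook implementation; NLP4L's actual code may differ):

```scala
// Standard dynamic-programming Levenshtein (edit) distance.
def editDistance(a: String, b: String): Int = {
  val dp = Array.ofDim[Int](a.length + 1, b.length + 1)
  for (i <- 0 to a.length) dp(i)(0) = i // deleting i chars
  for (j <- 0 to b.length) dp(0)(j) = j // inserting j chars
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    dp(i)(j) = List(
      dp(i - 1)(j) + 1,        // deletion
      dp(i)(j - 1) + 1,        // insertion
      dp(i - 1)(j - 1) + cost  // substitution or match
    ).min
  }
  dp(a.length)(b.length)
}

// The pair from the slides: the model's prediction vs. the crawled word.
editDistance("algorism", "algorithm") // 2, small enough to keep the pair
```

A small distance between the transliterated prediction and a crawled alphabet string is the signal that the Katakana-Alphabet pair belongs in synonyms.txt.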
  76. 76. Agenda • What's NLP4L? • How NLP improves the search experience • Count the number of words in a Lucene index • Application: Transliteration • Future Plans 76
  77. 77. NLP4L Framework • A framework that improves the search experience (mainly for Lucene-based search systems). Pluggable. • Reference implementations of plug-ins and corpora provided. • Uses NLP/ML technologies to output models, dictionaries and indexes. • Since NLP/ML are not perfect, an interface that lets users manually examine output dictionaries is provided as well. 77
  78. 78. NLP4L Framework • A framework that improves the search experience (mainly for Lucene-based search systems). Pluggable. • Reference implementations of plug-ins and corpora provided. • Uses NLP/ML technologies to output models, dictionaries and indexes. • Since NLP/ML are not perfect, an interface that lets users manually examine output dictionaries is provided as well. 78
  79. 79. NLP4L Framework • A framework that improves the search experience (mainly for Lucene-based search systems). Pluggable. • Reference implementations of plug-ins and corpora provided. • Uses NLP/ML technologies to output models, dictionaries and indexes. • Since NLP/ML are not perfect, an interface that lets users manually examine output dictionaries is provided as well. 79
  80. 80. NLP4L Framework • A framework that improves the search experience (mainly for Lucene-based search systems). Pluggable. • Reference implementations of plug-ins and corpora provided. • Uses NLP/ML technologies to output models, dictionaries and indexes. • Since NLP/ML are not perfect, an interface that lets users manually examine output dictionaries is provided as well. 80
  81. 81. Architecture (diagram): Data Source (Corpus: text data, Lucene index; Query Log; Access Log); NLP4L components (TermExtractor, Transliteration, NEE, Classification, Document Vectors, Language Detection, Learning to Rank, Personalized Search); outputs: Dictionaries (Suggestion / auto complete, Did you mean?, synonyms.txt, userdic.txt, keyword attachment; with maintenance), Model files, Tagged Corpus, Document Vectors; integrations: Solr, ES, Mahout, Spark 81
  82. 82. Keyword Attachment • Keyword attachment is a general format that enables the following functions. • Learning to Rank • Personalized Search • Named Entity Extraction • Document Classification (diagram: Lucene docs with an attached keyword; increase boost ↑) 82
  83. 83. Learning to Rank • The program learns, from access logs and other sources, that the score of document d for a query q should be larger than the normal score(q,d) (diagram: Lucene doc d; q, q, …) https://en.wikipedia.org/wiki/Learning_to_rank 83
  84. 84. Personalized Search • The program learns, from access logs and other sources, that the score of document d for a query q by user u should be larger than the normal score(q,d) • Since Lucene does not let you specify score(q,d,u), you have to specify score(qu,d) instead. • Limit the data to high-order queries or divide fields per user, as the number of q-u combinations can be enormous. (diagram: Lucene doc d1: q1u1, q2u2; Lucene doc d2: q2u1, q1u2) 84
  85. 85. Join and Code with Us! Contact us at koji at apache dot org for the details. 85
  86. 86. Demo or Q & A Thank you! 86
