Language Search

ElasticSearch Boston Meetup - 3/27
Bryan Warner - Traackr

About me
● Bryan Warner - Developer @Traackr
○ bwarner@traackr.com

● I've worked with ElasticSearch since early 2012 ...
before that I had worked with Lucene & Solr

● Primary background is in Java back-end development

● Shifting focus to Scala development over the past year

About Traackr
● Influencer search engine

● We track content daily & in real-time for our database of
influential people

● We leverage ElasticSearch parent/child (top-children)
queries to search content (i.e. the children) to surface
the influencers who've authored it (i.e. the parents)

● Some of our back-end stack includes: ElasticSearch,
MongoDB, Java/Spring, Scala/Akka, etc.

Overview
● Indexing / Querying strategies to support
language-targeted searches within ES

● ES Analyzers / TokenFilters for language analysis

● Custom Analyzers / TokenFilters for ES

● Look at some open-source projects that assist in language
detection & analysis

Use Case
● We have a database of articles written in many
languages

● We want our users to be able to search articles written
in a particular language

● We want that search to handle the nuances for that
particular language

Reference Schema
{
  "settings": {
    "index": {
      "number_of_shards": 6, "number_of_replicas": 1
    },
    "analysis": {
      "analyzer": {}, "tokenizer": {}, "filter": {}
    }
  },
  "mappings": {
    "article": {
      "properties": {
        "text": {"type": "string", "analyzer": "standard", "store": true},
        "author": {"type": "string", "analyzer": "simple", "store": true},
        "date": {"type": "date", "format": "yyyy-MM-dd'T'HH:mm:ssZ", "store": true}
      }
    }
  }
}

Indexing Strategies

Separate indices per language
- OR -
Same index for all languages

Indexing Strategies
Separate Indices per language
PROS
■ Clean separation
■ Truer IDF values
○ IDF = log(numDocs/(docFreq+1)) + 1

CONS
■ Increased Overhead
■ Parent/Child queries -> parent document duplication
○ Same problem for Solr Joins
■ Maintain schema per index

Indexing Strategies
Same index for all languages
PROS
■ One index to maintain (and one schema)
■ Parent/Child queries are fine

CONS
■ Schema complexity grows
■ IDF values might be skewed
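
To make the IDF skew concrete, here is a minimal sketch that plugs hypothetical document counts into the formula quoted earlier, IDF = log(numDocs/(docFreq+1)) + 1. The class name and the counts are illustrative assumptions, not figures from a real index.

```java
final class IdfSkewSketch {
    // The formula from the slide: log(numDocs / (docFreq + 1)) + 1 (natural log).
    static double idf(long numDocs, long docFreq) {
        return Math.log((double) numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Hypothetical: a French-only index of 10,000 articles, 1,000 containing the term.
        double separate = idf(10_000, 1_000);
        // Hypothetical: the same 1,000 matches inside a shared index of 100,000 articles.
        double shared = idf(100_000, 1_000);
        System.out.printf("separate=%.3f shared=%.3f%n", separate, shared);
    }
}
```

With these made-up counts the term scores roughly 3.3 in the French-only index and roughly 5.6 in the shared one: a word that is common within its own language looks rare once documents in other languages are counted in numDocs.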

Indexing Strategies
Same index for all languages ... how?
1. Create different "mapping" types per language
a. At indexing time, we set the right mapping based on
the article's language

2. Create different fields per language-analyzed field
a. At indexing time, we populate the correct text field
based on the article's language

"mappings": {
  "article_en": {
    "properties": {
      "text": {"type": "string", "analyzer": "english", "store": true},
      "author": {"type": "string", "analyzer": "simple", "store": true}
    }
  },
  "article_fr": {
    "properties": {
      "text": {"type": "string", "analyzer": "french", "store": true}
    }
  },
  "article_de": {
    "properties": {
      "text": {"type": "string", "analyzer": "german", "store": true}
    }
  }
}

"mappings": {
  "article": {
    "properties": {
      "text_en": {"type": "string", "analyzer": "english", "store": true},
      "text_fr": {"type": "string", "analyzer": "french", "store": true},
      "text_de": {"type": "string", "analyzer": "german", "store": true}
    }
  }
}

Querying Strategies
How do we execute a language-targeted search?

... all based on our indexing strategy.

Querying Strategies
(1) Separate Indices per language
...
String targetIndex = getIndexForLanguage(languageParam);
SearchRequestBuilder request = client.prepareSearch(targetIndex)
    .setTypes("article");

// Build a query-string query against the language-analyzed field
QueryStringQueryBuilder query = QueryBuilders.queryString(
    "boston elasticsearch");
query.field("text");
query.analyzer("english"); // or "french" / "german": pick the one matching the target index

request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...

Querying Strategies
(2a) Same index for all languages - Diff. mappings
...
String targetMapping = getMappingForLanguage(languageParam);
SearchRequestBuilder request = client.prepareSearch("your_index")
    .setTypes(targetMapping);

query.field("text");

...

Querying Strategies
(2b) Same index for all languages - Diff. fields
...
SearchRequestBuilder request = client.prepareSearch("your_index")
    .setTypes("article");

query.field("text_en"); // or "text_fr" / "text_de": pick one

...
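
The snippets above call getIndexForLanguage and getMappingForLanguage without showing them. A minimal sketch of what such helpers might look like follows. The "articles_" index prefix, the supported-language set, and the getFieldForLanguage helper for strategy 2b are assumptions made for illustration; only the "article_xx" type names and "text_xx" field names come from the mappings shown earlier.

```java
import java.util.Locale;
import java.util.Set;

final class LanguageTargets {
    // Hypothetical: the languages this deployment has analyzers for.
    private static final Set<String> SUPPORTED = Set.of("en", "fr", "de");

    // Normalizes the request parameter and rejects languages we cannot analyze.
    private static String code(String languageParam) {
        String code = languageParam.trim().toLowerCase(Locale.ROOT);
        if (!SUPPORTED.contains(code)) {
            throw new IllegalArgumentException("Unsupported language: " + languageParam);
        }
        return code;
    }

    // Strategy 1: one index per language (the "articles_" prefix is an assumption).
    static String getIndexForLanguage(String languageParam) {
        return "articles_" + code(languageParam);
    }

    // Strategy 2a: one mapping type per language, e.g. "article_fr".
    static String getMappingForLanguage(String languageParam) {
        return "article_" + code(languageParam);
    }

    // Strategy 2b: one field per language, e.g. "text_de" (hypothetical helper).
    static String getFieldForLanguage(String languageParam) {
        return "text_" + code(languageParam);
    }

    public static void main(String[] args) {
        System.out.println(getIndexForLanguage("fr"));
        System.out.println(getMappingForLanguage("fr"));
        System.out.println(getFieldForLanguage("fr"));
    }
}
```

Failing fast on an unsupported language keeps a typo in the request parameter from silently searching a non-existent index, type, or field.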

Querying Strategies
● Will these strategies support a multi-language search?
○ E.g. Search by French and German
○ E.g. Search against all languages

● Yes! *

● In the same SearchRequest:
○ We can search against multiple indices
○ We can search against multiple "mapping" types
○ We can search against multiple fields

* Need to give thought to which query analyzer to use
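
For the "multiple fields" case, a multi-language search can be sketched as one query_string query listing several per-language fields. The helper below only builds the JSON request body as a string; the class and method names are hypothetical. It sets no "analyzer", on the assumption that each field's own mapped analyzer is then applied at query time, which is worth verifying against your ES version.

```java
import java.util.List;
import java.util.stream.Collectors;

final class MultiLanguageQuerySketch {
    // Builds a query_string request body that targets one text_<code> field per language.
    static String queryStringBody(String queryText, List<String> languageCodes) {
        String fields = languageCodes.stream()
                .map(code -> "\"text_" + code + "\"")
                .collect(Collectors.joining(", "));
        // Minimal JSON escaping for the user's query text (backslashes and quotes only).
        String escaped = queryText.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"query\": {\"query_string\": {\"query\": \"" + escaped
                + "\", \"fields\": [" + fields + "]}}}";
    }

    public static void main(String[] args) {
        System.out.println(queryStringBody("boston elasticsearch", List.of("fr", "de")));
    }
}
```

In real code a JSON library or the Java client's query builders would be the safer way to produce this body; string concatenation is used here only to keep the sketch dependency-free.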

Language Analysis
● What does ElasticSearch and/or Lucene offer us for
analyzing various languages?

● Is there a one-size-fits-all solution?
○ e.g. StandardAnalyzer

● Or do we need custom analyzers for each language?

Language Analysis
StandardAnalyzer - The Good
● For many languages (French, Spanish), it will get you
95% of the way there

● Each language analyzer provides its own flavor to the
StandardAnalyzer

● FrenchAnalyzer
○ Adds an ElisionFilter (l'avion -> avion)
○ Adds French StopWords filter
○ FrenchLightStemFilter
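
To illustrate what elision removal does to a token (l'avion -> avion), here is a simplified sketch. It is not Lucene's ElisionFilter: the article list is a small illustrative assumption, and only the plain ASCII apostrophe is handled.

```java
import java.util.Locale;
import java.util.Set;

final class ElisionSketch {
    // A small, illustrative set of elided French articles/pronouns (an assumption,
    // not Lucene's actual default list).
    private static final Set<String> ARTICLES =
            Set.of("l", "d", "j", "m", "t", "s", "n", "c", "qu");

    // Drops a leading elided article such as "l'" when one is present.
    static String stripElision(String token) {
        int apostrophe = token.indexOf('\'');
        if (apostrophe > 0 && apostrophe < token.length() - 1) {
            String prefix = token.substring(0, apostrophe).toLowerCase(Locale.ROOT);
            if (ARTICLES.contains(prefix)) {
                return token.substring(apostrophe + 1);
            }
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(stripElision("l'avion"));
    }
}
```

Because only known articles are stripped, a word such as aujourd'hui passes through unchanged.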

Language Analysis
StandardAnalyzer - The Bad
● For some languages, it will get you 2/3 of the way there

● German makes heavy use of compound words
■ das Vaterland => The fatherland
■ Rechtsanwaltskanzleien => Law Firms

● For best search results, these compound words should
produce index terms for their individual parts

● GermanAnalyzer lacks a Word Compound Token Filter
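
As a rough illustration of what producing index terms for the individual parts means, the sketch below scans a token for substrings found in a word list. It is a simplified take on the dictionary-based approach, not Lucene's or any plugin's implementation; the tiny dictionaries and the minimum subword size are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class DictionaryDecompoundSketch {
    // Returns every dictionary word found inside the token, ordered by start offset.
    static List<String> subwords(String token, Set<String> dictionary, int minSubwordSize) {
        String lower = token.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        for (int start = 0; start + minSubwordSize <= lower.length(); start++) {
            for (int end = start + minSubwordSize; end <= lower.length(); end++) {
                String candidate = lower.substring(start, end);
                if (dictionary.contains(candidate)) {
                    parts.add(candidate);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical mini-dictionary; a real one holds many thousands of words.
        System.out.println(subwords("Vaterland", Set.of("vater", "land"), 3));
    }
}
```

In an analyzer these parts would be indexed alongside the original token, so a search for "land" can match "Vaterland". The brute-force substring scan is also why a large in-memory word list becomes a cost concern.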

Language Analysis
StandardAnalyzer - The Ugly
● For other languages (e.g. Asian languages), it will not
get you far

● Using a Standard Tokenizer to extract tokens from
Chinese text will not produce accurate terms
○ Some 3rd-party Chinese analyzers will extract
bigrams from Chinese text and index those as if they
were words

● Need to do your research
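
The bigram approach mentioned above can be sketched in a few lines: every pair of adjacent characters becomes an index term. This is an illustrative assumption of how such analyzers behave in the simplest case; it ignores punctuation, whitespace, and mixed-script text.

```java
import java.util.ArrayList;
import java.util.List;

final class BigramSketch {
    // Emits overlapping two-character terms; works on code points so that
    // characters outside the Basic Multilingual Plane are not split in half.
    static List<String> bigrams(String text) {
        int[] codePoints = text.codePoints().toArray();
        List<String> terms = new ArrayList<>();
        for (int i = 0; i + 1 < codePoints.length; i++) {
            terms.add(new String(codePoints, i, 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Four CJK characters, written as escapes to keep the source ASCII-only.
        System.out.println(bigrams("\u4e2d\u534e\u4eba\u6c11"));
    }
}
```

A four-character string yields three overlapping terms. Some of them will not be real words, which is the trade-off of treating bigrams "as if they were words".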

Language Analysis
You should also know about...
● ASCII Folding Token Filter
○ über => uber

● ICU Analysis Plugin
○ http://www.elasticsearch.org/guide/reference/index-modules/analysis/icu-plugin.html
○ Allows for Unicode normalization, collation and
folding
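
The effect of folding (über => uber) can be approximated with the JDK alone: decompose each character, then drop the combining marks. This is a sketch of the idea, not the ASCII Folding Token Filter's actual mapping table; for instance, it leaves characters with no decomposition (such as the German sharp s) untouched.

```java
import java.text.Normalizer;

final class FoldingSketch {
    // NFD splits an accented letter into its base letter plus combining marks;
    // removing the marks (Unicode category M) leaves the base letter.
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "\u00fc" is u-umlaut, written as an escape to keep the source ASCII-only.
        System.out.println(fold("\u00fcber"));
    }
}
```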

Custom Analyzer / Token Filter
● Let's create a custom analyzer definition for German
text (e.g. remove stemming)

● How do we go about doing this?
○ One way is to leverage ElasticSearch's flexible
schema definitions

Lucene 3.6 - org.apache.lucene.analysis.de.GermanAnalyzer

Create a custom German analyzer in our schema:
"settings" : {
  ....
  "analysis": {
    "analyzer": {
      "custom_text_german": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["standard", "lowercase"]
      }
    }
    ....
  }
}
Still missing from the filter chain: stop words, German normalization?

1. Declare schema filter for german stop_words
2. We'll also need to create a custom TokenFilterFactory class to wrap Lucene's
org.apache.lucene.analysis.de.GermanNormalizationFilter
a. It does not come as a pre-defined ES TokenFilter
b. German text needs certain character sequences normalized, e.g.
'ae' and 'oe' are replaced by 'a' and 'o', respectively.

3. Declare schema filter for custom GermanNormalizationFilter
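
To show the kind of rewriting step 2b refers to, here is a simplified, standalone sketch of German character normalization. It is modelled on the documented behaviour of Lucene's GermanNormalizationFilter as I understand it (sharp s becomes "ss"; umlauts lose their dots; 'ae' and 'oe' become 'a' and 'o'; 'ue' becomes 'u' unless it follows a vowel or 'q'). It is not the Lucene code, and it assumes already-lowercased input.

```java
final class GermanNormalizationSketch {
    private enum State { NORMAL, AFTER_VOWEL, AFTER_UMLAUTABLE }

    // Assumes lowercased input; uppercase letters pass through unchanged.
    static String normalize(String lowercased) {
        StringBuilder out = new StringBuilder(lowercased.length() + 2);
        State state = State.NORMAL;
        for (int i = 0; i < lowercased.length(); i++) {
            char c = lowercased.charAt(i);
            switch (c) {
                case 'a':
                case 'o':
                    out.append(c);
                    state = State.AFTER_UMLAUTABLE;
                    break;
                case 'u':
                    out.append(c);
                    // "ue" only counts as an umlaut spelling when 'u' does not follow a vowel or 'q'.
                    state = (state == State.NORMAL) ? State.AFTER_UMLAUTABLE : State.AFTER_VOWEL;
                    break;
                case 'e':
                    // Drop the 'e' of "ae", "oe", "ue"; keep every other 'e'.
                    if (state != State.AFTER_UMLAUTABLE) {
                        out.append(c);
                    }
                    state = State.AFTER_VOWEL;
                    break;
                case 'i':
                case 'q':
                case 'y':
                    out.append(c);
                    state = State.AFTER_VOWEL;
                    break;
                case '\u00e4': // a-umlaut
                    out.append('a');
                    state = State.AFTER_VOWEL;
                    break;
                case '\u00f6': // o-umlaut
                    out.append('o');
                    state = State.AFTER_VOWEL;
                    break;
                case '\u00fc': // u-umlaut
                    out.append('u');
                    state = State.AFTER_VOWEL;
                    break;
                case '\u00df': // sharp s
                    out.append("ss");
                    state = State.NORMAL;
                    break;
                default:
                    out.append(c);
                    state = State.NORMAL;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("ueber"));
    }
}
```

Both spellings of the same word ("ueber" and the form with u-umlaut) end up as the same term, which is the point of the filter. Since only lowercase characters are handled, a filter like this belongs after lowercasing in the analyzer chain.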

package org.elasticsearch.index.analysis;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.de.GermanNormalizationFilter;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.settings.IndexSettings;

/**
 * Exposes Lucene's GermanNormalizationFilter to ES so that it can be
 * referenced as a token filter "type" in the index settings.
 */
public class GermanNormalizationFilterFactory extends AbstractTokenFilterFactory {

    @Inject
    public GermanNormalizationFilterFactory(Index index, @IndexSettings Settings indexSettings,
            @Assisted String name, @Assisted Settings settings) {
        super(index, indexSettings, name, settings);
    }

    // Wraps the incoming token stream with Lucene's German normalization.
    @Override
    public TokenStream create(TokenStream tokenStream) {
        return new GermanNormalizationFilter(tokenStream);
    }
}

Define new token filters in our schema:
"settings" : {
  "analysis": {
    ....
    "filter": {
      "german_normalization": {
        "type": "org.elasticsearch.index.analysis.GermanNormalizationFilterFactory"
      },
      "german_stop": {
        "type": "stop",
        "stopwords": ["_german_"],
        "enable_position_increments": "true"
      }
    }
    ....
  }
}

Create a custom German analyzer:
"settings" : {
  ....
  "analysis": {
    "analyzer": {
      "custom_text_german": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["standard", "lowercase", "german_stop", "german_normalization"]
      }
    }
    ....
  }
}

OS Projects
Language Detection
● https://code.google.com/p/language-detection/
○ Written in Java
○ Provides language profiles with unigram, bigram, and trigram
character frequencies
○ Detector provides a probability score for each language detected

PROS
■ Very fast (~4k pieces of text per second)
■ Very reliable for text greater than 30-40 characters

CONS
■ Unreliable & inconsistent for small text samples (<30 characters) ... i.e.
short tweets
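
The character-frequency profiles mentioned above can be pictured with a small sketch that counts unigrams, bigrams, and trigrams in a piece of text. This is an illustration of the idea only, not the library's code or API; the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class NgramProfileSketch {
    // Counts every character n-gram of length 1..maxN in the text.
    static Map<String, Integer> countNgrams(String text, int maxN) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countNgrams("abab", 3));
    }
}
```

A detector compares counts like these against per-language profiles. A 20-character tweet yields only a few dozen n-grams, which helps explain why short samples are unreliable.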

OS Projects
German Word Decompounder
● https://github.com/jprante/elasticsearch-analysis-decompound

● Lucene offers two compound word token filters, a dictionary- &
hyphenation-based variant
○ The hyphenation grammars / word lists they need are not bundled with Lucene due to licensing issues
○ Require loading a word list in memory before they are run

● The decompounder uses prebuilt Compact Patricia Tries for efficient word
segmentation provided by the ASV toolbox
○ ASV Toolbox project - http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/index.htm
