
Language Search

Presentation for Boston ElasticSearch Meetup Group 3/27



  1. Language Search
     ElasticSearch Boston Meetup - 3/27
     Bryan Warner - Traackr
  2. About me
     ● Bryan Warner - Developer @Traackr
     ● I've worked with ElasticSearch since early 2012 ... before that I had worked with Lucene & Solr
     ● Primary background is in Java back-end development
     ● Shifting focus into Scala development over the past year
  3. About Traackr
     ● Influencer search engine
     ● We track content daily & in real-time for our database of influential people
     ● We leverage ElasticSearch parent/child (top-children) queries to search content (i.e. the children) to surface the influencers who've authored it (i.e. the parents)
     ● Some of our back-end stack: ElasticSearch, MongoDB, Java/Spring, Scala/Akka, etc.
  4. Overview
     ● Indexing / querying strategies to support language-targeted searches within ES
     ● ES Analyzers / TokenFilters for language analysis
     ● Custom Analyzers / TokenFilters for ES
     ● A look at some open-source projects that assist in language detection & analysis
  5. Use Case
     ● We have a database of articles written in many languages
     ● We want our users to be able to search articles written in a particular language
     ● We want that search to handle the nuances of that particular language
  6. Reference Schema

     {
       "settings": {
         "index": { "number_of_shards": 6, "number_of_replicas": 1 },
         "analysis": {
           "analyzer": {},
           "tokenizer": {},
           "filter": {}
         }
       },
       "mappings": {
         "article": {
           "text":   { "type": "string", "analyzer": "standard", "store": true },
           "author": { "type": "string", "analyzer": "simple", "store": true },
           "date":   { "type": "date", "format": "yyyy-MM-dd'T'HH:mm:ssZ", "store": true }
         }
       }
     }
  7. Indexing Strategies
     Separate indices per language - OR - Same index for all languages
  8. Indexing Strategies
     Separate indices per language
     PROS
     ■ Clean separation
     ■ Truer IDF values
       ○ IDF = log(numDocs / (docFreq + 1)) + 1
     CONS
     ■ Increased overhead
     ■ Parent/child queries -> parent document duplication
       ○ Same problem for Solr joins
     ■ Must maintain a schema per index
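To see why per-language indices give "truer" IDF values, here is a minimal sketch of the classic Lucene IDF formula from the slide, with hypothetical document counts (10k single-language docs vs. a 100k merged index):

```java
// Classic Lucene IDF, as on the slide: idf = log(numDocs / (docFreq + 1)) + 1
public final class IdfDemo {
    public static double idf(long numDocs, long docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: a term that appears in 100 French docs.
        double perLanguage = idf(10_000, 100);   // French-only index
        double mergedIndex = idf(100_000, 100);  // all languages together
        // Merging inflates the term's IDF: numDocs grew, but docFreq
        // (which is bound to one language) stayed the same.
        System.out.println(perLanguage < mergedIndex); // prints "true"
    }
}
```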
  9. Indexing Strategies
     Same index for all languages
     PROS
     ■ One index to maintain (and one schema)
     ■ Parent/child queries are fine
     CONS
     ■ Schema complexity grows
     ■ IDF values might be skewed
 10. Indexing Strategies
     Same index for all languages ... how?
     1. Create different "mapping" types per language
        a. At indexing time, we pick the right mapping based on the article's language
     2. Create different fields per language-analyzed field
        a. At indexing time, we populate the correct text field based on the article's language
 11. "mappings": {
       "article_en": {
         "text":   { "type": "string", "analyzer": "english", "store": true },
         "author": { "type": "string", "analyzer": "simple", "store": true },
         "date":   { "type": "date", "format": "yyyy-MM-dd'T'HH:mm:ssZ", "store": true }
       },
       "article_fr": {
         "text":   { "type": "string", "analyzer": "french", "store": true },
         "author": { "type": "string", "analyzer": "simple", "store": true },
         "date":   { "type": "date", "format": "yyyy-MM-dd'T'HH:mm:ssZ", "store": true }
       },
       "article_de": {
         "text":   { "type": "string", "analyzer": "german", "store": true },
         "author": { "type": "string", "analyzer": "simple", "store": true },
         "date":   { "type": "date", "format": "yyyy-MM-dd'T'HH:mm:ssZ", "store": true }
       }
     }
 12. "mappings": {
       "article": {
         "text_en": { "type": "string", "analyzer": "english", "store": true },
         "text_fr": { "type": "string", "analyzer": "french", "store": true },
         "text_de": { "type": "string", "analyzer": "german", "store": true },
         "author":  { "type": "string", "analyzer": "simple", "store": true },
         "date":    { "type": "date", "format": "yyyy-MM-dd'T'HH:mm:ssZ", "store": true }
       }
     }
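At indexing time, option 2 needs a small routing step from a detected language code to the right field name. A minimal sketch, assuming a hypothetical helper (the class and its fallback-to-English choice are illustrative, not from the deck):

```java
import java.util.Map;

// Hypothetical indexing-time helper for the field-per-language strategy:
// choose which text_* field to populate from a detected language code.
public final class LanguageFieldRouter {
    private static final Map<String, String> FIELD_BY_LANG = Map.of(
            "en", "text_en",
            "fr", "text_fr",
            "de", "text_de");

    // Fall back to the English-analyzed field for unknown languages
    // (an assumption; a real system might use a plain standard-analyzed field).
    public static String fieldForLanguage(String langCode) {
        return FIELD_BY_LANG.getOrDefault(langCode, "text_en");
    }
}
```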
 13. Querying Strategies
     How do we execute a language-targeted search?
     ... it all depends on our indexing strategy.
 14. Querying Strategies
     (1) Separate indices per language

     ...
     String targetIndex = getIndexForLanguage(languageParam);
     SearchRequestBuilder request = client.prepareSearch(targetIndex)
         .setTypes("article");
     QueryStringQueryBuilder query = QueryBuilders.queryString(
         "boston elasticsearch");
     query.field("text");
     query.analyzer("english");  // or "french" / "german" - match the target index
     request.setQuery(query);
     SearchResponse searchResponse = request.execute().actionGet();
     ...
 15. Querying Strategies
     (2a) Same index for all languages - different mappings

     ...
     String targetMapping = getMappingForLanguage(languageParam);
     SearchRequestBuilder request = client.prepareSearch("your_index")
         .setTypes(targetMapping);
     QueryStringQueryBuilder query = QueryBuilders.queryString(
         "boston elasticsearch");
     query.field("text");
     query.analyzer("english");  // or "french" / "german" - match the target mapping
     request.setQuery(query);
     SearchResponse searchResponse = request.execute().actionGet();
     ...
 16. Querying Strategies
     (2b) Same index for all languages - different fields

     ...
     SearchRequestBuilder request = client.prepareSearch("your_index")
         .setTypes("article");
     QueryStringQueryBuilder query = QueryBuilders.queryString(
         "boston elasticsearch");
     query.field("text_en");     // or "text_fr" / "text_de" - pick one
     query.analyzer("english");  // or "french" / "german" - match the field
     request.setQuery(query);
     SearchResponse searchResponse = request.execute().actionGet();
     ...
 17. Querying Strategies
     ● Will these strategies support a multi-language search?
       ○ E.g. search by French and German
       ○ E.g. search against all languages
     ● Yes! *
     ● In the same SearchRequest:
       ○ We can search against multiple indices
       ○ We can search against multiple "mapping" types
       ○ We can search against multiple fields
     * Need to give thought to which query analyzer to use
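As a sketch of the multiple-fields case in REST form (field names from the field-per-language mapping shown earlier; the specific query body is illustrative), a `query_string` query can list several language fields at once. Note a single query analyzer applies to the whole query string, which is exactly the asterisked caveat:

```json
{
  "query": {
    "query_string": {
      "query": "boston elasticsearch",
      "fields": ["text_fr", "text_de"],
      "analyzer": "standard"
    }
  }
}
```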
 18. Language Analysis
     ● What do ElasticSearch and/or Lucene offer us for analyzing various languages?
     ● Is there a one-size-fits-all solution?
       ○ e.g. StandardAnalyzer
     ● Or do we need custom analyzers for each language?
 19. Language Analysis
     StandardAnalyzer - The Good
     ● For many languages (French, Spanish), it will get you 95% of the way there
     ● Each language analyzer adds its own flavor on top of the StandardAnalyzer
     ● FrenchAnalyzer
       ○ Adds an ElisionFilter (l'avion -> avion)
       ○ Adds a French stop-words filter
       ○ Adds a FrenchLightStemFilter
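For illustration, the same elision behavior can be declared directly in an ES schema via the built-in `elision` token filter. A sketch, assuming a plausible (not necessarily complete) list of French articles:

```json
"filter": {
  "french_elision": {
    "type": "elision",
    "articles": ["l", "m", "t", "qu", "n", "s", "j"]
  }
}
```

Added to an analyzer's filter chain, this strips the apostrophized article so "l'avion" indexes as "avion".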
 20. Language Analysis
     StandardAnalyzer - The Bad
     ● For some languages, it will only get you 2/3 of the way there
     ● German makes heavy use of compound words
       ■ das Vaterland => The fatherland
       ■ Rechtsanwaltskanzleien => Law firms
     ● For best search results, these compound words should produce index terms for their individual parts
     ● GermanAnalyzer lacks a word-compound token filter
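ES does expose Lucene's dictionary-based compound filter as the `dictionary_decompounder` token filter, which can fill this gap in the schema. A toy sketch; the three-entry `word_list` stands in for a real German dictionary, which you must supply yourself:

```json
"filter": {
  "german_decompound": {
    "type": "dictionary_decompounder",
    "word_list": ["recht", "anwalt", "kanzlei"]
  }
}
```

In production the dictionary is typically loaded from a file rather than listed inline, since it can contain many thousands of entries.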
 21. Language Analysis
     StandardAnalyzer - The Ugly
     ● For other languages (e.g. Asian languages), it will not get you far
     ● Using a standard tokenizer to extract tokens from Chinese text will not produce accurate terms
       ○ Some 3rd-party Chinese analyzers will extract bigrams from Chinese text and index those as if they were words
     ● Need to do your research
 22. Language Analysis
     You should also know about ...
     ● ASCII Folding Token Filter
       ○ über => uber
     ● ICU Analysis Plugin
       ○ modules/analysis/icu-plugin.html
       ○ Allows for Unicode normalization, collation, and folding
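The ASCII folding filter is available as the built-in `asciifolding` filter type, so it can be dropped straight into a custom analyzer. A minimal sketch (analyzer name is illustrative):

```json
"analyzer": {
  "folded_text": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": ["lowercase", "asciifolding"]
  }
}
```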
 23. Custom Analyzer / Token Filter
     ● Let's create a custom analyzer definition for German text (e.g. one that removes stemming)
     ● How do we go about doing this?
       ○ One way is to leverage ElasticSearch's flexible schema definitions
 24. Lucene 3.6 -
 25. Custom Analyzer / Token Filter
     Create a custom German analyzer in our schema:

     "settings": {
       ....
       "analysis": {
         "analyzer": {
           "custom_text_german": {
             "type": "custom",
             "tokenizer": "standard",
             "filter": ["standard", "lowercase"]   <- stop words? german normalization?
           }
         }
       }
       ....
     }
 26. Custom Analyzer / Token Filter
     1. Declare a schema filter for German stop words
     2. We'll also need to create a custom TokenFilter class to wrap Lucene's GermanNormalizationFilter
        a. It does not come as a pre-defined ES TokenFilter
        b. German text needs to normalize certain characters ... e.g. 'ae' and 'oe' are replaced by 'a' and 'o', respectively
     3. Declare a schema filter for the custom GermanNormalizationFilter
 27. package org.elasticsearch.index.analysis;

     import org.apache.lucene.analysis.TokenStream;
     import org.elasticsearch.common.inject.Inject;
     import org.elasticsearch.common.inject.assistedinject.Assisted;
     import org.elasticsearch.common.settings.Settings;
     import org.elasticsearch.index.Index;
     import org.elasticsearch.index.settings.IndexSettings;

     public class GermanNormalizationFilterFactory extends AbstractTokenFilterFactory {

         @Inject
         public GermanNormalizationFilterFactory(Index index, @IndexSettings Settings indexSettings,
                 @Assisted String name, @Assisted Settings settings) {
             super(index, indexSettings, name, settings);
         }

         @Override
         public TokenStream create(TokenStream tokenStream) {
             return new GermanNormalizationFilter(tokenStream);
         }
     }
 28. Custom Analyzer / Token Filter
     Define new token filters in our schema:

     "settings": {
       "analysis": {
         ....
         "filter": {
           "german_normalization": {
             "type": "org.elasticsearch.index.analysis.GermanNormalizationFilterFactory"
           },
           "german_stop": {
             "type": "stop",
             "stopwords": ["_german_"],
             "enable_position_increments": "true"
           }
         }
         ....
 29. Custom Analyzer / Token Filter
     Create a custom German analyzer:

     "settings": {
       ....
       "analysis": {
         "analyzer": {
           "custom_text_german": {
             "type": "custom",
             "tokenizer": "standard",
             "filter": ["german_normalization", "standard", "lowercase", "german_stop"]
           }
         }
       }
       ....
     }
 30. OS Projects
     Language Detection
     ● Written in Java
     ● Provides language profiles with unigram, bigram, and trigram character frequencies
     ● Detector provides an accuracy % for each language detected
     PROS
     ■ Very fast (~4k pieces of text per second)
     ■ Very reliable for text longer than 30-40 characters
     CONS
     ■ Unreliable & inconsistent for small text samples (<30 characters) ... i.e. short tweets
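The character n-gram profiles such detectors build are simple to picture. A toy sketch of extracting the n-grams themselves (the class is illustrative, not the library's API; a real detector would also count frequencies per language and compare profiles probabilistically):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch: slide character n-grams across a string, as used in
// the "unigram, bigram, and trigram character frequency" profiles.
public final class CharNgrams {
    public static List<String> ngrams(String text, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }
}
```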
 31. OS Projects
     German Word Decompounder
     ● Lucene offers two compound-word token filters: a dictionary-based and a hyphenation-based variant
       ○ Not bundled with Lucene due to licensing issues
       ○ Both require loading a word list into memory before they are run
     ● The decompounder uses prebuilt Compact Patricia Tries for efficient word segmentation, provided by the ASV toolbox
       ○ ASV Toolbox project - http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/index.htm