Language Search
ElasticSearch Boston Meetup - 3/27
       Bryan Warner - Traackr
About me
● Bryan Warner - Developer @Traackr
  ○ bwarner@traackr.com

● I've worked with ElasticSearch since early 2012 ...
  before that I had worked with Lucene & Solr

● Primary background is in Java back-end development

● Shifting focus to Scala development over the past year
About Traackr
● Influencer search engine

● We track content daily & in real-time for our database of
  influential people

● We leverage ElasticSearch parent/child (top-children)
  queries to search content (i.e. the children) to surface
  the influencers who've authored it (i.e. the parents)

● Some of our back-end stack includes: ElasticSearch,
  MongoDB, Java/Spring, Scala/Akka, etc.
Overview
● Indexing / Querying strategies to support
  language-targeted searches within ES

● ES Analyzers / TokenFilters for language analysis

● Custom Analyzers / TokenFilters for ES

● A look at some open-source projects that assist in language
  detection & analysis
Use Case
● We have a database of articles written in many
  languages

● We want our users to be able to search articles written
  in a particular language

● We want that search to handle the nuances for that
  particular language
Reference Schema
{
    "settings" : {
      "index": {
        "number_of_shards" : 6, "number_of_replicas" : 1
      },
      "analysis":{
        "analyzer": {}, "tokenizer": {}, "filter":{}
      }
    },
    "mappings": {
      "article": {
        "properties": {
          "text" : {"type" : "string", "analyzer":"standard", "store":true},
          "author": {"type" : "string", "analyzer":"simple", "store": true},
          "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
        }
      }
    }
}
Indexing Strategies
Separate indices per language
- OR -
Same index for all languages
Indexing Strategies
Separate Indices per language
PROS
■ Clean separation
■ Truer IDF values
  ○ IDF = log(numDocs/(docFreq+1)) + 1

CONS
■ Increased Overhead
■ Parent/Child queries -> parent document duplication
   ○ Same problem for Solr Joins
■ Maintain schema per index
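
To make the IDF point concrete, here is a minimal sketch of the formula above in plain Java. The document counts are made-up numbers for illustration, and `IdfSketch` is a hypothetical name, not part of Lucene or ElasticSearch.

```java
// Sketch of the IDF formula from the slide: idf = log(numDocs / (docFreq + 1)) + 1
final class IdfSketch {

    // docFreq: number of documents containing the term; numDocs: documents in the index.
    static double idf(long docFreq, long numDocs) {
        return Math.log((double) numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 1,000 French articles, 100 of which contain the term,
        // stored either in their own index or next to 9,000 articles in other languages.
        double separate = idf(100, 1_000);
        double shared = idf(100, 10_000);
        System.out.printf("separate index: %.3f%n", separate);
        System.out.printf("shared index:   %.3f%n", shared);
    }
}
```

With these assumed counts the same term gets a higher IDF in the shared index, because `numDocs` also counts articles written in other languages.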
Indexing Strategies
Same index for all languages
PROS
■ One index to maintain (and one schema)
■ Parent/Child queries are fine

CONS
■ Schema complexity grows
■ IDF values might be skewed
Indexing Strategies
Same index for all languages ... how?
1. Create different "mapping" types per language
   a. At indexing time, we set the right mapping based on
      the article's language

2. Create different fields per language-analyzed field
   a. At indexing time, we populate the correct text field
      based on the article's language
"mappings": {
  "article_en": {
    "properties": {
      "text" : {"type" : "string", "analyzer":"english", "store":true},
      "author": {"type" : "string", "analyzer":"simple", "store": true},
      "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
    }
  },
  "article_fr": {
    "properties": {
      "text" : {"type" : "string", "analyzer":"french", "store":true},
      "author": {"type" : "string", "analyzer":"simple", "store": true},
      "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
    }
  },
  "article_de": {
    "properties": {
      "text" : {"type" : "string", "analyzer":"german", "store":true},
      "author": {"type" : "string", "analyzer":"simple", "store": true},
      "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
    }
  }
}
"mappings": {
  "article": {
    "properties": {
      "text_en" : {"type" : "string", "analyzer":"english", "store":true},
      "text_fr" : {"type" : "string", "analyzer":"french", "store":true},
      "text_de" : {"type" : "string", "analyzer":"german", "store":true},
      "author": {"type" : "string", "analyzer":"simple", "store": true},
      "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
    }
  }
}
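
Both variants need a small piece of indexing-time logic that turns an article's language into a mapping type or a field name. The following is a rough sketch of that step in plain Java, assuming two-letter language codes and the `article_<lang>` / `text_<lang>` naming used in the mappings above. `LanguageRouting` and its methods are hypothetical helpers, not ElasticSearch APIs.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class LanguageRouting {

    // Languages covered by the example mappings; extend as mappings are added.
    private static final Set<String> SUPPORTED = Set.of("en", "fr", "de");

    // Option 1: one mapping type per language, e.g. "article_fr".
    static String mappingTypeFor(String lang) {
        return "article_" + normalize(lang);
    }

    // Option 2: one text field per language, e.g. "text_fr".
    static String textFieldFor(String lang) {
        return "text_" + normalize(lang);
    }

    // Builds the document source for option 2, filling only the matching text field.
    static Map<String, Object> toSource(String lang, String text, String author, String date) {
        Map<String, Object> source = new LinkedHashMap<>();
        source.put(textFieldFor(lang), text);
        source.put("author", author);
        source.put("date", date);
        return source;
    }

    private static String normalize(String lang) {
        String code = lang == null ? "" : lang.trim().toLowerCase(Locale.ROOT);
        if (!SUPPORTED.contains(code)) {
            throw new IllegalArgumentException("Unsupported language: " + lang);
        }
        return code;
    }
}
```

Rejecting unknown languages up front is one possible design choice; falling back to a field analyzed with the standard analyzer would be another.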
Querying Strategies
How do we execute a language-targeted search?

... all based on our indexing strategy.
Querying Strategies
(1) Separate Indices per language
...
String targetIndex = getIndexForLanguage(languageParam);
SearchRequestBuilder request = client.prepareSearch(targetIndex)
       .setTypes("article");

QueryStringQueryBuilder query = QueryBuilders.queryString(
      "boston elasticsearch");
query.field("text");
query.analyzer("english"); // or "french" / "german", matching the target index

request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...
Querying Strategies
(2a) Same index for all languages - Different mappings
...
String targetMapping = getMappingForLanguage(languageParam);
SearchRequestBuilder request = client.prepareSearch("your_index")
       .setTypes(targetMapping);

QueryStringQueryBuilder query = QueryBuilders.queryString(
      "boston elasticsearch");
query.field("text");
query.analyzer("english"); // or "french" / "german", matching the target mapping

request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...
Querying Strategies
(2b) Same index for all languages - Different fields
...
SearchRequestBuilder request = client.prepareSearch("your_index")
     .setTypes("article");

QueryStringQueryBuilder query = QueryBuilders.queryString(
      "boston elasticsearch");
query.field("text_en");    // or "text_fr" / "text_de"
query.analyzer("english"); // or "french" / "german", matching the field

request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...
Querying Strategies
● Will these strategies support a multi-language search?
  ○ E.g. Search by French and German
  ○ E.g. Search against all languages

● Yes! *

● In the same SearchRequest:
   ○ We can search against multiple indices
   ○ We can search against multiple "mapping" types
   ○ We can search against multiple fields

* Need to give thought to which query analyzer to use
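
As a rough illustration of the "multiple fields" case, the sketch below builds a `query_string` request body that targets one `text_<lang>` field per requested language. It uses only the JDK. `MultiLanguageQuerySketch` is a hypothetical helper, and the field names assume the per-language field mapping shown earlier. The sketch deliberately sets no `analyzer`; how the query text is then analyzed for each field depends on your ElasticSearch version and mappings, so that is worth checking.

```java
import java.util.List;
import java.util.stream.Collectors;

final class MultiLanguageQuerySketch {

    // Builds a JSON body with a query_string query over text_<lang> for each language.
    static String queryStringBody(String userQuery, List<String> languages) {
        String fields = languages.stream()
                .map(lang -> "\"text_" + lang + "\"")
                .collect(Collectors.joining(", "));
        return "{\"query\": {\"query_string\": {"
                + "\"query\": \"" + escapeJson(userQuery) + "\", "
                + "\"fields\": [" + fields + "]}}}";
    }

    // Minimal escaping: backslashes and double quotes only.
    static String escapeJson(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        // Search French and German text in a single request.
        System.out.println(queryStringBody("boston elasticsearch", List.of("fr", "de")));
    }
}
```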
Language Analysis
● What do ElasticSearch and/or Lucene offer us for
  analyzing various languages?

● Is there a one-size-fits-all solution?
   ○ e.g. StandardAnalyzer

● Or do we need custom analyzers for each language?
Language Analysis
StandardAnalyzer - The Good
● For many languages (French, Spanish), it will get you
  95% of the way there

● Each language analyzer provides its own flavor to the
  StandardAnalyzer

● FrenchAnalyzer
  ○ Adds an ElisionFilter (l'avion -> avion)
  ○ Adds French StopWords filter
  ○ Adds a FrenchLightStemFilter
Language Analysis
StandardAnalyzer - The Bad
● For some languages, it will get you 2/3 of the way there

● German makes heavy use of compound words
     ■ das Vaterland => The fatherland
     ■ Rechtsanwaltskanzleien => Law Firms

● For best search results, these compound words should
  produce index terms for their individual parts

● GermanAnalyzer lacks a Word Compound Token Filter
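
To show what "index terms for their individual parts" could look like, here is a minimal dictionary-based sketch in plain Java. It is not Lucene's compound word filter and not the decompounder plugin: `CompoundSplitSketch`, the tiny word list, and the minimum part length are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class CompoundSplitSketch {

    // Returns the lowercased token followed by every dictionary word found inside it.
    static List<String> decompose(String token, Set<String> dictionary, int minPartLength) {
        String word = token.toLowerCase(Locale.ROOT);
        List<String> terms = new ArrayList<>();
        terms.add(word); // keep the full compound as an index term
        for (int start = 0; start < word.length(); start++) {
            for (int end = start + minPartLength; end <= word.length(); end++) {
                String part = word.substring(start, end);
                if (!part.equals(word) && dictionary.contains(part)) {
                    terms.add(part);
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Hypothetical mini dictionary; a real one would hold many thousands of words.
        Set<String> dictionary = Set.of("vater", "land", "recht", "anwalt", "kanzlei");
        System.out.println(decompose("Vaterland", dictionary, 4));
        System.out.println(decompose("Rechtsanwaltskanzleien", dictionary, 4));
    }
}
```

A search for "Anwalt" could then match an article that only contains "Rechtsanwaltskanzleien", which is the behavior the stock GermanAnalyzer does not provide.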
Language Analysis
StandardAnalyzer - The Ugly
● For other languages (e.g. Asian languages), it will not
  get you far

● Using a Standard Tokenizer to extract tokens from
  Chinese text will not produce accurate terms
  ○ Some 3rd-party Chinese analyzers will extract
     bigrams from Chinese text and index those as if they
     were words

● Need to do your research
Language Analysis
You should also know about...
● ASCII Folding Token Filter
  ○ über => uber

● ICU Analysis Plugin
   ○ http://www.elasticsearch.org/guide/reference/index-modules/analysis/icu-plugin.html
   ○ Allows for Unicode normalization, collation and
     folding
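
The folding idea can be approximated with the JDK alone, which may help when experimenting outside of ElasticSearch. The sketch below decomposes characters with `java.text.Normalizer` and strips the combining marks. It is an approximation, not the ASCII Folding Token Filter or the ICU plugin, and `FoldingSketch` is a hypothetical name.

```java
import java.text.Normalizer;

final class FoldingSketch {

    // Decomposes accented characters (NFD) and removes the combining marks.
    static String foldDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "\u00fcber" is "über" written with a Unicode escape.
        System.out.println(foldDiacritics("\u00fcber")); // prints "uber"
    }
}
```

Characters without a canonical decomposition, such as 'ß', pass through this sketch unchanged, so a dedicated folding filter will cover more cases.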
Custom Analyzer / Token Filter
● Let's create a custom analyzer definition for German
  text (e.g. remove stemming)

● How do we go about doing this?
   ○ One way is to leverage ElasticSearch's flexible
     schema definitions
Lucene 3.6 - org.apache.lucene.analysis.de.GermanAnalyzer
Custom Analyzer / Token Filter
Create a custom German analyzer in our schema:
"settings" : {
  ....
  "analysis":{
    "analyzer":{
       "custom_text_german":{
          "type": "custom",
           "tokenizer": "standard",
           "filter": ["standard", "lowercase"]   <-- plus stop words? German normalization?
       }
    }
    ....
  }
}
Custom Analyzer / Token Filter
1.   Declare a schema filter for German stop words
2.   We'll also need to create a custom TokenFilter class to wrap Lucene's
     org.apache.lucene.analysis.de.GermanNormalizationFilter
     a.   It does not come as a pre-defined ES TokenFilter
     b.   German text needs certain character sequences normalized, e.g.
          'ae' and 'oe' are replaced by 'a' and 'o', respectively.

3.   Declare a schema filter for the custom GermanNormalizationFilter
package org.elasticsearch.index.analysis;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.de.GermanNormalizationFilter;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.settings.IndexSettings;

public class GermanNormalizationFilterFactory extends AbstractTokenFilterFactory {
  @Inject
  public GermanNormalizationFilterFactory(Index index, @IndexSettings Settings indexSettings,
           @Assisted String name, @Assisted Settings settings) {
     super(index, indexSettings, name, settings);
  }
  @Override
  public TokenStream create(TokenStream tokenStream) {
     return new GermanNormalizationFilter(tokenStream);
  }
}
Custom Analyzer / Token Filter
Define new token filters in our schema:
"settings" : {
  "analysis":{
     ....
     "filter":{
       "german_normalization":{
          "type":"org.elasticsearch.index.analysis.GermanNormalizationFilterFactory"
       },
       "german_stop":{
          "type":"stop",
          "stopwords":["_german_"],
          "enable_position_increments":"true"
       }
     }
....
Custom Analyzer / Token Filter
Create a custom German analyzer:
"settings" : {
  ....
  "analysis":{
    "analyzer":{
       "custom_text_german":{
          "type":"custom",
           "tokenizer": "standard",
           "filter": ["german_normalization", "standard", "lowercase", "german_stop"]
       }
    }
    ....
  }
}
OS Projects
Language Detection
●   https://code.google.com/p/language-detection/
     ○ Written in Java
     ○ Provides language profiles with unigram, bigram, and trigram
         character frequencies
     ○ Detector provides accuracy % for each language detected

PROS
 ■ Very fast (~4k pieces of text per second)
 ■ Very reliable for text greater than 30-40 characters

CONS
 ■ Unreliable & inconsistent for small text samples (<30 characters) ... i.e.
   short tweets
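
For readers unfamiliar with character n-gram profiles, the sketch below counts unigrams, bigrams, and trigrams in a piece of text using only the JDK. It illustrates the kind of feature such profiles are built from and is not the language-detection library's code or API; `NgramProfileSketch` is a hypothetical name.

```java
import java.util.Map;
import java.util.TreeMap;

final class NgramProfileSketch {

    // Counts every character n-gram of length 1..maxN in the text.
    static Map<String, Integer> ngramCounts(String text, int maxN) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(ngramCounts("aba", 3));
    }
}
```

A very short text yields only a handful of n-grams, which gives some intuition for why detection gets unreliable below 30 characters or so.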
OS Projects
German Word Decompounder
●   https://github.com/jprante/elasticsearch-analysis-decompound

●   Lucene offers two compound word token filters, a dictionary- &
    hyphenation-based variant
     ○ Not bundled with Lucene due to licensing issues
     ○ Require loading a word list in memory before they are run

●   The decompounder uses prebuilt Compact Patricia Tries for efficient word
    segmentation provided by the ASV toolbox
     ○ ASV Toolbox project - http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/index.htm

Language Search

  • 1. Language Search ElasticSearch Boston Meetup - 3/27 Bryan Warner - Traackr
  • 2. About me ● Bryan Warner - Developer @Traackr ○ bwarner@traackr.com ● I've worked with ElasticSearch since early 2012 ... before that I had worked with Lucene & Solr ● Primary background is in Java back-end development ● Shifting focus into Scala development past year
  • 3. About Traackr ● Influencer search engine ● We track content daily & in real-time for our database of influential people ● We leverage ElasticSearch parent/child (top-children) queries to search content (i.e. the children) to surface the influencers who've authored it (i.e. the parents) ● Some of our back-end stack includes: ElasticSearch, MongoDb, Java/Spring, Scala/Akka, etc.
  • 4. Overview ● Indexing / Querying strategies to support language- targeted searches within ES ● ES Analyzers / TokenFilters for language analysis ● Custom Analyzers / TokenFilters for ES ● Look at some OS projects that assist in language detection & analysis
  • 5. Use Case ● We have a database of articles written in many languages ● We want our users to be able to search articles written in a particular language ● We want that search to handle the nuances for that particular language
  • 6. Reference Schema { "settings" : { "index": { "number_of_shards" : 6, "number_of_replicas" : 1 }, "analysis":{ "analyzer": {}, "tokenizer": {}, "filter":{} } }, "mappings": { "article": { "text" : {"type" : "string", "analyzer":"standard", "store":true}, "author:" {"type" : "string", "analyzer":"simple", "store": true}, "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true} } } }
  • 7. Indexing Strategies Separate indices per language - OR - Same index for all languages
  • 8. Indexing Strategies Separate Indices per language PROS ■ Clean separation ■ Truer IDF values ○ IDF = log(numDocs/(docFreq+1)) + 1 CONS ■ Increased Overhead ■ Parent/Child queries -> parent document duplication ○ Same problem for Solr Joins ■ Maintain schema per index
  • 9. Indexing Strategies Same index for all languages PROS ■ One index to maintain (and one schema) ■ Parent/Child queries are fine CONS ■ Schema complexity grows ■ IDF values might be skewed
  • 10. Indexing Strategies Same index for all languages ... how? 1. Create different "mapping" types per language a. At indexing time, we set the right mapping based on the article's language 2. Create different fields per language-analyzed field a. At indexing time, we populate the correct text field based on the article's language
  • 11. "mappings": { "article_en": { "text" : {"type" : "string", "analyzer":"english", "store":true}, "author:" {"type" : "string", "analyzer":"simple", "store": true} "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true} }, "article_fr": { "text" : {"type" : "string", "analyzer":"french", "store":true}, "author:" {"type" : "string", "analyzer":"simple", "store": true} "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true} }, "article_de": { "text" : {"type" : "string", "analyzer":"german", "store":true}, "author:" {"type" : "string", "analyzer":"simple", "store": true} "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true} } }
  • 12. "mappings": { "article": { "text_en" : {"type" : "string", "analyzer":"english", "store":true}, "text_fr" : {"type" : "string", "analyzer":"french", "store":true}, "text_de" : {"type" : "string", "analyzer":"german", "store":true}, "author:" {"type" : "string", "analyzer":"simple", "store": true} "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true} } }
  • 13. Querying Strategies How do we execute a language-targeted search? ... all based on our indexing strategy.
  • 14. Querying Strategies (1) Separate Indices per language ... String targetIndex = getIndexForLanguage(languageParam); SearchRequestBuilder request = client.prepareSearch(targetIndex) .setTypes("article"); QueryStringQueryBuilder query = QueryBuilders.queryString( "boston elasticsearch"); query.field("text"); query.analyzer(english|french|german); // pick one request.setQuery(query); SearchResponse searchResponse = request.execute().actionGet(); ...
  • 15. Querying Strategies (2a) Same index for language - Diff. mappings ... String targetMapping = getMappingForLanguage(languageParam); SearchRequestBuilder request = client.prepareSearch("your_index") .setTypes(targetMapping); QueryStringQueryBuilder query = QueryBuilders.queryString( "boston elasticsearch"); query.field("text"); query.analyzer(english|french|german); // pick one request.setQuery(query); SearchResponse searchResponse = request.execute().actionGet(); ...
  • 16. Querying Strategies (2b) Same index for language - Diff. fields ... SearchRequestBuilder request = client.prepareSearch("your_index") .setTypes("article"); QueryStringQueryBuilder query = QueryBuilders.queryString( "boston elasticsearch"); query.field(text_en|text_fr|text_de); // pick one query.analyzer(english|french|german); // pick one request.setQuery(query); SearchResponse searchResponse = request.execute().actionGet(); ...
  • 17. Querying Strategies ● Will these strategies support a multi-language search? ○ E.g. Search by french and german ○ E.g. Search against all languages ● Yes! * ● In the same SearchRequest: ○ We can search against multiple indices ○ We can search against multiple "mapping" types ○ We can search against multiple fields * Need to give thought which query analyzer to use
  • 18. Language Analysis
      ● What do ElasticSearch and Lucene offer us for analyzing various languages?
      ● Is there a one-size-fits-all solution?
        ○ e.g. StandardAnalyzer
      ● Or do we need custom analyzers for each language?
  • 19. Language Analysis: StandardAnalyzer - The Good
      ● For many languages (French, Spanish), it will get you 95% of the way there
      ● Each language analyzer adds its own flavor on top of the StandardAnalyzer
      ● FrenchAnalyzer
        ○ Adds an ElisionFilter (l'avion -> avion)
        ○ Adds a French stop-words filter
        ○ Adds a FrenchLightStemFilter
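To make the elision step concrete, here is a small sketch of what an elision filter does to a single token. It is not Lucene's ElisionFilter; the class name ElisionSketch and the article list are illustrative assumptions, and the list is only a subset of the contractions a real French analyzer handles.

```java
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch of French elision stripping; not Lucene's ElisionFilter. */
final class ElisionSketch {
    // Assumed, partial list of elided prefixes (l', d', j', qu', ...).
    private static final Set<String> ARTICLES =
            Set.of("l", "d", "j", "m", "t", "s", "n", "c", "qu");

    /** Removes a leading elided article such as "l'" from a token, if present. */
    static String stripElision(String token) {
        int apos = indexOfApostrophe(token);
        if (apos > 0) {
            String prefix = token.substring(0, apos).toLowerCase(Locale.ROOT);
            if (ARTICLES.contains(prefix)) {
                return token.substring(apos + 1);
            }
        }
        return token;
    }

    // Accepts both the ASCII apostrophe and the typographic one (U+2019).
    private static int indexOfApostrophe(String token) {
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == '\'' || c == '\u2019') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(stripElision("l'avion"));
    }
}
```

Because only known prefixes are stripped, a word like "aujourd'hui" passes through unchanged.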
  • 20. Language Analysis: StandardAnalyzer - The Bad
      ● For some languages, it will get you 2/3 of the way there
      ● German makes heavy use of compound words
        ■ das Vaterland => the fatherland
        ■ Rechtsanwaltskanzleien => law firms
      ● For best search results, these compound words should produce index terms for their individual parts
      ● GermanAnalyzer lacks a compound-word token filter
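The idea of producing index terms for the parts of a compound can be sketched with a small dictionary-based splitter. This is a toy illustration under stated assumptions, not the algorithm used by Lucene's compound-word filters or by any plugin: the class name DecompoundSketch is hypothetical, the dictionary is supplied by the caller, and the only linking element handled is the German "s" between parts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy dictionary-based decompounder; an illustration only. */
final class DecompoundSketch {

    /** Splits a compound into dictionary words, or returns the lowercased word if no split is found. */
    static List<String> decompound(String word, Set<String> dictionary) {
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        return split(lower, 0, dictionary, parts) ? parts : List.of(lower);
    }

    // Backtracking search: try the longest dictionary word at each position first.
    private static boolean split(String w, int start, Set<String> dict, List<String> parts) {
        if (start == w.length()) {
            return !parts.isEmpty();
        }
        for (int end = w.length(); end > start; end--) {
            String candidate = w.substring(start, end);
            if (!dict.contains(candidate)) {
                continue;
            }
            parts.add(candidate);
            if (split(w, end, dict, parts)) {
                return true;
            }
            // Allow a linking "s" between parts, as in Recht-s-anwalt.
            if (end < w.length() && w.charAt(end) == 's' && split(w, end + 1, dict, parts)) {
                return true;
            }
            parts.remove(parts.size() - 1); // backtrack
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("recht", "anwalt", "kanzleien", "vater", "land");
        System.out.println(decompound("Rechtsanwaltskanzleien", dict));
        System.out.println(decompound("Vaterland", dict));
    }
}
```

In an index, the parts would typically be emitted alongside the original compound so that a search for either form matches.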
  • 21. Language Analysis: StandardAnalyzer - The Ugly
      ● For other languages (e.g. Asian languages), it will not get you far
      ● Using a Standard Tokenizer to extract tokens from Chinese text will not produce accurate terms
        ○ Some 3rd-party Chinese analyzers will extract bigrams from Chinese text and index those as if they were words
      ● Need to do your research
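The bigram approach mentioned above can be sketched in a few lines: every pair of adjacent characters becomes an index term. For example, a four-character Chinese string yields three overlapping bigrams. The class name BigramSketch is hypothetical, and the sketch ignores details a real analyzer would handle, such as punctuation and mixed scripts.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative overlapping character-bigram extraction; not a production analyzer. */
final class BigramSketch {

    /** Returns overlapping bigrams of the non-whitespace code points in the text. */
    static List<String> bigrams(String text) {
        // Work on code points so characters outside the BMP are not split in half.
        int[] cps = text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .toArray();
        List<String> out = new ArrayList<>();
        if (cps.length == 1) {
            out.add(new String(cps, 0, 1)); // a lone character is kept as its own term
            return out;
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Four Chinese characters written as Unicode escapes (U+6211 U+7231 U+5317 U+4EAC).
        System.out.println(bigrams("\u6211\u7231\u5317\u4EAC"));
    }
}
```

Bigrams trade precision for recall: they need no dictionary, but they also index character pairs that are not real words.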
  • 22. Language Analysis: You should also know about...
      ● ASCII Folding Token Filter
        ○ über => uber
      ● ICU Analysis Plugin
        ○ http://www.elasticsearch.org/guide/reference/index-modules/analysis/icu-plugin.html
        ○ Allows for Unicode normalization, collation, and folding
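The effect of folding can be approximated with the JDK alone: decompose each character and drop the combining marks. This is only a sketch of the idea, not Lucene's ASCII folding filter, which uses its own mapping table and covers characters that Unicode decomposition does not (for example, this sketch leaves "ß" untouched). The class name FoldingSketch is hypothetical.

```java
import java.text.Normalizer;

/** Approximates ASCII folding via Unicode decomposition; an illustration only. */
final class FoldingSketch {

    /** Decomposes accented characters (NFD) and removes the combining marks. */
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "\u00FCber" is "über" written with a Unicode escape.
        System.out.println(fold("\u00FCber"));
    }
}
```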
  • 23. Custom Analyzer / Token Filter
      ● Let's create a custom analyzer definition for German text (e.g. one without stemming)
      ● How do we go about doing this?
        ○ One way is to leverage ElasticSearch's flexible schema definitions
  • 24. Lucene 3.6 - org.apache.lucene.analysis.de.GermanAnalyzer
  • 25. Custom Analyzer / Token Filter
      Create a custom German analyzer in our schema:
      "settings" : {
        ....
        "analysis": {
          "analyzer": {
            "custom_text_german": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["standard", "lowercase"], stop words, german normalization?
            }
          }
          ....
        }
      }
  • 26. Custom Analyzer / Token Filter
      1. Declare a schema filter for German stop words
      2. We'll also need to create a custom TokenFilter class to wrap Lucene's org.apache.lucene.analysis.de.GermanNormalizationFilter
         a. It does not come as a pre-defined ES TokenFilter
         b. German text needs to be normalized on certain characters ... e.g. 'ae' and 'oe' are replaced by 'a' and 'o', respectively.
      3. Declare a schema filter for the custom GermanNormalizationFilter
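To illustrate the kind of normalization described in step 2b, here is a simplified sketch written from scratch. It is not Lucene's GermanNormalizationFilter. Beyond the 'ae'/'oe' rule stated above, the umlaut, 'ß', and 'ue' rules reflect my understanding of the filter's documented behavior and should be treated as assumptions; the class name GermanNormalizationSketch is hypothetical, and the input is assumed to be lowercased already.

```java
/** Simplified German normalization on lowercase tokens; not Lucene's implementation. */
final class GermanNormalizationSketch {

    static String normalize(String token) {
        StringBuilder out = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            char prev = i > 0 ? token.charAt(i - 1) : '\0';
            char next = i + 1 < token.length() ? token.charAt(i + 1) : '\0';
            switch (c) {
                case '\u00E4': out.append('a'); break;   // a-umlaut -> a
                case '\u00F6': out.append('o'); break;   // o-umlaut -> o
                case '\u00FC': out.append('u'); break;   // u-umlaut -> u
                case '\u00DF': out.append("ss"); break;  // sharp s -> ss
                case 'a':
                case 'o':
                    out.append(c);
                    if (next == 'e') i++;                // 'ae' -> 'a', 'oe' -> 'o'
                    break;
                case 'u':
                    out.append(c);
                    // 'ue' -> 'u', except after a vowel or 'q' (assumed rule)
                    if (next == 'e' && !isVowelOrQ(prev)) i++;
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    private static boolean isVowelOrQ(char c) {
        return "aeiouq".indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(normalize("ueber"));
        System.out.println(normalize("stra\u00DFe"));
    }
}
```

With rules like these, the spellings "über" and "ueber" end up as the same index term.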
  • 27. package org.elasticsearch.index.analysis;

      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.de.GermanNormalizationFilter;
      import org.elasticsearch.common.inject.Inject;
      import org.elasticsearch.common.inject.assistedinject.Assisted;
      import org.elasticsearch.common.settings.Settings;
      import org.elasticsearch.index.Index;
      import org.elasticsearch.index.settings.IndexSettings;

      public class GermanNormalizationFilterFactory extends AbstractTokenFilterFactory {

          @Inject
          public GermanNormalizationFilterFactory(Index index,
                                                  @IndexSettings Settings indexSettings,
                                                  @Assisted String name,
                                                  @Assisted Settings settings) {
              super(index, indexSettings, name, settings);
          }

          @Override
          public TokenStream create(TokenStream tokenStream) {
              return new GermanNormalizationFilter(tokenStream);
          }
      }
  • 28. Custom Analyzer / Token Filter
      Define new token filters in our schema:
      "settings" : {
        "analysis": {
          ....
          "filter": {
            "german_normalization": {
              "type": "org.elasticsearch.index.analysis.GermanNormalizationFilterFactory"
            },
            "german_stop": {
              "type": "stop",
              "stopwords": ["_german_"],
              "enable_position_increments": "true"
            }
          }
          ....
  • 29. Custom Analyzer / Token Filter
      Create a custom German analyzer:
      "settings" : {
        ....
        "analysis": {
          "analyzer": {
            "custom_text_german": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["german_normalization", "standard", "lowercase", "german_stop"]
            }
          }
          ....
        }
      }
  • 30. OS Projects: Language Detection
      ● https://code.google.com/p/language-detection/
        ○ Written in Java
        ○ Provides language profiles with unigram, bigram, and trigram character frequencies
        ○ Detector provides accuracy % for each language detected
      PROS
        ■ Very fast (~4k pieces of text per second)
        ■ Very reliable for text greater than 30-40 characters
      CONS
        ■ Unreliable & inconsistent for small text samples (<30 characters) ... i.e. short tweets
  • 31. OS Projects: German Word Decompounder
      ● https://github.com/jprante/elasticsearch-analysis-decompound
      ● Lucene offers two compound word token filters, a dictionary- & hyphenation-based variant
        ○ Not bundled with Lucene due to licensing issues
        ○ Require loading a word list in memory before they are run
      ● The decompounder uses prebuilt Compact Patricia Tries for efficient word segmentation provided by the ASV toolbox
        ○ ASV Toolbox project - http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/index.htm