Itamar Syn-Hershko
http://code972.com
@synhershko
The ultimate guide for Elasticsearch plugins
Agenda
• Integration points & plugin types
• Showcases
• Gotchas
• When-to, How-to
• Q & A
Elasticsearch Server
REST API
Querying: Query parser, Analysis chain, Search
Indexing: Make Lucene document, Analysis chain, Perform indexing
Lucene Index
Lucene extension points
Analysis chain
Search
Query parser
Lucene Index
Perform indexing
Harry Potter and the Goblet of Fire
Step 1: Tokenization
Tokenizer: [Harry] [Potter] [and] [the] [Goblet] [of] [Fire]
Step 2: Filtering
Lower case filter: [harry] [potter] [and] [the] [goblet] [of] [fire]
Stop-words filter: [harry] [potter] [goblet] [fire]
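The two steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Lucene code: the regex tokenizer and the three-word stop list are assumptions made for the example, and real analyzers handle many more cases.

```python
import re

# Assumption: a tiny illustrative stop list, not Lucene's default set.
STOP_WORDS = {"and", "the", "of"}

def tokenize(text):
    """Step 1: split the text into word tokens."""
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    """Step 2a: normalise case so 'Potter' and 'potter' become the same term."""
    return [t.lower() for t in tokens]

def stop_filter(tokens):
    """Step 2b: drop very common words."""
    return [t for t in tokens if t not in STOP_WORDS]

def analyze(text):
    """Run the tokenizer, then each filter in order."""
    tokens = tokenize(text)
    for token_filter in (lowercase_filter, stop_filter):
        tokens = token_filter(tokens)
    return tokens

print(analyze("Harry Potter and the Goblet of Fire"))
# -> ['harry', 'potter', 'goblet', 'fire']
```

The order of the filters matters: the stop list here is lower case, so it only works because lower-casing runs first.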
Welcome to Malmö!
Step 1: Tokenization
Tokenizer: [Welcome] [to] [Malmö]
Step 2: Filtering
ASCII folding filter: [Welcome] [to] [Malmo]
Lowercase filter: [welcome] [to] [malmo]
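A rough approximation of the folding step, assuming that stripping Unicode combining marks is good enough for the example. Lucene's ASCII folding filter uses its own character mapping table, so treat this as a sketch of the effect rather than of the implementation.

```python
import re
import unicodedata

def ascii_fold(token):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def analyze_folded(text):
    """Tokenize, fold to ASCII, then lower-case (same order as the slide)."""
    tokens = re.findall(r"\w+", text)
    tokens = [ascii_fold(t) for t in tokens]
    return [t.lower() for t in tokens]

print(analyze_folded("Welcome to Malmö!"))
# -> ['welcome', 'to', 'malmo']
```

With this chain a user typing "malmo" on a keyboard without "ö" still finds the document.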
Indexing: Harry Potter and the Goblet of Fire
Tokenizer: [Harry] [Potter] [and] [the] [Goblet] [of] [Fire]
Lower case filter: [harry] [potter] [and] [the] [goblet] [of] [fire]
Stop-words filter: [harry] [potter] [goblet] [fire]
Query: Potter
Tokenizer: [Potter]
Lower case filter: [potter]
Stop-words filter: [potter]
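Why the query has to go through the same chain as the document can be shown with a toy inverted index. Everything here (the `build_index` and `term_query` helpers, the stop list) is hypothetical and exists only for the illustration.

```python
import re
from collections import defaultdict

STOP_WORDS = {"and", "the", "of"}  # assumption: illustrative stop list

def analyze(text):
    """Tokenize, lower-case and remove stop words."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    return [t for t in tokens if t not in STOP_WORDS]

def build_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def term_query(index, term):
    """Exact term lookup: the most basic building block of a query."""
    return index.get(term, set())

docs = {1: "Harry Potter and the Goblet of Fire"}
index = build_index(docs)

raw_hits = term_query(index, "Potter")                    # not analyzed
analyzed_hits = term_query(index, analyze("Potter")[0])   # analyzed like the document

print(raw_hits)       # -> set()
print(analyzed_hits)  # -> {1}
```

The raw term misses because the index only contains "potter"; the analyzed term matches.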
itamar@code972.com
Step 1: Tokenization
Tokenizer: [itamar] [code] [972] [com]
Step 2: Filtering
Lower case filter: [itamar] [code] [972] [com]
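The choice of split characters decides which searches can succeed. The sketch below compares two hypothetical tokenizers on the email address: one that splits on whitespace only, and one that splits into runs of letters and runs of digits, which reproduces the tokens shown above. Neither is a Lucene class.

```python
import re

def whitespace_tokenize(text):
    """Split on whitespace only: the address stays one token."""
    return text.split()

def letter_digit_tokenize(text):
    """Split into runs of letters and runs of digits."""
    return re.findall(r"[A-Za-z]+|[0-9]+", text)

address = "itamar@code972.com"

print(whitespace_tokenize(address))    # -> ['itamar@code972.com']
print(letter_digit_tokenize(address))  # -> ['itamar', 'code', '972', 'com']
```

With the first tokenizer a search for "itamar" finds nothing; with the second, a search for the full address only works if the query is split the same way. Which trade-off is right depends on how users are expected to search.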
Try searching on German compound words…
Analyzers
The quick brown fox jumped over the lazy dog, bob@hotmail.com 123432.
StandardAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dog] [bob@hotmail.com] [123432]
StopAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dog] [bob] [hotmail] [com]
SimpleAnalyzer:
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog] [bob] [hotmail] [com]
WhitespaceAnalyzer:
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog,] [bob@hotmail.com] [123432.]
KeywordAnalyzer:
[The quick brown fox jumped over the lazy dog, bob@hotmail.com 123432.]
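Four of these outputs can be approximated in plain Python, which makes the differences easier to see. These functions only mimic the behaviour on this one sentence; the stop list is an assumption, and StandardAnalyzer is left out because its grammar-based tokenizer does not reduce to a one-line rule.

```python
import re

TEXT = "The quick brown fox jumped over the lazy dog, bob@hotmail.com 123432."

STOP_WORDS = {"the"}  # assumption: only the stop word that occurs in TEXT

def whitespace_analyzer(text):
    """Split on whitespace; keep case and punctuation."""
    return text.split()

def simple_analyzer(text):
    """Split on anything that is not a letter, then lower-case."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def stop_analyzer(text):
    """Simple analyzer plus stop-word removal."""
    return [t for t in simple_analyzer(text) if t not in STOP_WORDS]

def keyword_analyzer(text):
    """Treat the entire input as a single token."""
    return [text]

for analyzer in (whitespace_analyzer, simple_analyzer, stop_analyzer, keyword_analyzer):
    print(analyzer.__name__, analyzer(TEXT))
```

Note how the letter-only analyzers drop the number entirely and break the email address into three terms.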
Custom analyzers from code
New in Elasticsearch v1.1.0
Showcase: Custom Analyzer - Hebrew
analysis plugin for Elasticsearch
• https://github.com/synhershko/elasticsearch-analysis-hebrew
• Available on QBox.io
Lucene extension points
Analysis chain
Search
Query parser
Lucene Index
Perform indexing
Scripting
• Sorting, filters, facets, script fields, custom scoring, aggregations, document updates
• MVEL, but others are supported
• Generally speaking: SLOOOOOOOW
• Mostly useful as quick mocks / PoC
• Native scripts using Java by implementing AbstractExecutableScript & AbstractSearchScript
Custom scoring & similarity
• Function score query
– Previously known as Custom Score Query
• Similarity
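A similarity is the formula that turns term statistics into a score. As a rough illustration, the sketch below puts a simplified textbook TF-IDF next to the usual BM25 formula. The function names and defaults (`k1=1.2`, `b=0.75`) are chosen for the example, and Lucene's actual implementations include further factors such as norms and boosts.

```python
import math

def simple_tf_idf(tf, doc_freq, num_docs):
    """Simplified TF-IDF: dampened term frequency times inverse document frequency."""
    return math.sqrt(tf) * (1.0 + math.log(num_docs / (doc_freq + 1)))

def bm25(tf, doc_freq, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 score of one term in one document."""
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term frequency saturates: repeating a term gives diminishing returns.
    # Long documents are penalised relative to the average length.
    norm = tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))
    return idf * norm

# A term occurring in 10 of 1000 documents, average document length 100.
for tf in (1, 2, 10, 100):
    print(tf, round(simple_tf_idf(tf, 10, 1000), 3), round(bm25(tf, 10, 1000, 100, 100), 3))
```

The printed table shows the main practical difference: the TF-IDF score keeps growing with term frequency, while the BM25 score levels off.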
Lucene extension points
Analysis chain
Search
Query parser
Lucene Index
Perform indexing
Codecs
Black box
REST API
Querying / Indexing
Elasticsearch Server
Controlling shard allocation
• Filtering built in
– By tags, groups, racks, IPs
– Black list / white list
• Total shards per node
• Disk based
• EXPERT: Roll your own by implementing AllocationDecider
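To give a feel for how allocation rules combine, here is a toy model in Python. It is not the Elasticsearch `AllocationDecider` API; the `Node` class and the three decider factories are hypothetical. The one idea it borrows is that several deciders are consulted and a shard is only placed on a node if none of them objects.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    rack: str
    disk_used_fraction: float
    shards: list = field(default_factory=list)

def max_shards_decider(limit):
    """Reject nodes that already hold `limit` shards (total shards per node)."""
    def decide(shard, node):
        return len(node.shards) < limit
    return decide

def disk_decider(high_watermark):
    """Reject nodes whose disk usage is at or above the watermark (disk based)."""
    def decide(shard, node):
        return node.disk_used_fraction < high_watermark
    return decide

def rack_exclude_decider(excluded_racks):
    """Reject nodes on black-listed racks (filtering by rack)."""
    def decide(shard, node):
        return node.rack not in excluded_racks
    return decide

def can_allocate(shard, node, deciders):
    """A shard may go to a node only if every decider agrees."""
    return all(decide(shard, node) for decide in deciders)

deciders = [max_shards_decider(2), disk_decider(0.9), rack_exclude_decider({"rack3"})]
node_a = Node("a", "rack1", 0.50, ["s1"])
node_b = Node("b", "rack2", 0.95)

print(can_allocate("s2", node_a, deciders))  # -> True
print(can_allocate("s2", node_b, deciders))  # -> False (disk too full)
```

Because every decider has a veto, a custom one has to be tested together with the built-in ones, not in isolation.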
Custom REST endpoints
Transports
• Exposes the Elasticsearch RESTful API over protocols other than HTTP
– Apache Thrift
– Memcached
– Servlet
– Redis
– ZeroMQ
Showcase: Custom percolator
Showcase: The bubble plugin
Site plugins
• Monitoring
– BigDesk, ElasticHQ, Paramedic, …
• Hammer (GUI for REST interface)
• Inquisitor (debugging queries)
• SegmentSpy
• WhatsOn
Discovery
• Default is Zen discovery
– Unicast: I know who my nodes are
– Multicast: Auto discovery for nodes
• Multicast discovery support for cloud environments
– AWS
– Azure
– Google Compute
• ProTip: Unicast in production unless you know what you’re doing
• ZooKeeper plugin
Snapshot / restore repositories
• File system
• AWS S3
• HDFS
• Azure
• Roll your own (e.g. Glacier)
River plugins
• Obsolete
• Use the “shoveller” approach
• logstash, stream2es
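The "shoveller" idea is that an external process reads from the source system and pushes documents into Elasticsearch, typically through the bulk API. Below is a minimal sketch of the batching part only, with no network code. The index name `events`, the type name `event`, and the helper names are made up for the example; the action-line-then-source-line layout follows the bulk API's newline-delimited JSON format.

```python
import json

def bulk_body(docs, index, doc_type):
    """Build a bulk request body: one action line, then one source line, per document."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk format requires a trailing newline.
    return "\n".join(lines) + "\n"

def batches(items, size):
    """Yield fixed-size slices so each bulk request stays small."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical records pulled from some source system.
records = [(str(i), {"message": "event %d" % i}) for i in range(5)]

for batch in batches(records, 2):
    body = bulk_body(batch, "events", "event")
    print(body)  # a real shoveller would POST this to the _bulk endpoint
```

Since the shoveller runs outside the cluster, it can keep its own backlog and retry when a node goes down, which is the failure mode that makes rivers awkward.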
Summary: Plugin types
• Lucene components
– Analysis
– Similarity
– Scoring
• REST endpoints
• Scripting
• ES infrastructure (Discovery, Transport, Snapshot/restore)
• Site plugins
• River plugins
Installing plugins
• Manual under /plugins
• Official / GitHub / Maven installation:
• From zip:
• Plugin management:
When to write a plugin?
Writing your own plugin: Gotchas
• Maintenance – the deeper you go in the API, the harder it is to keep it up to date
• Versioning and installation on (large) clusters
– Though this can be solved using Puppet, Docker et al.
• Auxiliary data (dictionaries etc.)
• Testing & Debugging
Code: Writing your own plugin
• JAR file with bootstrap code:
• Embed this as es-plugin.properties:
plugin=org.elasticsearch.plugin.example.ExamplePlugin
Thank you.
Questions?
Itamar Syn-Hershko
http://code972.com
@synhershko


Editor's Notes

  • #2 About me: freelancer, consultant, Lucene.NET committer. Rationale: Elasticsearch can do search, aggregations and percolation, and at scale; sometimes we need more than that. This talk: a bird's-eye view, covering a lot of ground, drawn from experience. Skimming over EXPERT features.
  • #3 Rationale: Elasticsearch can do search, aggregations and percolation, and at scale.
  • #4 Elasticsearch in a nutshell: REST and JSON wrapping Lucene; cluster forming and cluster metadata; the server distributes Lucene shards (replication, sharding, multi-tenancy).
  • #6 Not so interesting, since ES discourages the use of them; there are lighter implementations. Only applicable for query_string queries. Has to be done via code; example will follow.
  • #7 The analysis chain is very important for indexing. Some queries will still go through the analysis chain (the match family etc.).
  • #8 What is the analysis chain? Splitting words. The query term should match the indexed term; a term match is what we need to have. The term query is the most basic building block of a query. Stop words are obsolete => common terms query.
  • #10 The analysis chain should generally match on both ends. There are scenarios where they differ on purpose; this is why you can set search_analyzer & index_analyzer.
  • #11 Importance of proper tokenization. Discussion: on what characters should we tokenize? The curious case of email addresses. This is why you probably want to roll your own analyzer if you are doing a lot of FTS (full-text search).
  • #12 To finalize my case
  • #13 Some basic analyzers shipping with Lucene
  • #14 What happens when you try to… From code: custom analyzers, tokenizers & token filters that you can use.
  • #15 Hebrew is a tough language to tackle. HebMorph: open-source solution (AGPL3). Requires auxiliary files.
  • #17 Powered by MVEL (Java-like syntax). Other languages include Groovy, JS, Python. You _could_ implement your own scripting engine. Dynamic scripting is disabled by default; scripts need to be loaded from disk.
  • #18 Function score query: look up Britta’s talk. Similarity: replacement for TF/IDF; out of the box: BM25, DFR, IB and more. EXPERT EXPERT EXPERT
  • #20 EXPERT ONLY. Lucene 4.0 feature. Can provide performance boosts for searches and aggregations.
  • #21 Zoom out. In the integration point. Stats, management, …
  • #22 Some of the built-in features. Roll your own if needed. Con: requires tons of testing; multiple deciders are at play.
  • #23 Thanks to Found. A way to expose new plugin functionality to consumers not using Java, or to leverage the HTTP server capabilities of ES for your requirements. Parsing the request, performing the action, creating and sending the response.
  • #25 Better query filtering for performance (fewer queries). Highlighting. More logs + custom logs. Various other optimizations.
  • #26 A la significant terms facet. We could have done this client-side only, which would have been linear in time; we made this sharded. Java client code. Debugging.
  • #27 A static website that can be served using the ES HTTP server.
  • #28 Multicast: the "more the merrier" problem. ZooKeeper plugin: not up to date, not official. Aphyr’s findings re partial partitions.
  • #30 The idea behind them. Why rivers are obsolete: node comes down, backlog. Always prefer push over pull. Official guidance is not to use them going forward.
  • #32 Plugin names specify the folder name under /plugins. Node info API can provide
  • #33 Don’t be that guy: ignore the urge to write custom stuff. The defaults are good, and a lot can be done with scripting. Basically, when you really need custom distributed behavior, or a REST endpoint exposed cluster-wide, or EXPERT FEATURES.
  • #34 Aux data: open ES ticket for enabling analyzers to read docs.
  • #35 JAR (has to be JVM code). Boilerplate setup and code. Modules; AnalysisBinderProcessor; TransportActions; RestActions. Everything in Elasticsearch is implemented as an Action. Client / server reuse of request/response classes, when in Java.
  • #36 Summary