The ultimate guide for Elasticsearch plugins

Elasticsearch is a great product - for search, for scale, for analyzing data, and much more. But sometimes you need to do something that is not supported by Elasticsearch out of the box, and that's where plugins come into play.
Join me in this talk to explore the plugins land of Elasticsearch. We will discuss the various ways Elasticsearch can be extended, and the various types of plugins available to do that. By giving concrete examples and browsing the large selection of pre-made plugins, we will see how plugins can help us overcome various challenges. We will also discuss possible issues with plugins, and ways to work around them.
Finally, we will discuss scenarios in which custom plugin development is necessary and can really save the day. By showing a demo of one such scenario, and the way we built and debugged a plugin to solve it, we will complete the picture of the Elasticsearch plugin land, and hopefully inspire you to create your own!

Published in: Software, Technology, Art & Photos

  • About me – freelancer, consultant, Lucene.NET committer
    Rationale: Elasticsearch can do search, aggregations and percolation, all at scale
    Sometimes we need more than that
    This talk: a bird's-eye view, covering a lot of ground
    Based on experience
    Will skim over EXPERT features
  • Elasticsearch in a nutshell:
    REST, JSON wrapping Lucene
    Cluster forming and cluster metadata
    Server distributes Lucene shards (replication, sharding, multi-tenancy)
  • Not so interesting, since ES discourages their use. There are lighter implementations
    Only applicable to query_string queries
    Has to be done via code; an example will follow
  • Analysis chain very important for indexing
    Some queries will still go through the analysis chain (Match family etc)
  • What is the analysis chain?
    Splitting text into words (tokens)
    The query term should match the indexed term.
    Stop words are obsolete => use the common terms query instead
    The term query is the most basic building block of a query; a term match is what we need
  • The analysis chain should generally match on both ends (indexing and querying)
    There are scenarios where they differ on purpose
    This is why you can set search_analyzer & index_analyzer
  • Importance of proper tokenization
    Discussion: on what characters should we tokenize?
    The curious case of email addresses
    This is why you probably want to roll your own analyzer if you are doing a lot of full-text search (FTS)
  • To finalize my case
  • Some basic analyzers shipping with Lucene
  • What happens when you try to
    From code – custom analyzers, tokenizers & token filters that you can use
  • Hebrew is a tough language to tackle
    HebMorph - Open-source solution (AGPL3)
    Requires auxiliary files
  • Powered by MVEL – Java-like syntax
    Other languages include Groovy, JS, Python
    You _could_ implement your own scripting engine
    Dynamic scripting disabled by default
    Scripts need to be loaded from disk
  • Function score query: look up Britta’s talk
    Similarity: replacement for TF/IDF.
    Out of the box: BM25, DFR, IB and more
    EXPERT EXPERT EXPERT
  • EXPERT ONLY
    Lucene 4.0 feature
    Can provide performance boosts for searches and aggregations
  • Zoom out
    In the integration point
    Stats, management, …
  • Some of the built-in features
    Roll your own if needed
    Con: requires tons of testing, multiple deciders are at play
  • Thanks to Found
    A way to expose new plugin functionality to consumers not using Java
    Or leverage the HTTP server capabilities of ES for your requirements
    Parsing request, performing action, creating and sending response
  • Better query filtering for performance (fewer queries)
    Highlighting
    More logs + custom logs
    Various other optimizations
  • A la significant terms facet
    We could have done this client-side only.
    This would have been linear in time
    We made this sharded
    Java client code
    Debugging
  • A static website that can be served using ES HTTP server
  • Multicast: the more the merrier problem
    ZooKeeper plugin – not up to date, not official
    Aphyr’s finding re partial partitions
  • The idea behind them
    Why rivers are obsolete: node comes down, backlog
    Always prefer push over pull
    Official guidance is not to use them going forward
  • Plugin names specify the folder name under /plugins
    Node info API can provide
  • Don’t be that guy: ignore the urge to write custom stuff
    The defaults are good + A lot can be done w/ scripting
    Basically, when you really need custom distributed behavior
    Or a REST endpoint exposed cluster-wide
    Or EXPERT FEATURES
  • Aux data – open ES ticket for enabling analyzers to read docs
  • JAR (has to be JVM code)
    Boilerplate setup and code
    Modules; AnalysisBinderProcessor; TransportActions; RestActions;
    Everything in Elasticsearch is implemented as an Action
    Client / server reuse of request/response classes, when in Java
  • Summary
  • The ultimate guide for Elasticsearch plugins

    1. Itamar Syn-Hershko http://code972.com @synhershko The ultimate guide for Elasticsearch plugins
    2. Agenda • Integration points & plugin types • Showcases • Gotchas • When-to, How-to • Q & A
    3. Diagram labels: REST API | Analysis chain | Search | Querying | Query parser | Lucene Index | Perform indexing | Indexing | Make Lucene document | Elasticsearch Server
    4. Lucene extension points: Analysis chain | Search | Query parser | Lucene Index | Perform indexing
    5. Lucene extension points: Analysis chain | Search | Query parser | Lucene Index | Perform indexing
    6. Lucene extension points: Analysis chain | Search | Query parser | Lucene Index | Perform indexing
    7. Step 1: Tokenization: "Harry Potter and the Goblet of Fire" → Tokenizer → [Harry] [Potter] [and] [the] [Goblet] [of] [Fire]. Step 2: Filtering: Lower case filter → [harry] [potter] [and] [the] [goblet] [of] [fire]; Stop-words filter → [harry] [potter] [goblet] [fire]
    8. Step 1: Tokenization: "Welcome to Malmö!" → Tokenizer → [Welcome] [to] [Malmö]. Step 2: Filtering: ASCII folding filter → [Welcome] [to] [Malmo]; Lowercase filter → [welcome] [to] [malmo]
    9. Indexing: "Harry Potter and the Goblet of Fire" → Tokenizer → [Harry] [Potter] [and] [the] [Goblet] [of] [Fire] → Lower case filter → [harry] [potter] [and] [the] [goblet] [of] [fire] → Stop-words filter → [harry] [potter] [goblet] [fire]. Query: "Potter" → Tokenizer → [Potter] → Lower case filter → [potter] → Stop-words filter → [potter]
    10. Step 1: Tokenization: "itamar@code972.com" → Tokenizer → [itamar] [code] [972] [com]. Step 2: Filtering: Lower case filter → [itamar] [code] [972] [com]
    11. Try searching on German compound words…
    12. Analyzers. Input text: The quick brown fox jumped over the lazy dog, bob@hotmail.com 123432.
        StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog] [bob@hotmail.com] [123432]
        StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog] [bob] [hotmail] [com]
        SimpleAnalyzer: [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog] [bob] [hotmail] [com]
        WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog,] [bob@hotmail.com] [123432.]
        KeywordAnalyzer: [The quick brown fox jumped over the lazy dog, bob@hotmail.com 123432.]
    13. Custom analyzers from code New in Elasticsearch v1.1.0
    14. Showcase: Custom Analyzer - Hebrew analysis plugin for Elasticsearch • https://github.com/synhershko/elasticsearch-analysis-hebrew • Available on QBox.io
    15. Lucene extension points: Analysis chain | Search | Query parser | Lucene Index | Perform indexing
    16. Scripting • Sorting, filters, facets, script fields, custom scoring, aggregations, document updates • MVEL, but others are supported • Generally speaking: SLOOOOOOOW • Mostly useful as quick mocks / PoC • Native scripts using Java by implementing AbstractExecutableScript & AbstractSearchScript
    17. Custom scoring & similarity • Function score query – Previously known as Custom Score Query • Similarity
    18. Lucene extension points: Analysis chain | Search | Query parser | Lucene Index | Perform indexing
    19. Codecs
    20. Black box | REST API | Querying | Indexing | Elasticsearch Server
    21. Controlling shard allocation • Filtering built in – By tags, groups, racks, IPs – Black list / white list • Total shards per node • Disk based • EXPERT: Roll your own by implementing AllocationDecider
    22. Custom REST endpoints
    23. Transports • Exposes the Elasticsearch RESTful API over protocols other than HTTP – Apache Thrift – Memcached – Servlet – Redis – ZeroMQ
    24. Showcase: Custom percolator
    25. Showcase: The bubble plugin
    26. Site plugins • Monitoring – BigDesk, ElasticHQ, Paramedic, … • Hammer (GUI for REST interface) • Inquisitor (debugging queries) • SegmentSpy • WhatsOn
    27. Discovery • Default is Zen discovery – Unicast: I know who my nodes are – Multicast: Auto discovery for nodes • Multicast discovery support for cloud environments – AWS – Azure – Google Compute • ProTip: Unicast in production unless you know what you’re doing • ZooKeeper plugin
    28. Snapshot / restore repositories • File system • AWS S3 • HDFS • Azure • Roll your own (e.g. Glacier)
    29. River plugins • Obsolete • Use the “shoveller” approach • logstash, stream2es
    30. Summary: Plugin types • Lucene components – Analysis – Similarity – Scoring • REST endpoints • Scripting • ES infrastructure (Discovery, Transport, Snapshot/restore) • Site plugins • River plugins
    31. Installing plugins • Manual under /plugins • Official / GitHub / Maven installation: • From zip: • Plugin management:
    32. When to write a plugin?
    33. Writing your own plugin: Gotchas • Maintenance – the deeper you go in the API the harder it is to keep it up to date • Versioning and installation on (large) clusters – Though can be solved using puppet, docker et al • Auxiliary data (like dictionaries etc) • Testing & Debugging
    34. Code: Writing your own plugin • JAR file with bootstrap code: • Embed this as es-plugin.properties: plugin=org.elasticsearch.plugin.example.ExamplePlugin
    35. Thank you. Questions? Itamar Syn-Hershko http://code972.com @synhershko
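
The analysis chain from the "Harry Potter" slides (tokenize, then lowercase, then drop stop words, on both the indexing and the query side) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Lucene's or Elasticsearch's API: the function names, the regex-based tokenizer and the three-word stop list are all assumptions made for the example. In particular, this tokenizer keeps "code972" as one token, whereas the tokenizer on the email slide splits it into "code" and "972".

```python
import re

# Hypothetical stop-word list, chosen only to reproduce the slide's example.
# It is NOT Lucene's default stop set.
STOP_WORDS = {"and", "the", "of"}


def tokenize(text):
    """Step 1: split the text into tokens (runs of word characters)."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Step 2a: lowercase filter."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """Step 2b: stop-words filter."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text):
    """Run the whole chain: tokenizer first, then each filter in order."""
    return remove_stop_words(lowercase(tokenize(text)))


# Indexing side: the document's title goes through the chain,
# and the resulting terms are what ends up in the index.
indexed_terms = set(analyze("Harry Potter and the Goblet of Fire"))

# Query side: the user's input goes through the same chain,
# so that the query term can match the indexed term exactly.
query_terms = analyze("Potter")

# A document matches when every analyzed query term is an indexed term.
matches = all(term in indexed_terms for term in query_terms)

print(sorted(indexed_terms))  # ['fire', 'goblet', 'harry', 'potter']
print(query_terms)            # ['potter']
print(matches)                # True
```

Skipping the lowercase step on the query side alone would leave "Potter" as the query term, which no longer equals the indexed "potter". That is the reason the notes say the chain should generally match on both ends.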
