Elasticsearch

by Yervand Aghababyan
from SFL
user of elastic for 3+ years

Thanks to
Jurriaan Persin
CTO of Engagor
For “Introduction to Elasticsearch”
On SlideShare.net

Example 1 (addresses)
• Address has many fields (30+ on our screenshot)
• Some of the fields may hold complex data
• You need to search in that complex data and parse language
• You need a flexible search interface for that scary thing

FULL TEXT indexes
• Are not easy to maintain/work with
• Are hard to change
• Not many frameworks support them
• Operations: AND, OR, NOT, nesting, wildcard
• Example:
SELECT … FROM ADDRESS
WHERE a = a1 AND b = b1 AND c IN (c1, c2, c3)
AND MATCH(d) AGAINST (d1)

Address And Company
• Remember address? Add Company data
• Company has lots of fields as well
• SQL becomes something like this:
SELECT … FROM ADDRESS A
INNER JOIN COMPANY C ON…
WHERE a.a = a1 AND a.b = b1 AND a.c IN (c1, c2, c3)
AND c.a = a2 AND c.b = c2
AND MATCH(a.d, a.e) AGAINST (d1)
AND MATCH(c.d, c.e) AGAINST (d2)

Problems?
• Hard to program, too structured
• SQL performance is a worry, and there is no way to optimize it
• FULL TEXT indexes SUCK big time
• Search is too dependent on the data model
• Inability to create unified (search in everything) solutions

DB vs Search Engine
DBs
• Data model & consistency
• Transaction support/Atomicity
• Triggers/Stored procedures
• Data store (put/get)
Search Engine
• Language recognition
• Flexible searching
• Flexible data

Search Engine
• Efficient Indexing
– On all fields / combination of fields
• Analyzing data
– Text search
• Tokenizing
• Stemming
• Filtering
– Understanding/parsing locations
– Date parsing
• Relevance scoring

Tokenizing
• Finding word boundaries
– Not just .split(' ')
– Chinese has no spaces (not every character is a word)
• Parse patterns
– URLs
– Emails
– #hashtags
– Twitter @usernames
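
A minimal sketch of pattern-aware tokenizing, assuming plain regular expressions are good enough for illustration. The pattern names (`url`, `email`, `hashtag`, `mention`, `word`) and the `tokenize` helper are made up for this example and are not what Lucene or Elasticsearch use internally. It does not solve the Chinese word-boundary problem.

```python
import re

# Order matters: specific patterns are tried before the generic word pattern.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<url>https?://[^\s]+)                    # URLs
  | (?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)     # emails
  | (?P<hashtag>\#\w+)                          # hashtags
  | (?P<mention>@\w+)                           # Twitter usernames
  | (?P<word>\w+(?:'\w+)?)                      # plain words
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return a list of (kind, token) pairs found in text."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(text)]


tokens = tokenize("Mail bob@example.com or see http://elastic.co #search by @kimchy")
for kind, token in tokens:
    print(kind, token)
```

A plain `.split(' ')` would keep the same pieces here, but it would not tell you which piece is a URL or an email, and it would glue punctuation to words.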

Stemming
• “Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form.”
– Conjugations
– Plurals
• Example
– Fishing, Fished, Fish, Fisher -> fish
– Better -> Good
• Ways to do this
– Lookup tables
– Suffix/Prefix stripping
– Etc.
• Each language has its own specific stemmer
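
A toy illustration of the two approaches listed above: a lookup table for irregular forms plus naive suffix stripping. The word lists and the `stem` function are assumptions made for this sketch. Real stemmers (Porter, Snowball and friends) have far more rules and are language specific.

```python
# Lookup table for irregular forms that suffix stripping cannot handle.
IRREGULAR = {"better": "good", "best": "good", "went": "go"}

# Suffixes are tried in this order; longer ones come first.
SUFFIXES = ("ing", "ed", "er", "es", "s")


def stem(word):
    """Reduce an English word to a crude stem."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("Fishing", "Fished", "Fish", "Fisher", "Better"):
    print(w, "->", stem(w))
```

This reproduces the slide's example, but it breaks quickly on real text (for instance it turns "address" into "addres"), which is exactly why each language needs its own stemmer.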

Filtering
• Remove certain words that do not matter (stop-words)
– Different for every language
• Example: HTML
– If you’re indexing web content, most of the tags do not matter
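
A small sketch of both filters combined: drop the HTML tags first, then drop stop-words. The stop-word set is a tiny hand-picked English sample, and `strip_html` / `filter_tokens` are hypothetical helper names, not an Elasticsearch API.

```python
import re
from html.parser import HTMLParser

# Tiny English sample; real stop-word lists are longer and per language.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in"}


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags around them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Return only the text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def filter_tokens(html):
    """Lowercase, split into words and remove stop-words."""
    words = re.findall(r"\w+", strip_html(html).lower())
    return [w for w in words if w not in STOP_WORDS]


print(filter_tokens("<p>The <b>price</b> of the house</p>"))
```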

Location Awareness
• Geocoding of locations (longitudes and latitudes)
• Search on location
– Bounding box searches
– Radius searches (nearby)
– Searching by polygons (countries, states)
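
To make the first two search types concrete, here is a sketch of a bounding box check and a radius check on (latitude, longitude) pairs. It assumes a spherical Earth and uses the haversine formula; the function names are invented for the example, and the box check ignores boxes that cross the 180° meridian. Polygon search is left out.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def in_bounding_box(point, top_left, bottom_right):
    """True if point lies inside the box given by two (lat, lon) corners."""
    lat, lon = point
    return (bottom_right[0] <= lat <= top_left[0]
            and top_left[1] <= lon <= bottom_right[1])


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def within_radius(point, center, radius_km):
    """True if point is at most radius_km away from center ("nearby")."""
    return haversine_km(point, center) <= radius_km


print(in_bounding_box((40.18, 44.51), top_left=(41.0, 44.0), bottom_right=(40.0, 45.0)))
print(within_radius((0.0, 1.0), (0.0, 0.0), radius_km=150))
```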

Relevance Scoring
• Score based on certain word matches
• Complex scoring:
– Score geo matching better than keyword matching
– Score better if more of the context words match
– Score some keywords better than other keywords
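
The three rules above can be sketched as a toy scoring function. This is only an illustration of the idea, not Lucene's actual scoring formula; the weights, the `geo_boost` parameter and the `score` function are all assumptions.

```python
def score(doc_words, keyword_weights, geo_match=False, geo_boost=5.0):
    """Sum the weights of matched keywords, plus a boost for a geo match.

    More matching words give a higher score, heavier keywords count for
    more, and a geo match outweighs any single keyword by default.
    """
    matched = set(doc_words) & set(keyword_weights)
    total = sum(keyword_weights[w] for w in matched)
    if geo_match:
        total += geo_boost
    return total


weights = {"hotel": 2.0, "cheap": 1.0, "pool": 1.0}
print(score(["cheap", "hotel", "amsterdam"], weights))
print(score(["cheap", "hotel", "amsterdam"], weights, geo_match=True))
```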

Lucene
• Open Source
• Actively maintained (last release April 2015)
• Initially written in 1999
• Written in Java

Why not Lucene?
• It’s a library, not a “database”
• It’s hard to configure and use
• Using the same index from multiple applications/hosts is not possible
• You need to handle its availability/reliability issues
• You need to handle the scaling issues

elastic
• Open Source, free to use
• Written in 2010
• Based on Lucene
• Uses same language as Lucene: Java
• Standalone server
• Has REST API
• Provides horizontal scaling
• Addresses availability issues
• Is f**king easy to use!

elastic as MySQL
Elastic               MySQL
Index (and mapping)   Database/Schema
Type                  Table
Document              Row
Field                 Column
All stored data       Index

Master node
• Only one in the cluster
• Many master eligible nodes
• Automatic master election from eligible nodes
• Warnings!
– Split brain
– Requires configuration
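
The usual guard against split brain is to require a majority (quorum) of master-eligible nodes before a master can be elected. The sketch below only computes that number. As far as we know, Elasticsearch releases of this era take it through the `discovery.zen.minimum_master_nodes` setting; treat the setting name as an assumption and check it against your version. The function name is made up.

```python
def minimum_master_nodes(master_eligible):
    """Return the quorum size for a given number of master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    # Majority: more than half of the eligible nodes.
    return master_eligible // 2 + 1


for n in (1, 2, 3, 5):
    print(n, "eligible nodes -> quorum of", minimum_master_nodes(n))
```

With 2 eligible nodes the quorum is 2, so losing either node blocks master election; an odd number of eligible nodes avoids that.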

Discovery
• Unicast (this is the thing you know)
• Multicast
• Azure discovery
• EC2 discovery
• Google Compute Engine discovery

elastic clients
• REST client (the slowest option)
• Native protocol client to a single node
• Smart client (ES node)

How to use it
1. Start it
2. Index your data into it
3. Query it
4. Index some more data into it
5. Query it some more 
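
Steps 2 and 3 over the REST API can be sketched with nothing but the Python standard library. The example only builds the HTTP requests; `send` is defined but not called, so it runs without a server. Assumptions: a node on `http://localhost:9200`, the `/index/type/id` URL layout that matches the Index/Type/Document model used in this deck (later releases changed this), and made-up index, type and field names.

```python
import json
from urllib import request

BASE_URL = "http://localhost:9200"  # assumption: a local node on the default port
JSON_HEADERS = {"Content-Type": "application/json"}


def index_request(index, doc_type, doc_id, document):
    """Build a PUT request that stores one document under a given id."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return request.Request(url, data=body, method="PUT", headers=JSON_HEADERS)


def search_request(index, query):
    """Build a POST request that runs a query against one index."""
    url = f"{BASE_URL}/{index}/_search"
    body = json.dumps({"query": query}).encode("utf-8")
    return request.Request(url, data=body, method="POST", headers=JSON_HEADERS)


def send(req):
    """Send a prepared request and decode the JSON response (needs a running node)."""
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


put = index_request("addresses", "address", 1, {"city": "Yerevan", "street": "Baghramyan"})
find = search_request("addresses", {"match": {"city": "yerevan"}})
print(put.get_method(), put.full_url)
print(find.get_method(), find.full_url)
```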

Queries and Filters
Query
• Answers: if the document matches, how well does it match?
• Results can’t be cached
Filter
• Answers: does the document match? (yes/no)
• Fast, always use this if you can
• Cached (read: even faster)
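
A sketch of how the two are typically combined: a scored query for the free-text part, wrapped with filters for the exact yes/no conditions. It assumes the `filtered` query syntax of the Elasticsearch 1.x era (later releases moved this to a `bool` query with a `filter` clause, so check your version). The helper name and the field names are invented.

```python
import json


def filtered_search(text_field, text, exact_filters):
    """Build a search body: one scored match query plus unscored term filters."""
    return {
        "query": {
            "filtered": {
                # Scored part: how well does the document match?
                "query": {"match": {text_field: text}},
                # Yes/no part: cheap and cacheable.
                "filter": {
                    "bool": {
                        "must": [{"term": {f: v}} for f, v in exact_filters.items()]
                    }
                },
            }
        }
    }


body = filtered_search("street", "baghramyan", {"city": "yerevan"})
print(json.dumps(body, indent=2))
```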

Query/Filter types
• Boolean
• Match Query
• Fuzzy, wildcard, RegExp
• Has Parent/Child
• Range
• GeoShape
• Query String
• Span Queries
• Common Terms
(cutoff_frequency)
• Geo Filters
• Exists/Missing Filters
• Type Filters
• Term Filter

Performance
• Never had any problems with it (was lucky with the hardware)
• Fuzzy, wildcard queries are slow
• Use bulk indexing
• Monitor disk IO
• Monitor memory usage
• Monitor CPU usage
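
On bulk indexing: the bulk endpoint takes newline-delimited JSON, one action line followed by one document line, with a trailing newline at the end. The sketch below only builds that body. The `_type` key in the action line matches the Index/Type model of this deck and is an assumption about your version; the index, type and helper names are made up.

```python
import json


def bulk_body(index, doc_type, documents):
    """Build a bulk request body from (doc_id, document) pairs."""
    lines = []
    for doc_id, doc in documents:
        # Action line: what to do and where.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_body("addresses", "address", [(1, {"city": "Yerevan"}), (2, {"city": "Gyumri"})])
print(body, end="")
```

One request with many documents saves a network round trip per document compared with indexing them one by one.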

Inverted indexes
• These are not your normal B-Tree indexes
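
For contrast with a B-Tree, here is a minimal inverted index: a map from each term to the set of documents that contain it. It is a sketch of the data structure only, with whitespace tokenizing and made-up function names; Lucene's real index also stores positions, frequencies and much more.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index, terms):
    """Return ids of documents that contain every term (AND search)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


docs = {1: "red fish", 2: "blue fish", 3: "red car"}
index = build_inverted_index(docs)
print(search_all(index, ["red", "fish"]))
print(search_all(index, ["fish"]))
```

A search is a lookup per term plus a set intersection, no matter which field or word you ask for.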

Memory usage
• Minimize GC time
• Do not give ES too much RAM; better to start 2 instances
• Disable swap

Online backups
• All nodes back up simultaneously
• The backup should go to a network-mounted FS
• The backup is incremental

Nested documents
• Need a JSON document inside another JSON document? Do it!

Percolator
• This is the opposite of default searching
• Store your queries in the DB
• Match your documents against your query database
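
The reversed direction can be shown with a toy: the queries are stored, and each incoming document is checked against all of them. This is not the Elasticsearch percolator API. Here a "query" is just a set of words that must all be present, and every name is invented for the example.

```python
def percolate(stored_queries, document_text):
    """Return the names of stored queries that match the document.

    stored_queries maps a query name to the set of lowercase words that
    must all appear in the document.
    """
    words = set(document_text.lower().split())
    return sorted(name for name, required in stored_queries.items()
                  if required <= words)


queries = {
    "yerevan-flats": {"yerevan", "flat"},
    "cheap": {"cheap"},
}
print(percolate(queries, "Cheap flat in Yerevan"))
print(percolate(queries, "cheap car"))
```

Typical use: alerting, where users save a search and get notified when a new document matches it.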

Index warming
• During startup, pre-warm a node so it has all the indexes and caches in memory and responds fast to the very first client requests

How we use it
• Bayazet (40M docs, 100 GB data)
• Qlim
• CallMonkey
• Greetz
• iGind
