elasticsearch
by Yervand Aghababyan
from SFL
user of elastic for 3+ years
Thanks to
Jurriaan Persin
CTO of Engagor
For “Introduction to Elasticsearch”
On SlideShare.net
Suppose: We have a CRM
Example 1 (addresses)
• Address has many fields ( 30+ on our
screenshot)
• Some of the fields may have complex data
• Search in that complex data, parse language
• You need to have a flexible search interface for
that scary thing
What to do?
FULL TEXT indexes
• Are not easy to maintain/work with
• Are hard to change
• Not many frameworks support them
• Operations: AND, OR, NOT, nesting, wildcard
• Example: SELECT … FROM ADDRESS WHERE
a=a1 and b=b1 and c in (c1, c2, c3) and
match(d) against (d1)
Address And Company
• Remember address? Add Company data
• Company has lots of fields as well
• SQL becomes something like this:
SELECT … FROM ADDRESS A
INNER JOIN COMPANY C ON…
WHERE a.a=a1 and a.b=b1 and a.c in (c1, c2, c3)
and c.a=a2 and c.b=c2 and match(a.d,a.e) against
(d1) and match(c.d, c.e) against (d2)
Problems?
• Hard to program, too structured
• SQL performance is a worry, with no way to
optimize it
• FULL TEXT indexes SUCK big time
• Search is too dependent on the data
model
• Inability to create unified (in
everything) search solutions
SQL performance worries?
Can you do this?
DB vs Search Engine
DBs
• Data model & consistency
• Transaction support/
Atomicity
• Triggers/Stored procedures
• Data store ( put/get)
Search Engine
• Language recognition
• Flexible searching
• Flexible data
Search Engine
• Efficient Indexing
– On all fields / combination of fields
• Analyzing data
– Text search
• Tokenizing
• Stemming
• Filtering
– Understanding/parsing locations
– Date parsing
• Relevance scoring
Tokenizing
• Finding word boundaries
– Not just .split(‘ ‘)
– Chinese has no spaces ( Not every character is a
word)
• Parse patterns
– URLs
– Emails
– #hashtags
– Twitter @usernames
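To illustrate why tokenizing is more than .split(' '), here is a minimal sketch of a pattern-aware tokenizer. The regular expressions are simplified assumptions for illustration, not what Lucene's analyzers actually use, and the sketch does not attempt languages without spaces such as Chinese.

```python
import re

# Order matters: specific patterns (URL, email, hashtag, @username) are tried
# before the generic "word" fallback. These patterns are deliberately simple.
TOKEN_RE = re.compile(
    r"""
    https?://[^\s]+                  # URLs
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+     # emails
  | \#\w+                            # hashtags
  | @\w+                             # Twitter-style usernames
  | \w+                              # plain words
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return lowercased tokens, keeping URLs, emails, hashtags and @names whole."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    print("Hello, world!".split(" "))   # punctuation stays glued to the words
    print(tokenize("Hello, world!"))    # punctuation is dropped
    print(tokenize("Mail bob@example.com or see http://example.com/a?b=1 #elastic @kimchy"))
```

A URL directly followed by punctuation would keep that punctuation in this sketch; a real tokenizer handles such edge cases.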
Stemming
• “Stemming is the process for reducing inflected (or
sometimes derived) words to their stem, base or root
form.”
– Conjugations
– Plurals
• Example
– Fishing, Fished, Fish, Fisher -> fish
– Better -> Good
• Ways to do this
– Lookup tables
– Suffix/Prefix stripping
– Etc.
• Each language has its own specific stemmer
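The two approaches listed above (lookup tables and suffix stripping) can be combined in a toy stemmer. This is only a sketch for English: the suffix list and the lookup table are assumptions chosen to reproduce the slide's examples, and production stemmers apply far more careful rules.

```python
# Irregular forms that suffix stripping cannot handle go in a lookup table.
IRREGULAR = {"better": "good", "best": "good"}

# Suffixes are checked in this order; the first match is stripped.
SUFFIXES = ("ing", "ed", "er", "s")


def stem(word):
    """Reduce a word to a crude stem using a lookup table, then suffix stripping."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


if __name__ == "__main__":
    for word in ("Fishing", "Fished", "Fish", "Fisher", "Better"):
        print(word, "->", stem(word))
```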
Filtering
• Remove certain words that do not matter
(stop-words)
– Different for every language
• Example: HTML
– If you’re indexing web content, most of the tags
do not matter
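Both kinds of filtering can be sketched with the Python standard library. The stop-word set below is a tiny English sample picked for illustration; real analyzers ship a full list per language.

```python
from html.parser import HTMLParser

# Tiny sample list (assumption); every language needs its own.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to"}


class _TextExtractor(HTMLParser):
    """Collects only the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    """Return the text content of an HTML fragment with the tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


def remove_stop_words(tokens):
    """Drop tokens that carry no search value."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


if __name__ == "__main__":
    text = strip_html("<p>The <b>quick</b> fox</p>")
    print(remove_stop_words(text.split()))
```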
Location Awareness
• Geocoding of locations (longitudes and
latitudes)
• Search on location
– Bounding box searches
– Radius searches ( nearby )
– Searching by polygons (countries, states)
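Bounding box and radius searches come down to simple geometry on latitude/longitude pairs. The sketch below uses the haversine formula with a spherical Earth, which is an approximation; the function names are hypothetical and the bounding box check ignores boxes that cross the 180° meridian.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_bounding_box(lat, lon, top, left, bottom, right):
    """True if the point lies inside the box (no antimeridian handling)."""
    return bottom <= lat <= top and left <= lon <= right


def within_radius(points, lat, lon, radius_km):
    """Names of all points at most radius_km away from (lat, lon)."""
    return [
        name
        for name, (plat, plon) in points.items()
        if haversine_km(lat, lon, plat, plon) <= radius_km
    ]


if __name__ == "__main__":
    places = {"a": (0.0, 0.0), "b": (0.0, 0.5), "c": (0.0, 2.0)}
    print(within_radius(places, 0.0, 0.0, 100.0))
```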
Relevance Scoring
• Score based on certain word matches
• Complex scoring:
– Score geo matching better than keyword
matching
– Score better if more of the context words match
– Score some keywords better than other keywords
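The idea of scoring some keywords better than others can be shown with a toy scorer that multiplies term frequency by a per-keyword boost. This is a deliberately simplified assumption; a real search engine uses a much richer relevance formula.

```python
def score(doc_tokens, query_boosts):
    """Sum boost * term frequency over the query terms."""
    return sum(boost * doc_tokens.count(term) for term, boost in query_boosts.items())


def rank(docs, query_boosts):
    """Return ids of matching documents, best score first."""
    scored = [(score(tokens, query_boosts), doc_id) for doc_id, tokens in docs.items()]
    scored.sort(key=lambda pair: -pair[0])
    return [doc_id for s, doc_id in scored if s > 0]


if __name__ == "__main__":
    docs = {
        "d1": ["fish", "market"],
        "d2": ["fish", "fish", "soup"],
        "d3": ["bread"],
    }
    # "market" is boosted, so one match outweighs two plain "fish" matches.
    print(rank(docs, {"fish": 1.0, "market": 3.0}))
```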
Who does all this?
Lucene
• Open Source
• Actively maintained ( last release 2015
April )
• Initially written in 1999
• Written in Java
Why not Lucene?
• It’s a library not a “database”
• It’s hard to configure and use
• Using the same index from multiple
applications/hosts is not possible
• You need to handle its availability/reliability
issues
• You need to handle the scaling issues
elastic
• Open Source, free to use
• Written in 2010
• Based on Lucene
• Uses same language as Lucene: Java
• Standalone server
• Has REST API
• Provides horizontal scaling
• Addresses availability issues
• Is f**king easy to use!
elastic as MySQL
Elastic               MySQL
Index (and mapping)   Database/Schema
Type                  Table
Document              Row
Field                 Column
All stored data       Index
Distributed-ness
Distributed-ness
Setup/Run
Yes, it’s NoSQL
Master node
• Only one in the cluster
• Many master eligible nodes
• Automatic master election from eligible nodes
• Warnings!
– Split brain
– Requires configuration
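The usual guard against split brain is to require a majority (quorum) of master-eligible nodes before a master can be elected. In releases that use Zen discovery this number is set through discovery.zen.minimum_master_nodes. A small sketch of the arithmetic, with a hypothetical helper name:

```python
def minimum_master_nodes(master_eligible):
    """Smallest majority of the master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(n, "master-eligible nodes -> quorum of", minimum_master_nodes(n))
```

With only two master-eligible nodes the quorum is also two, so losing either node blocks master election; three is the smallest count that tolerates one failure.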
Discovery
• Unicast (this is the thing you know)
• Multicast
• Azure discovery
• EC2 discovery
• Google Compute Engine discovery
elastic clients
• REST Client ( the slowest option )
• Native protocol client to a single node
• Smart client ( ES node )
How to use it
1. Start it
2. Index your data into it
3. Query it
4. Index some more data into it
5. Query it some more
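Steps 2 and 3 map onto plain HTTP calls against the REST API. The sketch below only builds the requests and does not send them. The base URL assumes a local node on the default HTTP port, the index, type and field names are hypothetical, and the index/type/id URL layout is an assumption that matches releases which still have types.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node, default HTTP port


def index_request(index, doc_type, doc_id, document):
    """Build a PUT request that stores one JSON document under the given id."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    return urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def search_request(index, query):
    """Build a POST request that runs a query against one index."""
    url = f"{BASE_URL}/{index}/_search"
    return urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = index_request("crm", "address", 1, {"city": "Yerevan"})
    print(req.get_method(), req.full_url)
    req = search_request("crm", {"query": {"match": {"city": "yerevan"}}})
    print(req.get_method(), req.full_url)
```

Passing either request to urllib.request.urlopen would send it to a running node.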
Queries and Filters
Query
• Answers: if the document
matches, how well does it
match?
• Results can’t be cached
Filter
• Answers: does the
document match?
(yes/no)
• Fast, always use this if you
can
• Cached (read: even
faster)
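A search request can combine both: a scored query for the free-text part and filters for the yes/no conditions. The sketch below builds such a request body as a Python dict. The field names are hypothetical, and the bool query with a filter clause is the syntax of Elasticsearch 2.0 and later; the 1.x releases expressed the same idea with a filtered query.

```python
import json


def address_search(text, city, min_year):
    """Request body: scored full-text match plus unscored yes/no filters."""
    return {
        "query": {
            "bool": {
                # Scored part: how well does the document match the text?
                "must": [{"match": {"description": text}}],
                # Unscored part: does the document match at all?
                "filter": [
                    {"term": {"city": city}},
                    {"range": {"founded": {"gte": min_year}}},
                ],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(address_search("software outsourcing", "yerevan", 2005), indent=2))
```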
Query/Filter types
• Boolean
• Match Query
• Fuzzy, wildcard, RegExp
• Has Parent/Child
• Range
• GeoShape
• Query String
• Span Queries
• Common Terms
(cutoff_frequency)
• Geo Filters
• Exists/Missing Filters
• Type Filters
• Term Filter
Performance
• Never had any problems with it (was lucky
with the hardware)
• Fuzzy, wildcard queries are slow
• Use bulk indexing
• Monitor disk IO
• Monitor memory usage
• Monitor CPU usage
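Bulk indexing sends many documents in one request to the _bulk endpoint, as newline-delimited JSON: one action line followed by one document line, with a trailing newline at the end. A minimal sketch that builds such a body; the index name and documents are hypothetical, and releases that still use types may also need a _type entry in the action line.

```python
import json


def bulk_body(index, docs):
    """Build a newline-delimited body for the _bulk endpoint from {id: document}."""
    lines = []
    for doc_id, doc in docs.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))  # action
        lines.append(json.dumps(doc))                                          # source
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(bulk_body("crm", {"1": {"city": "Yerevan"}, "2": {"city": "Gyumri"}}), end="")
```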
Inverted indexes
• These are not your normal B-Tree indexes
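An inverted index maps each term to the set of documents that contain it, so a search becomes a lookup plus set intersection instead of a scan. A toy sketch with whitespace tokenizing (the real thing also stores positions, frequencies and more):

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index, terms):
    """Ids of documents containing every one of the given terms (AND search)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


if __name__ == "__main__":
    idx = build_inverted_index({1: "red fish", 2: "blue fish", 3: "red car"})
    print(search_all(idx, ["red", "fish"]))
    print(search_all(idx, ["fish"]))
```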
Memory usage
• Minimize GC time
• Do not give ES too much RAM; better to start 2
instances
• Disable swap
Online backups
• All nodes do simultaneous backup
• The backup should be done to a network
mounted FS
• The backup is incremental
Nested documents
• Need a JSON document inside another JSON
document? Do it!
Percolator
• This is the opposite of default searching
• Store your queries in the DB
• Match your documents against your query
database
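The reversal can be shown with a toy percolator: queries are stored up front, and each incoming document is checked against all of them. Here a stored query is simply a set of terms that must all appear in the document, which is an assumption for illustration and far simpler than real stored queries.

```python
def percolate(stored_queries, doc_text):
    """Return the names of stored queries whose terms all occur in the document."""
    tokens = set(doc_text.lower().split())
    return sorted(
        name for name, required in stored_queries.items() if required <= tokens
    )


if __name__ == "__main__":
    queries = {
        "fish-alert": {"fish"},
        "red-cars": {"red", "car"},
    }
    print(percolate(queries, "Red fish for sale"))
    print(percolate(queries, "Cheap red car"))
```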
Index warming
• During startup, pre-warm a node so it has all
the indexes and caches in memory and
responds fast to the very first client requests
Marvel
How we use it
• Bayazet (40M docs, 100 GB of data)
• Qlim
• CallMonkey
• Greetz
• iGind
Questions?
