Elasticsearch is a cloud-ready, super scalable search engine which is gaining a lot of popularity lately. It is mostly known for being extremely easy to setup and integrate with any technology stack.In this talk we will introduce Elasticdearch, and start by looking at some of its basic capabilities. We will demonstrate how it can be used for document search and even log analytics for DevOps and distributed debugging, and peek into more advanced usages like the real-time aggregations and percolation. Obviously, we will make sure to demonstrate how Elasticsearch can be scaled out easily to work on a distributed architecture and handle pretty much any load.
7. DocumentsTerm
<6>and
<2> <3>big
<6>dark
<4>did
<2>gown
<3>had
<2> <3>house
<1> <2> <3> <5> <6>in
<1> <3> <5>keep
<1> <4> <5>keeper
<1> <5> <6>keeps
<6>light
<4>never
<1> <4> <5>night
<1> <2> <3> <4>old
<4>sleep
<6>sleeps
<1> <2> <3> <4> <5> <6>the
<1> <3>town
<4>where
The index:
Dictionary and
posting lists
6 documents to index
Example from:
Justin Zobel , Alistair Moffat,
Inverted files for text search engines,
ACM Computing Surveys (CSUR)
v.38 n.2, p.6-es, 2006
The old night keeper keeps the keep in the town1
In the big old house in the big old gown.2
The house in the town had the big old keep3
Where the old night keeper never did sleep.4
The night keeper keeps the keep in the night5
And keeps in the dark and sleeps in the light.6
Full-text Search 101:
The inverted index
8. Full-text Search 101:
The inverted index
DocumentsTerm
<6>and
<2> <3>big
<6>dark
<4>did
<2>gown
<3>had
<2> <3>house
<1> <2> <3> <5> <6>in
<1> <3> <5>keep
<1> <4> <5>keeper
<1> <5> <6>keeps
<6>light
<4>never
<1> <4> <5>night
<1> <2> <3> <4>old
<4>sleep
<6>sleeps
<1> <2> <3> <4> <5> <6>the
<1> <3>town
<4>where
The index:
Dictionary and
posting lists
6 documents to index
The old night keeper keeps the keep in the town1
In the big old house in the big old gown.2
The house in the town had the big old keep3
Where the old night keeper never did sleep.4
The night keeper keeps the keep in the night5
And keeps in the dark and sleeps in the light.6
User queries for “keeper”
12. Relevance
• Precision
The fraction of the retrieved
documents that are relevant
• Recall
The fraction of the relevant
documents that are retrieved
• Order of results
13. Challenges with search
• Relevance
• Getting the tokens right
– Tokenization
– Stemming
• Multi-lingual content
– Or other cross-cutting search concerns
• Tolerance
34. #6: Debugging a distributed system
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif
HTTP/1.0" 200 2326 "http://www.example.com/start.html"
"Mozilla/4.08 [en] (Win98; I ;Nav)"
System.NullReferenceException: Object reference not set to an instance of an object.
at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
at AjaxControlToolkit.ToolkitScriptManager.GetScriptCombineAttributes(Assembly assembly)
at AjaxControlToolkit.ToolkitScriptManager.IsScriptCombinable(ScriptEntry scriptEntry)
at AjaxControlToolkit.ToolkitScriptManager.OnResolveScriptReference(ScriptReferenceEventArgs e)
at System.Web.UI.ScriptManager.RegisterScripts()
at System.Web.UI.ScriptManager.OnPagePreRenderComplete(Object sender, EventArgs e)
at System.Web.UI.Page.OnPreRenderComplete(EventArgs e)
at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean
includeStagesAfterAsyncPoint)
35. #7: Distributed git storage
• PoC in C# using libgit2sharp
• https://github.com/synhershko/libgit2sharp.El
asticsearch
• Kudos @nulltoken
“I want full-text search. Oh, Aggregations!”
Kibana 4
Usually log analysis => logstash
Aggregations Framework
Live data analysis and reporting, logs as an example
Kibana as ES dashboard
You could use the same approach as before
Or this logstash pattern
Roll your own indexers
Round robin
As data comes in, fire alerts when / if it meets a criteria
Can do aggregations on matching queries
Build your own DSL. Recognizing entities.
As-you-type feedback
Span queries. Match phrase prefix.
Suggesters.
Significant terms facet for query expansion / suggestions
There are some challenges
Built-in suggesters
Offline can be also real-time; just doesn’t happen on search
Pre-processing of data to allow for faster, more exact search later
Using the percolaor
Significant terms aggregation
Process that often happens offline
“People who liked this also liked…”
Significant terms
Midtown NYC level 5 cell
It is just a matter of figuring out how to represent the piece of data in an index
What is an anomaly
Logs can be exceptions
Exceptions can be parsed – “grokked”
Logs can be exceptions
Exceptions can be parsed – “grokked”