Elasticsearch Distributed search & analytics on BigData made easy

Itamar Syn-Hershko
http://code972.com
@synhershko

Me?
• Itamar Syn-Hershko / @synhershko
• Lucene.NET PMC and lead committer
• Freelance consultant and developer
• Elasticsearch consulting partner
• Microsoft MVP
• RavenDB
– Ex-core developer
– “RavenDB in Action” author

Elasticsearch
• Powered by Apache Lucene
• Open-source
• Rapid growth
• High profile users world-wide

REST API
• Indexes
• Types
• IDs
$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
    "user" : "synhershko",
    "post_date" : "2013-05-30T14:12:12",
    "message" : "trying out Elastic Search",
    "followers" : 3,
    "registered" : true
}'
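
The curl call above maps index, type and ID onto the URL path. A minimal Python sketch of that mapping follows; it only builds the request and does not send it. The helper name `index_request` is hypothetical. Newer Elasticsearch versions have dropped mapping types, so this mirrors the URL layout shown on the slide rather than a current API.

```python
import json


def index_request(base_url, index, doc_type, doc_id, document):
    """Return (method, url, body) for storing a document at /index/type/id.

    Hypothetical helper for illustration; it performs no network I/O.
    """
    url = f"{base_url.rstrip('/')}/{index}/{doc_type}/{doc_id}"
    return "PUT", url, json.dumps(document)


tweet = {
    "user": "synhershko",
    "post_date": "2013-05-30T14:12:12",
    "message": "trying out Elastic Search",
    "followers": 3,
    "registered": True,
}

method, url, body = index_request("http://localhost:9200", "twitter", "tweet", 1, tweet)
print(method, url)
print(body)
```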

Full-text Search 101:
The inverted index

6 documents to index
1  The old night keeper keeps the keep in the town
2  In the big old house in the big old gown.
3  The house in the town had the big old keep
4  Where the old night keeper never did sleep.
5  The night keeper keeps the keep in the night
6  And keeps in the dark and sleeps in the light.

The index: Dictionary and posting lists

Term      Documents
and       <6>
big       <2> <3>
dark      <6>
did       <4>
gown      <2>
had       <3>
house     <2> <3>
in        <1> <2> <3> <5> <6>
keep      <1> <3> <5>
keeper    <1> <4> <5>
keeps     <1> <5> <6>
light     <6>
never     <4>
night     <1> <4> <5>
old       <1> <2> <3> <4>
sleep     <4>
sleeps    <6>
the       <1> <2> <3> <4> <5> <6>
town      <1> <3>
where     <4>

Example from: Justin Zobel, Alistair Moffat, "Inverted files for text search engines", ACM Computing Surveys (CSUR), v.38 n.2, p.6-es, 2006

Full-text Search 101:
The inverted index
User queries for “keeper”
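
To make the lookup concrete, here is a small Python sketch that builds the dictionary and posting lists for the six example documents and answers the "keeper" query. The tokenizer (lowercase, letters only) is an assumption made for illustration, not how Lucene analyzes text.

```python
import re
from collections import defaultdict

DOCS = {
    1: "The old night keeper keeps the keep in the town",
    2: "In the big old house in the big old gown.",
    3: "The house in the town had the big old keep",
    4: "Where the old night keeper never did sleep.",
    5: "The night keeper keeps the keep in the night",
    6: "And keeps in the dark and sleeps in the light.",
}


def tokenize(text):
    # Assumed tokenizer: lowercase the text and keep runs of letters.
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    # Map every term to the sorted list of document IDs that contain it.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


INDEX = build_index(DOCS)


def search(term):
    # A single-term query is one dictionary lookup.
    return INDEX.get(term.lower(), [])


print(search("keeper"))  # [1, 4, 5]
```

The query never scans the documents themselves; it reads one posting list, which is why lookups stay cheap as the collection grows.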

Term Normalization
• Lowercasing
• Stop words (grey)
– Not best practice anymore
• Stemming
– Porter stemmer
– s-stemmer
• Relevance++
• SizeOnDisk--
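
The normalization steps can be sketched in a few lines of Python. The stop-word set and the `s_stem` rule below are assumptions for illustration: the stemmer only strips a trailing "s" and is a crude stand-in, not the full s-stemmer rule set or the Porter stemmer.

```python
import re

# Assumed stop-word list, chosen for this example only.
STOP_WORDS = {"and", "in", "the"}


def s_stem(term):
    # Crude plural stripping: drop one trailing "s" from longer words.
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


def normalize(text, remove_stop_words=True):
    terms = re.findall(r"[a-z]+", text.lower())  # lowercasing + tokenizing
    if remove_stop_words:
        terms = [t for t in terms if t not in STOP_WORDS]
    return [s_stem(t) for t in terms]


print(normalize("And keeps in the dark and sleeps in the light."))
# ['keep', 'dark', 'sleep', 'light']
```

After normalization "keeps" and "keep" share one dictionary entry, which is where the relevance gain and the smaller index come from.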

Full-Text Search
Your data store

How hard is it to get search right, anyway?

Relevance
• Precision
– The fraction of the retrieved documents that are relevant
• Recall
– The fraction of the relevant documents that are retrieved
• Order of results
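
Both definitions reduce to set arithmetic. A minimal Python sketch, with made-up document IDs as the example data:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 5 documents actually relevant.
p, r = precision_recall(retrieved=[1, 4, 5, 6], relevant=[1, 2, 3, 4, 5])
print(p, r)  # 0.75 0.6
```

Neither number says anything about ordering, which is why the order of results is listed as a separate concern.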

Challenges with search
• Relevance
• Getting the tokens right
– Tokenization
– Stemming
• Multi-lingual content
– Or other cross-cutting search concerns
• Tolerance

Real-time Analytics
[Diagram labels: “Shippers”, Queue (Redis), “Indexer”]

Matching inexact queries
• Phrase slop
– “Bridge of London” -> “London Bridge”
• Edit distance within a word, using fuzzy queries
– ditsance -> distance
– color -> colour
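
Both fuzzy examples above are one edit apart when an adjacent-character swap counts as a single edit. The Python sketch below computes that distance (the optimal string alignment variant of Damerau-Levenshtein). It illustrates the measure only and is not how Lucene evaluates fuzzy queries internally.

```python
def edit_distance(a, b):
    """Edit distance counting insert, delete, substitute and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ts" <-> "st"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("ditsance", "distance"))  # 1
print(edit_distance("color", "colour"))       # 1
```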

Structuring the unstructured
• Record linkage
– Bag of words model
– “More Like This” functionality
• NLP
• Entity extraction
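
As a rough illustration of the bag of words model, the Python sketch below turns texts into term-count vectors and compares them with cosine similarity. This is a generic similarity measure chosen for the example; it is not a description of how the “More Like This” feature is implemented.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    # Word order is discarded; only term counts remain.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


doc1 = bag_of_words("The old night keeper keeps the keep in the town")
doc5 = bag_of_words("The night keeper keeps the keep in the night")
doc6 = bag_of_words("And keeps in the dark and sleeps in the light.")

print(cosine_similarity(doc1, doc5))  # higher: many shared terms
print(cosine_similarity(doc1, doc6))  # lower: few shared terms
```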

Geo-spatial search
• Distance
• Shape interactions
• Multiple algorithms
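
Distance is the simplest of these to show. The sketch below uses the haversine formula, one common choice among the available algorithms, and assumes a spherical Earth with a mean radius of 6371 km. The coordinates are approximate values for London and Paris, picked only as example input.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# Approximate coordinates, used purely as example input.
london = (51.5074, -0.1278)
paris = (48.8566, 2.3522)
print(round(haversine_km(*london, *paris)), "km")
```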

Image search
http://colors.qbox.io/

Deep Visual-Semantic Alignments for Generating Image Descriptions
http://cs.stanford.edu/people/karpathy/deepimagesent

The Significant Terms Aggregation

Uncommonly common
Mark Harwood’s talk at
http://www.infoq.com/presentations/elasticsearch-revealing-uncommonly-common
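
One way to read "uncommonly common" is that a term is interesting when it is far more frequent in a subset of documents (the foreground) than in the whole index (the background). The Python sketch below scores that with a simple product of absolute and relative change. The formula and the counts are assumptions for illustration; they are not presented as the exact scoring Elasticsearch applies.

```python
def significance(fg_count, fg_total, bg_count, bg_total):
    """Score how over-represented a term is in the foreground set."""
    if fg_total == 0 or bg_total == 0 or bg_count == 0:
        return 0.0
    fg = fg_count / fg_total  # share of foreground docs containing the term
    bg = bg_count / bg_total  # share of all docs containing the term
    if fg <= bg:
        return 0.0  # not more common than usual
    return (fg - bg) * (fg / bg)


# Hypothetical counts: a rare term that dominates the subset scores high,
# while a term present everywhere scores zero.
print(significance(fg_count=50, fg_total=100, bg_count=100, bg_total=10000))
print(significance(fg_count=100, fg_total=100, bg_count=10000, bg_total=10000))
```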

#6: Debugging a distributed system
[Diagram label: Queue (Redis)]

#6: Debugging a distributed system
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
   at AjaxControlToolkit.ToolkitScriptManager.GetScriptCombineAttributes(Assembly assembly)
   at AjaxControlToolkit.ToolkitScriptManager.IsScriptCombinable(ScriptEntry scriptEntry)
   at AjaxControlToolkit.ToolkitScriptManager.OnResolveScriptReference(ScriptReferenceEventArgs e)
   at System.Web.UI.ScriptManager.RegisterScripts()
   at System.Web.UI.ScriptManager.OnPagePreRenderComplete(Object sender, EventArgs e)
   at System.Web.UI.Page.OnPreRenderComplete(EventArgs e)
   at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)
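
Log lines like the access-log entry above become searchable once they are split into fields. The Python sketch below parses that line with a regular expression written for the Apache combined log format. The field names and the `parse_access_log` helper are assumptions chosen for this example, not the schema of any particular shipper or indexer.

```python
import re

# Regex for the Apache "combined" log format; group names are illustrative.
LOG_PATTERN = re.compile(
    r'(?P<client>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_access_log(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    doc = match.groupdict()
    doc["status"] = int(doc["status"])
    # Apache writes "-" when no bytes were sent.
    doc["size"] = 0 if doc["size"] == "-" else int(doc["size"])
    return doc


line = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)
doc = parse_access_log(line)
print(doc["client"], doc["user"], doc["status"], doc["size"])
```

The resulting dictionary is the kind of JSON document that could then be indexed, so a status code or user name becomes a field to filter and aggregate on.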

#7: Distributed git storage
• PoC in C# using libgit2sharp
• https://github.com/synhershko/libgit2sharp.Elasticsearch
• Kudos @nulltoken

Thank you.
Questions?
Itamar Syn-Hershko
http://code972.com
@synhershko
