Your Data, Your Search, ElasticSearch (EURUKO 2011)

Your Data,
Your Search

Karel Minařík

http://karmi.cz

ElasticSearch

Your Data,
Your Search

Karel Minařík and Florian Hanke

Search is the primary interface
for getting information today.


http://www.apple.com/macosx/what-is-macosx/spotlight.html

# https://github.com/rubygems/gemcutter/blob/master/app/models/rubygem.rb#L29-33
#
def self.search(query)
  where("versions.indexed and (upper(name) like upper(:query) or
         upper(versions.description) like upper(:query))", {:query => "%#{query.strip}%"}).
    includes(:versions).
    order("rubygems.downloads desc")
end

Search (mostly) sucks.
Why?


WHY SEARCH SUCKS?

How do you implement search?

class MyModel
include Whatever::Search
end

MyModel.search "something"

WHY SEARCH SUCKS?


class MyModel
include Whatever::Search
MAGIC
end

MyModel.search "whatever"

WHY SEARCH SUCKS?


Query Results Result

def search
@results = MyModel.search params[:q]
respond_with @results
end

WHY SEARCH SUCKS?



MAGIC

def search
end

WHY SEARCH SUCKS?



MAGIC +

def search
end

23px

670px

A personal story...

WHY SEARCH SUCKS?

Compare your search library with your ORM library

MyModel.search "(this OR that) AND NOT whatever"

articles = Arel::Table.new(:articles)
comments = Arel::Table.new(:comments)

articles.
  where(articles[:title].eq('On Search')).
  where(articles[:published_on].gteq(Time.now)).
  join(comments).
  on(articles[:id].eq(comments[:article_id])).
  take(5).
  skip(4).
  to_sql

Your data, your search.


HOW DOES SEARCH WORK?

A collection of documents

file_1.txt
The ruby is a pink to blood-red colored gemstone ...

file_2.txt
Ruby is a dynamic, reflective, general-purpose object-oriented
programming language ...

file_3.txt
"Ruby" is a song by English rock band Kaiser Chiefs ...


How do you search documents?

File.read('file_1.txt').include?('ruby')


The inverted index

TOKENS POSTINGS

ruby file_1.txt file_2.txt file_3.txt
pink file_1.txt
gemstone file_1.txt

dynamic file_2.txt
reflective file_2.txt
programming file_2.txt

song file_3.txt
english file_3.txt
rock file_3.txt

http://en.wikipedia.org/wiki/Index_(search_engine)#Inverted_indices


The inverted index

MySearchLib.search "ruby"

pink file_1.txt
gemstone file_1.txt

dynamic file_2.txt

song file_3.txt
english file_3.txt
rock file_3.txt



The inverted index

MySearchLib.search "song"

pink file_1.txt
gemstone file_1.txt

dynamic file_2.txt

song file_3.txt
english file_3.txt
rock file_3.txt


module SimpleSearch

  def index document, content
    tokens = analyze content
    store document, tokens
    puts "Indexed document #{document} with tokens:", tokens.inspect, "\n"
  end

  def analyze content
    # >>> Split content by words into "tokens"
    content.split(/\W/).
    # >>> Downcase every word
    map { |word| word.downcase }.
    # >>> Reject stop words, digits and whitespace
    reject { |word| STOPWORDS.include?(word) || word =~ /^\d+/ || word == '' }
  end

  def store document_id, tokens
    tokens.each do |token|
      # >>> Save the "posting"
      ( (INDEX[token] ||= []) << document_id ).uniq!
    end
  end

  def search token
    puts "Results for token '#{token}':"
    # >>> Print documents stored in index for this token
    (INDEX[token] || []).each { |document| puts " * #{document}" }
  end

  INDEX = {}
  # The stop word list is cut off on the slide; only the legible part is kept here
  STOPWORDS = %w|a an and are as at but by for if in is it no not of on or that the then there|

  extend self

end
A naïve Ruby implementation


Indexing documents

SimpleSearch.index "file1", "Ruby is a language. Java is also a language."
SimpleSearch.index "file2", "Ruby is a song."
SimpleSearch.index "file3", "Ruby is a stone."
SimpleSearch.index "file4", "Java is a language."

Indexed document file1 with tokens:
["ruby", "language", "java", "also", "language"]
["ruby", "song"]
["ruby", "stone"]
["java", "language"]

Words downcased, stopwords removed.


The index

puts "What's in our index?"
p SimpleSearch::INDEX
{
"ruby" => ["file1", "file2", "file3"],
"language" => ["file1", "file4"],
"java" => ["file1", "file4"],
"also" => ["file1"],
"stone" => ["file3"],
"song" => ["file2"]
}


Search the index

SimpleSearch.search "ruby"
Results for token 'ruby':
* file1
* file2
* file3


The inverted index

TOKENS POSTINGS

ruby 3 file_1.txt file_2.txt file_3.txt
pink 1 file_1.txt
gemstone file_1.txt

dynamic file_2.txt

song file_3.txt
english file_3.txt
rock file_3.txt


It is very practical to know how search works.

For instance, now you know that
the analysis step is very important.

Most of the time, it's more important than the search step.
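
A minimal sketch of why analysis matters. The helper names `naive_tokens` and `analyzed_tokens` are made up for this illustration; the same query hits or misses depending only on how the text was turned into tokens.

```ruby
# Two ways of turning text into index tokens.
def naive_tokens(text)
  text.split(' ')                                     # whitespace only: keeps case and punctuation
end

def analyzed_tokens(text)
  text.split(/\W+/).map(&:downcase).reject(&:empty?)  # word boundaries plus downcasing
end

text = "Ruby is a song."

puts naive_tokens(text).include?("ruby")     # false: the token is "Ruby"
puts naive_tokens(text).include?("song")     # false: the token is "song."
puts analyzed_tokens(text).include?("ruby")  # true
puts analyzed_tokens(text).include?("song")  # true
```

The lookup itself (`include?`) is identical in both cases; only the analysis differs.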



The Search Engine Textbook

Search Engines
Information Retrieval in Practice
Bruce Croft, Donald Metzler and Trevor Strohman
Addison Wesley, 2009

http://search-engines-book.com

SEARCH IMPLEMENTATIONS

The Baseline Information Retrieval Implementation

Lucene in Action
Michael McCandless, Erik Hatcher and Otis Gospodnetic
July, 2010

http://manning.com/hatcher3

{ }
HTTP
JSON
Schema-free
Index as Resource
Distributed
Queries
Facets
Mapping
Ruby
ElasticSearch

ELASTICSEARCH FEATURES

HTTP / JSON / Schema-free / Index as Resource / Distributed / Queries / Facets / Mapping / Ruby
# Add document (URL structure: /INDEX/TYPE/ID)
curl -X POST "http://localhost:9200/articles/article/1" -d '{ "title" : "One" }'
# Query
curl -X GET "http://localhost:9200/articles/_search?q=One"
curl -X POST "http://localhost:9200/articles/_search" -d '{
  "query" : { "terms" : { "tags" : ["ruby", "python"], "minimum_match" : 2 } }
}'
# Delete index
curl -X DELETE "http://localhost:9200/articles"
# Create index with settings and mapping
curl -X PUT "http://localhost:9200/articles" -d '
{ "settings" : { "index" : { "number_of_shards" : 3, "number_of_replicas" : 2 } },
  "mappings" : { "document" : {
    "properties" : {
      "body" : { "type" : "string", "analyzer" : "snowball" }
    }
  } }
}'




# GET http://user:password@localhost:8080/_search?q=* => http://localhost:9200/user/_search?q=*
http {
  server {

    listen       8080;
    server_name  search.example.com;

    error_log   elasticsearch-errors.log;
    access_log  elasticsearch.log;

    location / {

      # Deny access to Cluster API
      if ($request_filename ~ "_cluster") {
        return 403;
        break;
      }

      # Pass requests to ElasticSearch
      proxy_pass http://localhost:9200;
      proxy_redirect off;

      proxy_set_header  X-Real-IP  $remote_addr;
      proxy_set_header  X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header  Host $http_host;

      # Authorize access
      auth_basic           "ElasticSearch";
      auth_basic_user_file passwords;

      # Route all requests to authorized user's own index
      rewrite  ^(.*)$  /$remote_user$1  break;
      rewrite_log on;

      return 403;

    }
  }
}
https://gist.github.com/986390


HTTP / JSON / Schema-free / Index as Resource / Distributed / Queries / Facets / Mapping / Ruby
{
"id" : "abc123",

"title" : "ElasticSearch Understands JSON!",

"body" : "ElasticSearch not only “works” with JSON, it understands it! Let’s first ...",

"published_on" : "2011/05/27 10:00:00",

"tags" : ["search", "json"],

"author" : {
"first_name" : "Clara",
"last_name" : "Rice",
"email" : "clara@rice.org"
}
}


HTTP / JSON / Schema-free / Index as Resource / Distributed / Queries / Facets / Mapping / Ruby
curl -X DELETE "http://localhost:9200/articles"; sleep 1
curl -X POST "http://localhost:9200/articles/article" -d '
{
  "id" : "abc123",

  "title" : "ElasticSearch Understands JSON!",

  "body" : "ElasticSearch not only “works” with JSON, it understands it! Let’s first ...",

  "published_on" : "2011/05/27 10:00:00",

  "tags" : ["search", "json"],

  "author" : {
    "first_name" : "Clara",
    "last_name" : "Rice",
    "email" : "clara@rice.org"
  }
}'
curl -X POST "http://localhost:9200/articles/_refresh"

curl -X GET "http://localhost:9200/articles/article/_search?q=author.first_name:clara"


HTTP / JSON / Schema-free / Index as Resource / Distributed / Queries / Facets / Mapping / Ruby
curl -X GET "http://localhost:9200/articles/_mapping?pretty=true"
{
"articles" : {
"article" : {
"properties" : {
"title" : {
"type" : "string"
},
// ...
"author" : {
"dynamic" : "true",
"properties" : {
"first_name" : {
"type" : "string"
},
// ...
}
},
"published_on" : {
"format" : "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd",
"type" : "date"
}
}
}
}
}


HTTP / JSON / Schema Free / Index as Resource / Distributed / Queries / Facets / Mapping / Ruby
curl -X POST "http://localhost:9200/articles/comment" -d '
{
  "body" : "Wow! Really nice JSON support.",

  "published_on" : "2011/05/27 10:05:00",

  "author" : {
    "first_name" : "John",
    "last_name" : "Pear",
    "email" : "john@pear.org"
  }
}'

curl -X GET "http://localhost:9200/articles/comment/_search?q=author.first_name:john"

curl -X GET "http://localhost:9200/articles/comment/_search?q=body:json"

curl -X GET "http://localhost:9200/articles/_search?q=body:json"

curl -X GET "http://localhost:9200/articles,users/_search?q=body:json"

curl -X GET "http://localhost:9200/_search?q=body:json"


curl -X DELETE "http://localhost:9200/articles"; sleep 1
curl -X POST "http://localhost:9200/articles/article/1" -d '
{
"id" : "1",


"body" : "ElasticSearch not only “works” with JSON, it understands it! Let’s first ...",

"published_on" : "2011/05/27 10:00:00",


"author" : {
}
}'

curl -X GET "http://localhost:9200/articles/article/1"


{"_index":"articles","_type":"article","_id":"1","_version":1, "_source" :
{
"id" : "1",


"body" : "ElasticSearch not only “works” with JSON, it understands it! Let’s first ...",

"published_on" : "2011/05/27 10:00:00",


"author" : {
}
}}

The Index Is Your Database.


Index Aliases

curl -X POST 'http://localhost:9200/_aliases' -d '
{
  "actions" : [
    { "add" : {
        "index" : "index_1",
        "alias" : "myalias"
      }
    },
    { "add" : {
        "index" : "index_2",
        "alias" : "myalias"
      }
    }
  ]
}'

http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases.html


The “Sliding Window” problem

curl -X DELETE http://localhost:9200/logs_2010_01

[Diagram: logs_2010_02, logs_2010_03 and logs_2010_04 sit behind the "logs" alias; logs_2010_01 is deleted]

“We can really store only three months worth of data.”
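
A possible way to slide the window with the aliases API shown earlier, sketched under assumptions: `slide_window_actions` is a hypothetical helper, and the index names are only the ones from the diagram. It builds the request body that takes the oldest index out of the alias and puts the newest one in.

```ruby
require 'json'

# Build the body for POST /_aliases: the oldest monthly index leaves
# the alias, the newest one joins it.
def slide_window_actions(alias_name, oldest_index, newest_index)
  { "actions" => [
      { "remove" => { "index" => oldest_index, "alias" => alias_name } },
      { "add"    => { "index" => newest_index, "alias" => alias_name } }
  ] }
end

payload = slide_window_actions("logs", "logs_2010_01", "logs_2010_04")
puts JSON.pretty_generate(payload)
```

The generated JSON would be sent to `http://localhost:9200/_aliases`, after which the oldest index can be deleted. The application keeps searching the `logs` alias and never needs to know which months are behind it.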


Index Templates

# Apply this configuration for every matching index being created
curl -X PUT localhost:9200/_template/bookmarks_template -d '
{
  "template" : "users_*",

  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 3
    }
  },

  "mappings": {
    "url": {
      "properties": {
        "url": {
          "type": "string", "analyzer": "simple", "boost": 10
        },
        "title": {
          "type": "string", "analyzer": "snowball", "boost": 5
        }
        // ...
      }
    }
  }
}
'
http://www.elasticsearch.org/guide/reference/api/admin-indices-templates.html



$ cat elasticsearch.yml
cluster:
name: <YOUR APPLICATION>

Automatic Discovery Protocol

MASTER
Node 1 Node 2 Node 3 Node 4

http://www.elasticsearch.org/guide/reference/modules/discovery/



Index A is split into 3 shards, and duplicated in 2 replicas.

Shards   Replicas
A1       A1'  A1''
A2       A2'  A2''
A3       A3'  A3''

curl -XPUT 'http://localhost:9200/A/' -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 3,
      "number_of_replicas" : 2
    }
  }
}'
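
A rough illustration of what these two settings mean, not ElasticSearch's actual routing code. `shard_for` is a hypothetical helper, and CRC32 is used only as a stand-in hash. Each document lands on exactly one shard, and every shard exists once as a primary plus once per replica.

```ruby
require 'zlib'

NUMBER_OF_SHARDS   = 3
NUMBER_OF_REPLICAS = 2

# Illustration only: pick a shard by hashing the document ID.
# CRC32 is a stand-in here, not the hash function ElasticSearch uses.
def shard_for(document_id, number_of_shards = NUMBER_OF_SHARDS)
  Zlib.crc32(document_id.to_s) % number_of_shards
end

# Each primary shard plus its replicas.
total_shard_copies = NUMBER_OF_SHARDS * (1 + NUMBER_OF_REPLICAS)

%w[1 2 abc123].each { |id| puts "document #{id} -> shard A#{shard_for(id) + 1}" }
puts "total shard copies: #{total_shard_copies}"   # 3 * (1 + 2) = 9
```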


SHARDS: improve indexing performance
REPLICAS: improve search performance

HTTP / JSON / Schema Free / Distributed / Queries / Facets / Mapping / Ruby
$ curl -X GET "http://localhost:9200/_search?q=<YOUR QUERY>"

Terms       apple
            apple iphone
Phrases     "apple iphone"
Proximity   "apple safari"~5
Fuzzy       apple~0.8
Wildcards   app*
            *pp*
Boosting    apple^10 safari
Range       [2011/05/01 TO 2011/05/31]
            [java TO json]
Boolean     apple AND NOT iphone
            +apple -iphone
            (apple OR iphone) AND NOT review
Fields      title:iphone^15 OR body:iphone
            published_on:[2011/05/01 TO "2011/05/27 10:00:00"]

http://lucene.apache.org/java/3_1_0/queryparsersyntax.html
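
Queries containing spaces, quotes or brackets need escaping before they go into the `q` parameter. A small sketch with Ruby's standard library; `search_url` is a hypothetical helper, and the host is the local default used throughout these slides.

```ruby
require 'uri'

# Build a _search URL for a Lucene query string, escaping spaces,
# quotes, brackets and other reserved characters.
def search_url(query, index = nil, host = "http://localhost:9200")
  path = [index, "_search"].compact.join("/")
  "#{host}/#{path}?" + URI.encode_www_form(q: query)
end

puts search_url('(apple OR iphone) AND NOT review')
puts search_url('title:iphone^15 OR body:iphone', 'articles')
```

The first call prints `http://localhost:9200/_search?q=%28apple+OR+iphone%29+AND+NOT+review`, which can be handed to curl as is.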


HTTP / JSON / Schema Free / Distributed / Queries / Facets / Mapping / Ruby
Query DSL

curl -X POST "http://localhost:9200/articles/_search?pretty=true" -d '
{
"query" : {
"terms" : {
"tags" : [ "ruby", "python" ],
"minimum_match" : 2
}
}
}'

http://www.elasticsearch.org/guide/reference/query-dsl/


Geo Search

curl -X POST "http://localhost:9200/venues/venue" -d '
{
  "name": "Pizzeria",
  "pin": {
    "location": {
      "lat": 50.071712,
      "lon": 14.386832
    }
  }
}'

Accepted formats for Geo:
  [lon, lat]      # Array
  "lat,lon"       # String
  drm3btev3e86    # Geohash

curl -X POST "http://localhost:9200/venues/_search?pretty=true" -d '
{
  "query" : {
    "filtered" : {
      "query" : { "query_string" : { "query" : "pizzeria" } },
      "filter" : {
        "geo_distance" : {
          "distance" : "0.5km",
          "pin.location" : { "lat" : 50.071481, "lon" : 14.387284 }
        }
      }
    }
  }
}'

http://www.elasticsearch.org/guide/reference/query-dsl/geo-distance-filter.html
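
To get a feeling for what the filter checks, here is a sketch of a plain haversine distance between the search centre and the indexed pizzeria. `distance_km` is a hypothetical helper; how ElasticSearch computes the distance internally may differ.

```ruby
EARTH_RADIUS_KM = 6371.0

# Great-circle distance between two lat/lon points (haversine formula).
def distance_km(lat1, lon1, lat2, lon2)
  to_rad = Math::PI / 180
  dlat = (lat2 - lat1) * to_rad
  dlon = (lon2 - lon1) * to_rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlon / 2)**2
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a))
end

pizzeria = { "lat" => 50.071712, "lon" => 14.386832 }  # the indexed venue
center   = { "lat" => 50.071481, "lon" => 14.387284 }  # the point in the filter

d = distance_km(center["lat"], center["lon"], pizzeria["lat"], pizzeria["lon"])
puts "distance: #{(d * 1000).round} m, within 0.5km: #{d <= 0.5}"
```

The two points are only a few dozen metres apart, so the venue passes the `0.5km` filter.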



Query

http://blog.linkedin.com/2009/12/14/linkedin-faceted-search/



curl -X POST "http://localhost:9200/articles/_search?pretty=true" -d '
{
  "query" : {
    "query_string" : { "query" : "title:T*" }
  },
  "filter" : {
    "terms" : { "tags" : ["ruby"] }
  },
  "facets" : {
    "tags" : {
      "terms" : {
        "field" : "tags",
        "size" : 10
      }
    }
  }
}'

query: user query / filter: “checkboxes” / facets: facets

# "facets" : {
# "tags" : {
# "terms" : [ {
# "term" : "ruby",
# "count" : 2
# }, {
# "term" : "python",
# "count" : 1
# }, {
# "term" : "java",
# "count" : 1
# } ]
# }
# }

http://www.elasticsearch.org/guide/reference/api/search/facets/index.html
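
Roughly what a terms facet reports, sketched in plain Ruby over an in-memory array. `terms_facet` and the sample documents are made up for this illustration; the real counting happens inside ElasticSearch.

```ruby
# Count how often each term occurs in a field across the matching
# documents, most frequent first.
def terms_facet(documents, field, size = 10)
  counts = Hash.new(0)
  documents.each { |doc| Array(doc[field]).each { |term| counts[term] += 1 } }
  counts.sort_by { |term, count| [-count, term] }
        .first(size)
        .map { |term, count| { "term" => term, "count" => count } }
end

# Hypothetical documents matching the user query
matching = [
  { "title" => "Two",   "tags" => ["ruby", "python"] },
  { "title" => "Three", "tags" => ["java"] },
  { "title" => "Ten",   "tags" => ["ruby"] }
]

p terms_facet(matching, "tags")
```

The result has the same shape as the `terms` array in the response above: one entry per tag with its count.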



curl -X POST "http://localhost:9200/articles/_search?pretty=true" -d '
{
  "facets" : {
    "published_on" : {
      "date_histogram" : {
        "field" : "published_on",
        "interval" : "day"
      }
    }
  }
}'


Geo Distance Facets

curl -X POST "http://localhost:9200/venues/_search?pretty=true" -d '
{
"query" : { "query_string" : { "query" : "pizzeria" } },
"facets" : {
"distance_count" : {
"geo_distance" : {
"pin.location" : {
"lat" : 50.071712,
"lon" : 14.386832
},
"ranges" : [
{ "to" : 1 },
{ "from" : 1, "to" : 5 },
{ "from" : 5, "to" : 10 }
]
}
}
}
}'

http://www.elasticsearch.org/guide/reference/api/search/facets/geo-distance-facet.html


curl -X PUT "http://localhost:9200/articles" -d '
{
  "mappings": {
    "article": {
      "properties": {
        "tags": {
          "type": "string",
          "analyzer": "keyword"
        },
        "content": {
          "type": "string",
          "analyzer": "snowball"
        },
        "title": {
          "type": "string",
          "analyzer": "snowball",
          "boost": 10.0
        }
      }
    }
  }
}'

curl -X GET 'http://localhost:9200/articles/_mapping?pretty=true'

http://www.elasticsearch.org/guide/reference/api/admin-indices-create-index.html

Remember?
def analyze content
  # >>> Split content by words into "tokens"
  content.split(/\W/).
  # >>> Downcase every word
  map { |word| word.downcase }.
  # ...
end


curl -X DELETE "http://localhost:9200/urls"
curl -X PUT "http://localhost:9200/urls" -d '
{
"settings" : {
"index" : {
"analysis" : {
"analyzer" : {
"url_analyzer" : {
"type" : "custom",
"tokenizer" : "lowercase",
"filter" : ["stop", "url_stop", "url_ngram"]
}
},
"filter" : {
"url_stop" : {
"type" : "stop",
"stopwords" : ["http", "https", "www"]
},
"url_ngram" : {
"type" : "nGram",
"min_gram" : 3,
"max_gram" : 5
}
}
}
}
}
}'



curl -X PUT localhost:9200/urls/url/_mapping -d '
{
"url": {
"properties": { "url": { "type": "string", "analyzer": "url_analyzer" } }
}
}'

curl -X POST localhost:9200/urls/url -d '{ "url" : "http://urlaubinkroatien.de" }'
curl -X POST localhost:9200/urls/url -d '{ "url" : "http://besteurlaubinkroatien.de" }'
curl -X POST localhost:9200/urls/url -d '{ "url" : "http://kroatien.de" }'
curl -X POST localhost:9200/urls/_refresh

curl "http://localhost:9200/urls/_search?pretty=true&q=url:kroatien"

curl "http://localhost:9200/urls/_search?pretty=true&q=url:urlaub"

curl "http://localhost:9200/urls/_search?pretty=true&q=url:(urlaub AND kroatien)"




K R O A T I E N

Trigrams:
K R O
R O A
O A T
A T I
T I E



Tire.index 'articles' do
  delete
  create

  store :title => 'One',   :tags => ['ruby'],           :published_on => '2011-01-01'
  store :title => 'Two',   :tags => ['ruby', 'python'], :published_on => '2011-01-02'
  store :title => 'Three', :tags => ['java'],           :published_on => '2011-01-02'
  store :title => 'Four',  :tags => ['ruby', 'php'],    :published_on => '2011-01-03'

  refresh
end

s = Tire.search 'articles' do
  query { string 'title:T*' }

  filter :terms, :tags => ['ruby']

  sort { title 'desc' }

  facet('global-tags')  { terms :tags, :global => true }

  facet('current-tags') { terms :tags }
end

http://github.com/karmi/tire



class Article < ActiveRecord::Base
include Tire::Model::Search
include Tire::Model::Callbacks
end

$ rake environment tire:import CLASS='Article'

Article.search do
query { string 'love' }
facet('timeline') { date :published_on, :interval => 'month' }
sort { published_on 'desc' }
end




class Article
include Whatever::ORM

include Tire::Model::Search
include Tire::Model::Callbacks
end

$ rake environment tire:import CLASS='Article'

Article.search do
query { string 'love' }
facet('timeline') { date :published_on, :interval => 'month' }
sort { published_on 'desc' }
end


Try ElasticSearch and Tire with a one-line command.

$ rails new tired -m "https://gist.github.com/raw/951343/tired.rb"

A “batteries included” installation.
Downloads and launches ElasticSearch.
Sets up a Rails application and launches it.
When you're tired of it, just delete the folder.
