2. It’s all about Search
• How does search work?
• ElasticSearch
• Tire
Wednesday, February 6, 13 2
3. How does search work?
A collection of articles
• Article.find(1).to_json
{ title: "One", content: "The ruby is a pink to blood-red colored gemstone." }
• Article.find(2).to_json
{ title: "Two", content: "Ruby is a dynamic, reflective, general-purpose object-oriented programming language." }
• Article.find(3).to_json
{ title: "Three", content: "Ruby is a song by English rock band." }
4. How does search work?
How do you search?
Article.where("content LIKE ?", "%ruby%")
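To see why this approach does not scale, here is a minimal sketch of what a LIKE '%ruby%' query roughly has to do. The ARTICLES hash and the like_scan helper are hypothetical stand-ins, not part of Rails or of the talk; whether LIKE ignores case depends on the database, so the case-insensitive match is an assumption.

```ruby
# Hypothetical in-memory stand-in for the three articles shown earlier.
ARTICLES = {
  1 => "The ruby is a pink to blood-red colored gemstone.",
  2 => "Ruby is a dynamic, reflective, general-purpose object-oriented programming language.",
  3 => "Ruby is a song by English rock band."
}

# Roughly what LIKE '%term%' does: visit every row and test the whole
# text for a substring, with no index to narrow the candidates down.
# Assumption: the match is case-insensitive.
def like_scan(articles, term)
  needle = term.downcase
  articles.select { |_id, content| content.downcase.include?(needle) }.keys
end

p like_scan(ARTICLES, "ruby")  # => [1, 2, 3]
p like_scan(ARTICLES, "song")  # => [3]
```

Every search reads every article, so the cost grows with the total amount of text rather than with the number of matches.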
5. How does search work?
The inverted index
T0 = “it is what it is”
T1 = “what is it”
T2 = “it is a banana”
“a”: {2}
“banana”: {2}
“is”: {0, 1, 2}
“it”: {0, 1, 2}
“what”: {0, 1}
A term search for the terms “what”, “is” and “it”
{0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}
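The same example as a small Ruby sketch: build the inverted index for the three texts, then intersect the lists for each search term. The names build_inverted_index and term_search are made up for this illustration.

```ruby
TEXTS = ["it is what it is", "what is it", "it is a banana"]

# Map each word to the list of text numbers that contain it.
def build_inverted_index(texts)
  index = Hash.new { |hash, word| hash[word] = [] }
  texts.each_with_index do |text, number|
    text.split.uniq.each { |word| index[word] << number }
  end
  index
end

# AND search: intersect the lists of all terms.
# Unknown terms contribute an empty list, so the result is empty.
def term_search(index, *terms)
  terms.map { |term| index.fetch(term, []) }.reduce { |a, b| a & b } || []
end

index = build_inverted_index(TEXTS)
p index["is"]                             # => [0, 1, 2]
p term_search(index, "what", "is", "it")  # => [0, 1]
```

The lookup touches only the lists of the searched terms, not the texts themselves.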
6. How does search work?
The inverted index
TOKEN         ARTICLES
ruby          article_1  article_2  article_3
pink          article_1
gemstone      article_1
dynamic       article_2
reflective    article_2
programming   article_2
song          article_3
english       article_3
rock          article_3
7. How does search work?
The inverted index
Article.search("ruby")
ruby          article_1  article_2  article_3    <= match
pink          article_1
gemstone      article_1
dynamic       article_2
reflective    article_2
programming   article_2
song          article_3
english       article_3
rock          article_3
8. How does search work?
The inverted index
Article.search("song")
ruby          article_1  article_2  article_3
pink          article_1
gemstone      article_1
dynamic       article_2
reflective    article_2
programming   article_2
song          article_3                          <= match
english       article_3
rock          article_3
9.
module SimpleSearch
  def index document, content
    tokens = analyze content
    store document, tokens
    puts "Indexed document #{document} with tokens:", tokens.inspect, "\n"
  end

  def analyze content
    # Split content by words into "tokens"
    content.split(/\W/).
    # Downcase every word
    map { |word| word.downcase }.
    # Reject stop words, digits and whitespace
    reject { |word| STOPWORDS.include?(word) || word =~ /^\d+/ || word == '' }
  end

  def store document_id, tokens
    # Append the document to the list of every token it contains
    tokens.each do |token|
      ((INDEX[token] ||= []) << document_id).uniq!
    end
  end

  def search token
    puts "Results for token '#{token}':"
    # An unknown token has no entry in the index, so fall back to an empty list
    (INDEX[token] || []).each { |document| puts "  * #{document}" }
  end

  INDEX = {}
  STOPWORDS = %w(a an and are as at but by for if in is it no not of on or that the then there)

  extend self
end
10. How does search work?
Indexing documents
SimpleSearch.index "article1", "Ruby is a language. Java is also a language."
SimpleSearch.index "article2", "Ruby is a song."
SimpleSearch.index "article3", "Ruby is a stone."
SimpleSearch.index "article4", "Java is a language."
11. How does search work?
Indexing documents
SimpleSearch.index "article1", "Ruby is a language. Java is also a language."
SimpleSearch.index "article2", "Ruby is a song."
SimpleSearch.index "article3", "Ruby is a stone."
SimpleSearch.index "article4", "Java is a language."
Indexed document article1 with tokens:
["ruby", "language", "java", "also", "language"]
Indexed document article2 with tokens:
["ruby", "song"]
Indexed document article3 with tokens:
["ruby", "stone"]
Indexed document article4 with tokens:
["java", "language"]
12. How does search work?
Index
print SimpleSearch::INDEX
{
  "ruby"     => ["article1", "article2", "article3"],
  "language" => ["article1", "article4"],
  "java"     => ["article1", "article4"],
  "also"     => ["article1"],
  "song"     => ["article2"],
  "stone"    => ["article3"]
}
13. How does search work?
Search the index
SimpleSearch.search "ruby"
Results for token 'ruby':
* article1
* article2
* article3
14. How does search work?
Search is ...
Inverted Index
{ "ruby": [1,2,3], "language": [1,4] }
+
Relevance Scoring
• How many matching terms does this document contain?
• How frequently does each term appear in all your documents?
• ... other complicated algorithms.
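The first two questions can be sketched as a textbook TF-IDF score. This is only an illustration under stated assumptions: it is not the exact formula Lucene or ElasticSearch uses, and the names tf, idf and rank are invented here. The token lists are the ones produced for the four example articles.

```ruby
# Token lists as produced by SimpleSearch.analyze for the four example articles.
DOCUMENTS = {
  "article1" => %w(ruby language java also language),
  "article2" => %w(ruby song),
  "article3" => %w(ruby stone),
  "article4" => %w(java language)
}

# Term frequency: how often the term occurs in one document.
def tf(tokens, term)
  tokens.count(term)
end

# Inverse document frequency: terms found in fewer documents weigh more.
def idf(documents, term)
  containing = documents.values.count { |tokens| tokens.include?(term) }
  return 0.0 if containing.zero?
  Math.log(documents.size.to_f / containing)
end

# Score every document against the query terms, drop non-matches,
# and sort the rest with the best score first.
def rank(documents, terms)
  scores = documents.map do |id, tokens|
    [id, terms.inject(0.0) { |sum, term| sum + tf(tokens, term) * idf(documents, term) }]
  end
  scores.select { |_id, score| score > 0 }.sort_by { |_id, score| -score }
end

rank(DOCUMENTS, %w(ruby language)).each do |id, score|
  puts format("%-9s %.3f", id, score)
end
```

For the query "ruby language", article1 comes first because it contains "language" twice, and article4 beats the two articles that only match "ruby", since "ruby" appears in three of the four documents and therefore carries little weight.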
15. ElasticSearch
ElasticSearch is an Open Source (Apache 2),
Distributed, RESTful, Search Engine built on
top of Apache Lucene.
http://github.com/elasticsearch/elasticsearch
16. ElasticSearch
Terminology
Relational DB    ElasticSearch
Database         Index
Table            Type
Row              Document
Column           Field
Schema           Mapping
Index            Everything is indexed
SQL              Query DSL
19. ElasticSearch
Distributed
The discovery module is responsible for discovering nodes within a
cluster, as well as electing a master node.
The responsibility of the master node is to maintain the global cluster
state and to react when nodes join or leave the cluster by
reassigning shards.
[Diagram: Node 1, Node 2, Node 3 and Node 4 find each other through the automatic discovery protocol; one node is the master.]
20. ElasticSearch
Distributed
By default, every index is split into 5 shards, and each shard is duplicated in 1 replica.
Index A
A1 A2 A3 A4 A5 Shards
A1’ A2’ A3’ A4’ A5’ Replicas
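A toy sketch of how the 5 shards of Index A and their replicas could be spread over a cluster. This is not ElasticSearch's real allocation algorithm; the node names and the round-robin rule are assumptions made for illustration. The one property it does mirror is that a replica is kept on a different node than its primary shard.

```ruby
NODES = %w(node1 node2 node3 node4)  # hypothetical node names

# Toy layout: primary shard i goes to one node, its replica to the next
# node in the list, so a primary and its replica never share a node.
# Requires at least two nodes.
def allocate(index_name, nodes, shards = 5)
  raise ArgumentError, "need at least two nodes" if nodes.size < 2
  (1..shards).map do |number|
    {
      :shard   => "#{index_name}#{number}",
      :primary => nodes[(number - 1) % nodes.size],
      :replica => nodes[number % nodes.size]
    }
  end
end

allocate("A", NODES).each do |entry|
  puts "#{entry[:shard]}: primary on #{entry[:primary]}, replica on #{entry[:replica]}"
end
```

With this layout, losing any single node still leaves one copy of every shard in the cluster.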
21. ElasticSearch
Query DSL
Queries            Filters
- query_string     - term
- term             - query
- wildcard         - range
- boosting         - bool
- bool             - and
- filtered         - or
- fuzzy            - not
- range            - limit
- geo_shape        - match_all
- ...              - ...
22. ElasticSearch
Query DSL
Queries            Filters
(with relevance,   (without relevance,
 without cache)     with cache)
- query_string     - term
- term             - query
- wildcard         - range
- boosting         - bool
- bool             - and
- filtered         - or
- fuzzy            - not
- range            - limit
- geo_shape        - match_all
- ...              - ...
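A minimal sketch of how a query and a filter combine in one request, using the filtered query from the list above as it existed in ElasticSearch versions of that time. The request body is built as a plain Ruby hash; the author field and the value "nemo" are assumptions chosen for illustration.

```ruby
require 'json'

# Hypothetical search: full-text match on "ruby", restricted to one author.
search = {
  :query => {
    :filtered => {
      # Query part: matching documents are scored by relevance
      :query  => { :query_string => { :query => "ruby" } },
      # Filter part: plain yes/no match, cacheable, no scoring
      :filter => { :term => { :author => "nemo" } }
    }
  }
}

puts JSON.pretty_generate(search)
```

The printed JSON is what would be sent as the body of a search request. Conditions that only include or exclude documents belong in the filter, where they can be cached; only the part that should influence ranking needs to be a query.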
26. ElasticSearch
Analyzer
# Assumes a custom "trigrams" analyzer is defined in the index settings
curl -XPUT 'http://localhost:9200/articles/article/_mapping' -d '
{
  "article": {
    "properties": { "title": { "type": "string", "analyzer": "trigrams" } }
  }
}'
curl -XPOST 'http://localhost:9200/articles/article' -d '{ "title": "cupertino" }'
C u p e r t i n o
C u p
  u p e
    p e r
      . . .
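What a trigram analyzer does to a word can be sketched in a few lines of Ruby. The ngrams helper is made up for this illustration; the real work happens inside ElasticSearch's analysis chain, and the lowercasing step is an assumption.

```ruby
# Split a word into overlapping pieces of `size` characters (trigrams by default).
# Assumption: the word is lowercased first.
def ngrams(word, size = 3)
  word.downcase.chars.each_cons(size).map(&:join)
end

p ngrams("Cupertino")
# => ["cup", "upe", "per", "ert", "rti", "tin", "ino"]
```

Because every three-letter window becomes its own token, a partial input such as "pert" can still match the indexed title.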
27. Tire
A rich Ruby API and DSL for the
ElasticSearch search engine.
http://github.com/karmi/tire/
28. Tire
ActiveRecord Integration
# New Rails application
$ rails new searchapp -m https://raw.github.com/karmi/tire/master/examples/rails-application-template.rb

# Callbacks
class Article < ActiveRecord::Base
  include Tire::Model::Search
  include Tire::Model::Callbacks
end

# Create an article
Article.create :title        => "I Love Elasticsearch",
               :content      => "...",
               :author       => "Captain Nemo",
               :published_on => Time.now

# Search
Article.search do
  query { string 'love' }
  facet('timeline') { date :published_on, :interval => 'month' }
  sort { by :published_on, 'desc' }
end
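For orientation, the search block above corresponds roughly to the following request body. This is a hand-written sketch built as a plain Ruby hash, not output captured from Tire, so treat the exact shape as an assumption.

```ruby
require 'json'

# Approximate JSON equivalent of the Tire search block:
#   query { string 'love' }                          -> query_string query
#   facet('timeline') { date :published_on, ... }    -> date_histogram facet
#   sort { by :published_on, 'desc' }                -> sort clause
request_body = {
  :query  => { :query_string => { :query => "love" } },
  :facets => {
    :timeline => {
      :date_histogram => { :field => "published_on", :interval => "month" }
    }
  },
  :sort   => [ { :published_on => "desc" } ]
}

puts JSON.pretty_generate(request_body)
```

The Ruby DSL saves writing this nested structure by hand, but each DSL call still maps onto one section of the ElasticSearch request.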
29. Tire
ActiveRecord Integration
class Article < ActiveRecord::Base
  include Tire::Model::Search
  include Tire::Model::Callbacks

  # Settings
  settings :number_of_shards   => 3,
           :number_of_replicas => 2,
           :analysis => {
             :analyzer => {
               :url_analyzer => {
                 'tokenizer' => 'lowercase',
                 # 'url_ngram' is a custom filter; its definition is not shown here
                 'filter'    => ['stop', 'url_ngram']
               }
             }
           }

  # Mapping
  mapping do
    indexes :title,   :index    => :not_analyzed, :boost => 100
    indexes :content, :analyzer => 'snowball'
  end
end