About elasticsearch

Introducing ElasticSearch
Minsoo Jun

Agenda
What is ElasticSearch
ElasticSearch Composition
Understanding ElasticSearch Performance
RDB with ElasticSearch
End

What is ElasticSearch
• Lucene-based open source search engine.
• Inverted Index
• Fast full-text searches.
• Distributed & highly available search engine.
• RESTful search
• Real time search & Analytics
Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java.
It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

How does ElasticSearch work?
Compare With RDB
RDB                            ElasticSearch
Database                       Indices
Tables                         Types
Rows                           Documents
Columns                        Fields
Index                          Analyze
Primary key                    _id
Schema                         Mapping
Physical Partition             Shard
Logical Partition              Route
Relational                     Parent/Child, Nested
SQL                            Query DSL
B*Tree index (Default Index)   Inverted Index

Index (inverted index)
Row#   Name      Address           Color
1      minsoo    Tokyo nerima-ku   brown, blue
2      elastic   Saitama           red, brown
3      search    busan             blue, yellow
B*Tree
[Figure: B*Tree over the color values (blue, brown, red, yellow), with leaf entries pointing to row numbers]
Inverted
Term     Row 1   Row 2   Row 3
brown    ◉       ◉
blue     ◉               ◉
red              ◉
yellow                   ◉
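A minimal Python sketch of how the inverted index above could be built from the three sample rows. The names `rows` and `build_inverted_index` are illustrative only; Lucene stores its postings in a far more compact on-disk format.

```python
from collections import defaultdict

# Sample rows from the slide: row number -> values of the color column.
rows = {
    1: ["brown", "blue"],
    2: ["red", "brown"],
    3: ["blue", "yellow"],
}

def build_inverted_index(rows):
    """Map each term to the ascending list of row numbers that contain it."""
    index = defaultdict(list)
    for row_id in sorted(rows):
        for term in rows[row_id]:
            postings = index[term]
            # Skip the row if it was already recorded for this term.
            if not postings or postings[-1] != row_id:
                postings.append(row_id)
    return dict(index)

index = build_inverted_index(rows)
print(index["brown"])  # rows that contain "brown"
```

Looking up a term is then a single dictionary access, which is why term lookups stay fast as the number of rows grows.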

Composition
Physical composition
Cluster
  Node
    Index
      Shard
      Shard
      Shard
  (the same layout repeats on each of the three nodes)
Logical composition
Index
  Type
    Document
      field:value
      field:value
      field:value
  (the same layout repeats for each of the three types)

Nodes (* ElasticSearch 5.X)
Setting              Node type         Role
node.master: true    Master-eligible   Cluster-wide actions: creating or deleting an index, deciding shard allocation.
node.data: true      Data              Handles data-related operations like CRUD, search, and aggregations. These operations are I/O-, memory-, and CPU-intensive.
node.ingest: true    Ingest            Executes pre-processing pipelines.
tribe.*              Tribe             Client across multiple clusters.

Nodes Composition Example
[Figure: two clusters, Cluster A and Cluster B, built from master-eligible nodes (node.master: true), data nodes (node.data: true), and ingest nodes (node.ingest: true), joined by a tribe node (tribe.*). A node with every node.xxxx setting set to false acts as a coordinating node.]
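As a rough illustration of how the role settings combine, here is a small Python sketch. The function `node_roles` is hypothetical and only mimics the 5.x behavior, where node.master, node.data, and node.ingest all default to true; tribe nodes are left out.

```python
def node_roles(settings):
    """Return the roles implied by a dict of 5.x-style node.* settings."""
    roles = []
    # Each of these settings defaults to true when it is not specified.
    if settings.get("node.master", True):
        roles.append("master-eligible")
    if settings.get("node.data", True):
        roles.append("data")
    if settings.get("node.ingest", True):
        roles.append("ingest")
    # With every role switched off, the node only routes requests.
    return roles or ["coordinating"]

dedicated_master = node_roles({"node.data": False, "node.ingest": False})
coordinating_only = node_roles(
    {"node.master": False, "node.data": False, "node.ingest": False}
)
print(dedicated_master, coordinating_only)
```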

Shard replication
PUT /my_index/_settings
{
  "number_of_replicas": 1
}
PUT /my_index/_settings
{
  "number_of_replicas": 2
}
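To make the effect of number_of_replicas concrete, the sketch below builds the settings body and counts the resulting shard copies. The helper names are hypothetical, and the figure of three primary shards is an assumption for the example, not something stated on the slide.

```python
import json

def replica_settings_body(number_of_replicas):
    """Build the JSON body for an index settings update."""
    return json.dumps({"number_of_replicas": number_of_replicas})

def total_shard_copies(primaries, replicas):
    """Each primary shard has `replicas` extra copies of itself."""
    return primaries * (1 + replicas)

body = replica_settings_body(2)
print(body)

# Assumed example: an index with 3 primary shards.
print(total_shard_copies(3, 1))
print(total_shard_copies(3, 2))
```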

Creating, indexing and deleting a document
1. The client sends a create, index, or delete request to Node 1.
2. The node uses the document’s _id to determine that the document belongs to shard 0. It forwards the request to Node 3, where the primary copy of shard 0 is currently allocated.
3. Node 3 executes the request on the primary shard. If it is successful, it forwards the request in parallel to the replica shards on Node 1 and Node 2. Once all of the replica shards report success, Node 3 reports success to the coordinating node, which reports success to the client.
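Step 2 relies on a deterministic mapping from the document's _id to a shard, of the form shard = hash(routing) % number_of_primary_shards. The Python sketch below shows the idea only: `shard_for` is a hypothetical helper, and CRC32 is a stand-in for the hash function Elasticsearch uses internally, so the shard numbers it produces will not match a real cluster.

```python
import zlib

def shard_for(doc_id, number_of_primary_shards):
    """Pick a shard for a document from its _id (used as the routing value)."""
    # Stand-in hash: CRC32 is used here only because it is deterministic
    # and available in the standard library.
    routing_hash = zlib.crc32(doc_id.encode("utf-8"))
    return routing_hash % number_of_primary_shards

first = shard_for("doc-42", 3)
second = shard_for("doc-42", 3)
print(first, second)  # the same _id always maps to the same shard
```

Because the result depends on the number of primary shards, that number cannot be changed after index creation without re-routing every document.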

Retrieving a Document
1. The client sends a get request to Node 1.
2. The node uses the document’s _id to determine that the document belongs to shard 0. Copies of shard 0 exist on all three nodes. On this occasion, it forwards the request to Node 2.
3. Node 2 returns the document to Node 1, which returns the document to the client.
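Since any copy of shard 0 can serve the read, the receiving node is free to spread requests across the copies. One simple way to do that is round-robin, sketched below in Python. The `ShardCopies` class is hypothetical and assumes round-robin selection purely for illustration.

```python
from itertools import cycle

class ShardCopies:
    """Rotate through the nodes that hold a copy of one shard."""

    def __init__(self, nodes):
        self._nodes = cycle(nodes)

    def next_copy(self):
        # Each call returns the next node in the rotation.
        return next(self._nodes)

copies = ShardCopies(["Node 1", "Node 2", "Node 3"])
picked = [copies.next_copy() for _ in range(4)]
print(picked)
```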

Query Phase
1. The client sends a search request to Node 3, which creates an empty priority queue of size from + size.
2. Node 3 forwards the search request to a primary or replica copy of every shard in the index. Each shard executes the query locally and adds the results into a local sorted priority queue of size from + size.
3. Each shard returns the doc IDs and sort values of all the docs in its priority queue to the coordinating node, Node 3, which merges these values into its own priority queue to produce a globally sorted list of results.
GET /_search
{
  "from": 90,
  "size": 10
}
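The query phase can be sketched in Python as a scatter-gather over per-shard top lists. The function names and the (score, doc_id) tuples are assumptions made for this example; real shards return doc IDs with sort values rather than Python tuples.

```python
import heapq

def shard_top_hits(hits, frm, size):
    """Local top (from + size) hits of one shard, best score first."""
    return heapq.nlargest(frm + size, hits)

def merge_on_coordinator(per_shard_hits, frm, size):
    """Merge the shard results into a global top (from + size) list."""
    all_hits = [hit for hits in per_shard_hits for hit in hits]
    return heapq.nlargest(frm + size, all_hits)

# Two toy shards holding (score, doc_id) pairs.
shard_0 = [(0.9, "a"), (0.1, "b")]
shard_1 = [(0.8, "c"), (0.7, "d")]
frm, size = 1, 2

local = [shard_top_hits(s, frm, size) for s in (shard_0, shard_1)]
merged = merge_on_coordinator(local, frm, size)
print(merged)
```

With "from": 90 and "size": 10, every shard has to return 100 entries so the coordinating node can be sure it sees the true global top 100, which is why deep paging gets expensive.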

Fetch Phase
1. The coordinating node identifies which documents need to be fetched and issues a multi GET request to the relevant shards.
2. Each shard loads the documents and enriches them, if required, and then returns the documents to the coordinating node.
3. Once all documents have been fetched, the coordinating node returns the results to the client.
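Step 1 of the fetch phase can be sketched as follows: the coordinating node discards the first `from` hits of the globally sorted list, keeps the next `size`, and groups their IDs by shard so that each shard gets one multi-get. `plan_fetch` and the (shard_id, doc_id) pairs are hypothetical names for this example.

```python
from collections import defaultdict

def plan_fetch(sorted_hits, frm, size):
    """Group the doc IDs of the requested page by the shard that holds them."""
    page = sorted_hits[frm:frm + size]
    by_shard = defaultdict(list)
    for shard_id, doc_id in page:
        by_shard[shard_id].append(doc_id)
    return dict(by_shard)

# Globally sorted hits as (shard_id, doc_id), best first.
hits = [(0, "a"), (1, "c"), (1, "d"), (0, "b")]
plan = plan_fetch(hits, 1, 2)
print(plan)
```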

Composition & Shard tips
Shard design
  number_of_shards >= number_of_data_nodes
  number_of_replicas <= number_of_data_nodes - 1
Shard sizing
  Max number of shards per index: >= 200
  Max shard size: 20 ~ 50 GB
  Min shard size: ~ 3 GB
System settings
  File descriptors
    ulimit -n 65536
    permanently: /etc/security/limits.conf
  Virtual memory
    sysctl -w vm.max_map_count=262144
    permanently: /etc/sysctl.conf
  Disable swapping
    bootstrap.memory_lock: true
    in config/elasticsearch.yml
  Number of threads
    ulimit -u 2048
    permanently: /etc/security/limits.conf
  jvm.options
    ES_JAVA_OPTS="-Xms2g -Xmx2g"
    Max heap size should be no more than half of the OS memory
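The shard design and heap rules above are simple inequalities, so they are easy to check before creating an index. The following Python sketch does that; `check_cluster_plan` and its parameters are hypothetical names, and the thresholds are just the rules of thumb from this slide.

```python
def check_cluster_plan(shards, replicas, data_nodes, heap_gb, ram_gb):
    """Return a list of rule-of-thumb violations (empty if none)."""
    problems = []
    if shards < data_nodes:
        problems.append("number_of_shards is below number_of_data_nodes")
    if replicas > data_nodes - 1:
        problems.append("number_of_replicas exceeds number_of_data_nodes - 1")
    if heap_gb > ram_gb / 2:
        problems.append("JVM heap is more than half of the OS memory")
    return problems

ok = check_cluster_plan(shards=3, replicas=1, data_nodes=3, heap_gb=2, ram_gb=8)
bad = check_cluster_plan(shards=2, replicas=3, data_nodes=3, heap_gb=6, ram_gb=8)
print(ok)
print(bad)
```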

Understanding ElasticSearch Performance
Performance keys
Equipment perspective
  Network bandwidth?
  Disk I/O?
  RAM?
  CPU cores?
Document (data) perspective
  Document size?
  Total index data size?
  Data size increase?
  Store period?
Service perspective
  Analyzer?
  Analyze fields?
  Indexed field size?
  Boosting?
  Realtime or batch?
  Queries?
How to connect to RDB
Logstash
input {
  jdbc {
    jdbc_driver_library => "mysql-connector-java-5.1.36-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user => "mysql"
    parameters => { "favorite_artist" => "Beethoven" }
    # Cron-like schedule: run the statement every minute.
    schedule => "* * * * *"
    statement => "SELECT * from songs where artist = :favorite_artist"
  }
}
Timing
schedule => "* 5 * 1-3 *" runs every minute of 5 a.m., every day from January through March.
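The schedule string follows the usual five-field cron layout. The Python sketch below only labels the fields to make an expression easier to read; `describe_schedule` is a hypothetical helper, and it deliberately ignores any extended syntax (such as extra fields) that the scheduler may also accept.

```python
FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")

def describe_schedule(expression):
    """Label the five whitespace-separated fields of a cron expression."""
    parts = expression.split()
    if len(parts) != len(FIELDS):
        raise ValueError("expected 5 fields, got %d" % len(parts))
    return dict(zip(FIELDS, parts))

timing = describe_schedule("* 5 * 1-3 *")
print(timing["hour"], timing["month"])
```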

Analysis
Analysis & Analyzer
Input: "The QUICK brown foxes jumped over the lazy dog!"
Analysis: [quick, brown, fox, jump, over, lazy, dog]
Analyzer components
  Character filters: [٠١٢٣٤٥٦٧٨٩] → [0123456789]
  Tokenizer (n-gram): [qu, ui, ic, ck]
  Token filter: [QU, ui, ic]
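A toy Python version of the three analyzer stages may help to see how they chain together. All function names here are illustrative, and the choice of a lowercase token filter is an assumption for the example; a real analyzer is configured in the index settings rather than written by hand.

```python
# Character filter table: Arabic-Indic digits (U+0660..U+0669) -> ASCII digits.
DIGIT_MAP = {0x0660 + i: str(i) for i in range(10)}

def char_filter(text):
    """Normalize characters before tokenizing."""
    return text.translate(DIGIT_MAP)

def ngram_tokenizer(text, n=2):
    """Split the text into overlapping n-grams of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]

def analyze(text):
    # Order matters: character filters, then tokenizer, then token filters.
    return lowercase_filter(ngram_tokenizer(char_filter(text)))

tokens = analyze("QUICK")
print(tokens)
```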

Analysis
Analyzer & Plugin for Japanese
Tokenizer
  Standard Tokenizer: divides text into terms on word boundaries.
  NGram Tokenizer: can break up text into words when it encounters any of a list of specified characters, then returns n-grams of each word.
  Keyword Tokenizer: a “noop” tokenizer that accepts whatever text it is given and outputs the exact same text as a single term.
  Pattern Tokenizer: uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms.
Plugin
  Kuromoji Plugin: the Japanese (kuromoji) Analysis plugin integrates the Lucene kuromoji analysis module into Elasticsearch.
  Kuromoji analyzer: kuromoji_tokenizer
  Kuromoji token filter: kuromoji_baseform, kuromoji_part_of_speech, cjk_width, ja_stop, kuromoji_stemmer, lowercase
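Two of the tokenizers above are simple enough to imitate in a few lines of Python. This is only a sketch of their behavior with hypothetical function names; the `\W+` separator is used as the split pattern for the example, and the standard and kuromoji tokenizers are too involved to reproduce here.

```python
import re

def keyword_tokenizer(text):
    """Emit the whole input as a single term."""
    return [text]

def pattern_tokenizer(text, pattern=r"\W+"):
    """Split the text wherever the regular expression matches."""
    return [term for term in re.split(pattern, text) if term]

single = keyword_tokenizer("New York")
terms = pattern_tokenizer("foo-bar, baz")
print(single, terms)
```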

END
{
  "name": "minsoo.jun",
  "email": "minsoo.jun@rakuten.com",
  "department": "TRVDD",
  "group": "Search Platform",
  "language": ["java", "ansible", "SQL", "korean"],
  "database": ["oracle", "elasticsearch", "mongodb"]
}
