2. About
● A Father, A Husband, A Software Engineer
● Founder of Vietnamese Elasticsearch Community
● Author of Vietnamese Elasticsearch Analysis Plugin
● Technical Consultant at Sentifi AG
● Co-Founder at Krom
● Follow me @duydo
6. Elasticsearch is a distributed
search and analytics engine,
designed for horizontal
scalability with easy
management.
7. Basic Terms
● Cluster is a collection of nodes.
● Node is a single server, part of a
cluster.
● Index is a collection of shards ~
database.
● Shard is a collection of
documents.
● Type is a category/partition of an
index ~ table in database.
● Document is a JSON object ~
record in database.
21. Index a Document with ID
PUT /employees/employee/1
{
  "name": "Duy Do",
  "email": "duy.do@sentifi.com",
  "dob": "1984-06-20",
  "country": "VN",
  "gender": "male",
  "salary": 100.0
}
22. Index a Document without ID
POST /employees/employee/
{
  "name": "Duy Do",
  "email": "duy.do@sentifi.com",
  "dob": "1984-06-20",
  "country": "VN",
  "gender": "male",
  "salary": 100.0
}
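The difference between the two requests above can be sketched in Python. This is only an illustration that builds the method, path, and body; the helper name `index_request` is hypothetical and nothing is sent to a cluster.

```python
import json


def index_request(index, doc_type, document, doc_id=None):
    """Return (method, path, body) for indexing one document.

    With an ID, the client picks the ID and uses PUT.
    Without one, POST lets Elasticsearch generate the ID.
    """
    body = json.dumps(document)
    if doc_id is not None:
        return "PUT", f"/{index}/{doc_type}/{doc_id}", body
    return "POST", f"/{index}/{doc_type}/", body


employee = {
    "name": "Duy Do",
    "email": "duy.do@sentifi.com",
    "dob": "1984-06-20",
    "country": "VN",
    "gender": "male",
    "salary": 100.0,
}

with_id = index_request("employees", "employee", employee, doc_id=1)
without_id = index_request("employees", "employee", employee)
print(with_id[0], with_id[1])
print(without_id[0], without_id[1])
```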
31. Working with Null Values
GET /employees/employee/_search
{
  "query": {
    "filtered": {
      "filter": {
        "exists": {"field": "email"}
      }
    }
  }
}
SELECT * FROM employee WHERE email IS NOT NULL;
32. Working with Null Values
GET /employees/employee/_search
{
  "query": {
    "filtered": {
      "filter": {
        "missing": {"field": "email"}
      }
    }
  }
}
SELECT * FROM employee WHERE email IS NULL;
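What exists and missing select can be approximated in memory. The sketch below is plain Python, not Elasticsearch code; the sample documents and the helper `has_value` are made up for illustration. It assumes a field counts as present when it holds at least one non-null value.

```python
def has_value(doc, field):
    """True if the field holds at least one non-null value."""
    value = doc.get(field)
    if isinstance(value, list):
        # An empty array or an array of only nulls counts as missing.
        return any(item is not None for item in value)
    return value is not None


# Hypothetical sample documents.
docs = [
    {"name": "A", "email": "a@example.com"},
    {"name": "B", "email": None},
    {"name": "C"},
]

exists = [d["name"] for d in docs if has_value(d, "email")]
missing = [d["name"] for d in docs if not has_value(d, "email")]
print(exists, missing)
```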
40. Boosting Query Clauses
GET /employees/employee/_search
{
  "query": {
    "bool": {
      "must": [{"term": {"gender": "female"}}],  # 1: default boost of 1
      "should": [
        {"term": {"country": {"value": "VN", "boost": 3}}},  # 2: the most important clause
        {"term": {"country": {"value": "US", "boost": 2}}}   # 3: more important than #1, but not as important as #2
      ]
    }
  }
}
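A boosted bool query can also be assembled programmatically. The sketch below only builds the request body as a Python dict; `boosted_term` is a hypothetical helper, and it assumes the term query takes its term under `value` next to `boost`.

```python
import json


def boosted_term(field, value, boost):
    """Build a term clause that carries an explicit boost."""
    return {"term": {field: {"value": value, "boost": boost}}}


query = {
    "query": {
        "bool": {
            # Required clause, default boost of 1.
            "must": [{"term": {"gender": "female"}}],
            # Optional clauses: a match on VN raises the score more than US.
            "should": [
                boosted_term("country", "VN", 3),
                boosted_term("country", "US", 2),
            ],
        }
    }
}
print(json.dumps(query, indent=2))
```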
43. Aggregations
Analyze & Summarize
● How many needles in the
haystack?
● What is the average length of
the needles?
● What is the median length of
the needles, broken down by
manufacturer?
● How many needles are added
to the haystacks each month?
● What are the most popular
needle manufacturers?
● ...
44. Buckets & Metrics
SELECT COUNT(country) # a metric
FROM employee
GROUP BY country # a bucket
GET /employees/employee/_search
{
  "aggs": {
    "by_country": {
      "terms": {"field": "country"}
    }
  }
}
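To make the bucket idea concrete, here is a rough in-memory equivalent of the by_country terms aggregation. It is plain Python over made-up sample data, not a call to Elasticsearch; the output mimics the key/doc_count shape of terms buckets.

```python
from collections import Counter

# Hypothetical sample documents.
employees = [
    {"name": "A", "country": "VN"},
    {"name": "B", "country": "US"},
    {"name": "C", "country": "VN"},
]

# Each distinct country is a bucket; the count is the metric.
by_country = Counter(e["country"] for e in employees)
buckets = [
    {"key": country, "doc_count": count}
    for country, count in by_country.most_common()  # largest bucket first
]
print(buckets)
```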
45. Bucket is a collection of
documents that meet certain
criteria.
46. Metric is a simple mathematical
operation computed over a bucket,
such as min, max, sum, or avg (mean).
47. Combination
Buckets & Metrics
● Partition employees by
country (bucket)
● Then partition each country
bucket by gender (bucket)
● Finally, calculate the average
salary for each gender bucket
(metric)
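The three steps above map onto nested aggregations. A minimal sketch of the request body, built as a Python dict: the aggregation names `by_gender` and `avg_salary` are assumptions, and `"size": 0` is added only to skip the search hits.

```python
import json

request_body = {
    "size": 0,  # only aggregation results are wanted
    "aggs": {
        "by_country": {  # bucket: one per country
            "terms": {"field": "country"},
            "aggs": {
                "by_gender": {  # bucket: one per gender inside each country
                    "terms": {"field": "gender"},
                    "aggs": {
                        # metric: average salary inside each gender bucket
                        "avg_salary": {"avg": {"field": "salary"}}
                    },
                }
            },
        }
    },
}
print(json.dumps(request_body, indent=2))
```

The body would be sent to the same /employees/employee/_search endpoint used on the earlier slides.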
51. Indexing
● Use bulk indexing APIs.
● Tune your bulk size, e.g. 5-10 MB
per request.
● Partition your time series data
by time period (monthly, weekly,
daily).
● Use aliases for your indices.
● Turn off refresh and replicas while
indexing; turn them back on once it's done.
● Multiple shards for parallel
indexing.
● Multiple replicas for parallel
reading.
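The first two tips can be sketched as a generator that packs documents into newline-delimited bulk bodies of bounded size. This is an illustration under assumptions: `bulk_chunks` is a hypothetical helper, nothing is sent over the network, and the tiny `max_bytes` in the demo is only there to force several chunks.

```python
import json


def bulk_chunks(index, doc_type, documents, max_bytes=5 * 1024 * 1024):
    """Yield bulk request bodies of roughly max_bytes each.

    Every document becomes an action line plus a source line, each
    newline-terminated. A single pair larger than max_bytes is still
    yielded, alone in its own chunk.
    """
    action = json.dumps({"index": {"_index": index, "_type": doc_type}})
    lines, size = [], 0
    for doc in documents:
        pair = action + "\n" + json.dumps(doc) + "\n"
        pair_size = len(pair.encode("utf-8"))
        if lines and size + pair_size > max_bytes:
            yield "".join(lines)
            lines, size = [], 0
        lines.append(pair)
        size += pair_size
    if lines:
        yield "".join(lines)


# Hypothetical sample documents.
docs = [{"name": f"employee-{i}", "country": "VN"} for i in range(5)]
chunks = list(bulk_chunks("employees", "employee", docs, max_bytes=200))
print(len(chunks), "bulk requests")
```

Each yielded string would be the body of one POST to the _bulk endpoint.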
52. Mapping
● Disable the _all field.
● Keep the _source field; do not store
individual fields.
● Use not_analyzed if possible.
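A mapping that follows these tips might look like the sketch below, written as a Python dict for an index-creation body. It assumes the type-based mapping syntax used elsewhere in these slides (string fields with "index": "not_analyzed"); which fields stay analyzed is a guess made for illustration.

```python
import json

mapping = {
    "mappings": {
        "employee": {
            "_all": {"enabled": False},  # tip 1: disable _all
            # _source is left at its default (enabled) and no field
            # sets "store", so nothing is stored individually (tip 2).
            "properties": {
                "name": {"type": "string"},  # analyzed for full-text search
                "email": {"type": "string", "index": "not_analyzed"},
                "country": {"type": "string", "index": "not_analyzed"},
                "gender": {"type": "string", "index": "not_analyzed"},
                "dob": {"type": "date"},
                "salary": {"type": "double"},
            },
        }
    }
}
print(json.dumps(mapping, indent=2))
```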
53. Query
● Use filters instead of queries
if possible.
● Consider the order and scope of
your filters.
● Do not use the query string query.
● Do not load too many results
with a single query; use the scroll
API instead.
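The scroll pattern from the last tip can be sketched as a loop that keeps following the scroll ID until a page comes back empty. The two callables are hypothetical stand-ins for real HTTP calls, and the fake pages below exist only to make the example runnable.

```python
def scroll_all(start_search, continue_scroll):
    """Yield every hit, one page at a time.

    start_search() and continue_scroll(scroll_id) are assumed to return a
    parsed response shaped like
    {"_scroll_id": "...", "hits": {"hits": [...]}}.
    """
    response = start_search()
    while response["hits"]["hits"]:
        for hit in response["hits"]["hits"]:
            yield hit
        # Always use the most recent scroll ID for the next page.
        response = continue_scroll(response["_scroll_id"])


# Fake transport: three pages, the last one empty.
pages = {
    None: {"_scroll_id": "s1", "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}},
    "s1": {"_scroll_id": "s2", "hits": {"hits": [{"_id": "3"}]}},
    "s2": {"_scroll_id": "s3", "hits": {"hits": []}},
}

ids = [
    hit["_id"]
    for hit in scroll_all(lambda: pages[None], lambda sid: pages[sid])
]
print(ids)
```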